Machine Learning: 101

Machine learning is the branch of science that deals with a machine’s ability to learn from existing data, perform some task based on that learning, and use performance indicators to measure how accurately the task is being performed.

For example, consider playing chess. The experience of playing many games of chess is the learning part, the task is playing chess itself, and the probability that the program will win the next game is the performance indicator.

In general, any machine learning problem can be assigned to one of two broad classifications: Supervised learning and Unsupervised learning.

Supervised Learning

Supervised learning is a prominent branch of machine learning in which the label or outcome of the target variable is known in advance. There are two key types of supervised learning: regression and classification. Regression pertains to scenarios where the target variable is continuous, while classification deals with cases where the target variable takes categorical or discrete values.

In regression, the target variable is continuous. An example is predicting house prices from features such as area, number of rooms, and the presence of a lawn or pool. In classification, on the other hand, the target variable takes the form of categories or discrete values. An example is predicting whether an individual will develop diabetes in the future, based on factors such as blood pressure, glucose levels, and insulin levels.
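To make the distinction concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the features and numbers are made up purely for illustration.

```python
# A minimal sketch of the two supervised settings, assuming scikit-learn.
# The tiny hand-made datasets below are illustrative only.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: continuous target (a house price) from features such as
# area (sq. ft.) and number of rooms.
X_houses = [[1200, 3], [1500, 4], [900, 2], [2000, 5]]
y_prices = [240_000, 310_000, 180_000, 420_000]
reg = LinearRegression().fit(X_houses, y_prices)
print(reg.predict([[1300, 3]]))   # a continuous price estimate

# Classification: discrete target (diabetic = 1, not diabetic = 0) from
# features such as glucose level and blood pressure.
X_patients = [[85, 70], [160, 90], [110, 75], [190, 95]]
y_diabetic = [0, 1, 0, 1]
clf = LogisticRegression().fit(X_patients, y_diabetic)
print(clf.predict([[150, 88]]))   # a discrete class label (0 or 1)
```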

These examples show that in both regression and classification, the target variable to be predicted is known in advance. Knowing the target variable makes it possible to formulate and apply appropriate mathematical models for prediction and analysis, and it highlights regression and classification as distinct paradigms. This clarity establishes a solid foundation for effective predictive modeling across a wide range of applications.

Why do we split the data into test and train data while building a supervised learning model?

The goal of machine learning is to predict well on new data drawn from a (hidden) true probability distribution. Unfortunately, the model we are building at present can't see the whole truth; the model can only sample from an available dataset. If a model fits the current examples well, how can we trust the model to make good predictions on never-before-seen examples?

One way is to divide your data set into two subsets:

  1. Training set: a subset to train a model.

  2. Test set: a subset to test the model.

Separating the data enables you to evaluate your model’s generalization capability and get an idea of how it would perform on unseen data (see the code sketch below). Good performance on the test set is a useful indicator of good performance on new data in general, assuming that:

  • The test set samples are drawn independently and at random from the same distribution.

  • The test set is large enough.
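
Here is a minimal sketch of such a split, assuming scikit-learn; the synthetic regression dataset stands in for whatever data you actually have.

```python
# Train/test split: fit on one subset, evaluate generalization on the other.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=1_000, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the data; shuffling (the default) gives a random draw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# The test-set score is the estimate of performance on unseen data.
print("train R^2:", r2_score(y_train, model.predict(X_train)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```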


When and how do you bring in test data?

Initially, the dataset provided for analysis is split into train and test sets. The test set should be representative of the population on which the model is going to make predictions.
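
When the target is categorical, one common way to keep the test set representative is a stratified split, which preserves the class proportions in both subsets. Below is a minimal sketch, assuming scikit-learn and a synthetic, imbalanced dataset.

```python
# Stratified split: both subsets keep the class balance of the full dataset.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 90% class 0, 10% class 1).
X, y = make_classification(
    n_samples=1_000, n_features=8, weights=[0.9, 0.1], random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both subsets keep roughly the 90/10 class proportions.
print("train classes:", Counter(y_train))
print("test  classes:", Counter(y_test))
```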

Unsupervised Learning

Unsupervised learning builds a mathematical model from data that contains only inputs and no desired outputs.

• Used to find structure in the data, such as grouping or clustering of data points, in order to discover patterns and group the inputs into categories.

• Example: an advertising platform segments the population into smaller groups with similar demographics and purchasing habits, helping advertisers reach their target market with relevant ads.

Since no labels are provided, there is no specific way to compare model performance in most unsupervised learning methods.
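
As an illustration, here is a minimal clustering sketch, assuming scikit-learn; synthetic blobs stand in for customer features, and the silhouette score is one internal measure often used precisely because no true labels exist.

```python
# Clustering for segmentation: group inputs without any labels.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for customer features such as age and monthly spend.
X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)

# Each point gets a segment id; advertisers could target each segment separately.
print("segment sizes:", [int((segments == k).sum()) for k in range(4)])
# With no ground-truth labels, an internal metric like the silhouette score
# is one way to judge how well-separated the segments are.
print("silhouette score:", silhouette_score(X, segments))
```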
