
Intro To Machine Learning (ML)

Afeef Ansari

Updated: Jan 3, 2021



ML systems learn how to combine input features to produce useful predictions on never-before-seen data. A label is the thing we're predicting, typically represented by the variable y (the y variable in simple linear regression). The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything. A feature is an input variable describing our data, typically represented by the variables {x1, x2, ..., xn} (the x variable in simple linear regression). A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features. In the spam detector example, the features could include the following: the words in the email text, the sender's address, the time of day the email was sent, and whether the email contains the phrase "one weird trick."

An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories: labeled examples and unlabeled examples. A labeled example includes both the feature(s) and the label, written {features, label}: (x, y). An unlabeled example contains the feature(s) but not the label, written {features, ?}: (x, ?).
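To make the notation concrete, here is a minimal Python sketch of one labeled and one unlabeled spam-detector example; the feature names and values are invented for illustration.

    # A labeled example pairs features with a label: {features, label} -> (x, y).
    # Feature names and values here are hypothetical.
    labeled_example = (
        {"words": ["one", "weird", "trick"],   # words in the email text
         "sender": "promo@example.com",        # sender's address
         "hour_sent": 3},                      # time of day the email was sent
        "spam",                                # the label y
    )

    # An unlabeled example has features but no label: {features, ?} -> (x, ?).
    unlabeled_example = {
        "words": ["meeting", "tomorrow"],
        "sender": "friend@example.com",
        "hour_sent": 14,
    }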

Models

A model defines the relationship between features and labels. For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life: Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and labels. Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y'). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
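As a rough sketch of these two phases, the snippet below trains a tiny linear model on labeled examples and then runs inference on an unlabeled example. It assumes scikit-learn, and the house-size data is synthetic; medianHouseValue is simply the label being predicted.

    # Training vs. inference, sketched with scikit-learn on made-up data.
    from sklearn.linear_model import LinearRegression

    # Training: show the model labeled examples (features + labels)
    # so it can learn the relationship between them.
    X_train = [[1200], [1500], [1800], [2400]]   # feature: house size (sq ft)
    y_train = [180000, 220000, 260000, 330000]   # label: medianHouseValue

    model = LinearRegression()
    model.fit(X_train, y_train)

    # Inference: apply the trained model to an unlabeled example
    # to get a useful prediction y'.
    y_prime = model.predict([[2000]])
    print(y_prime)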

Regression vs. classification

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following: What is the value of a house in California? What is the probability that a user will click on this ad? A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following: Is a given email message spam or not spam? Is this an image of a dog, a cat, or a hamster?
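As a quick sketch of the contrast (assuming scikit-learn and tiny invented datasets), a regression model returns a number on a continuous scale, while a classification model returns one of a fixed set of discrete classes.

    # Regression vs. classification on toy data with scikit-learn.
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Regression: predict a continuous value (e.g., a house price).
    reg = LinearRegression()
    reg.fit([[1000], [2000], [3000]], [150000.0, 250000.0, 350000.0])
    print(reg.predict([[2500]]))   # some value on a continuous scale

    # Classification: predict a discrete value (e.g., 1 = spam, 0 = not spam).
    clf = LogisticRegression()
    clf.fit([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1])
    print(clf.predict([[0.75]]))   # a discrete class: 0 or 1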


So now we are going to talk about linear regression. As the previous section explained, a regression model predicts continuous values. In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables). Here's an example: it has long been known that crickets (an insect species) chirp more frequently on hotter days than on cooler days. For decades, professional and amateur scientists have cataloged data on chirps per minute and temperature. As a birthday gift, your Aunt Ruth gives you her cricket database and asks you to learn a model of this relationship. Using this data, you want to explore how temperature relates to chirps per minute.

First, examine your data by plotting it:





Figure 1. Chirps per Minute vs. Temperature in Celsius.

As expected, the plot shows the temperature rising with the number of chirps. Is this relationship between chirps and temperature linear? Yes, you could draw a single straight line like the following to approximate this relationship:




Figure 2. A linear relationship.

True, the line doesn't pass through every dot, but it does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write this relationship as y = mx + b, where y is the temperature in Celsius (the value we're trying to predict), m is the slope of the line, x is the number of chirps per minute (the value of our input feature), and b is the y-intercept.

By convention in ML, you'll write the equation slightly differently: y' = b + w1x1. Here y' is the predicted label (the desired output), b is the bias (the y-intercept), sometimes referred to as w0, w1 is the weight of feature 1 (weight is the same concept as the "slope" m in the traditional equation of a line), and x1 is a feature (a known input).

To infer (predict) the temperature y' for a new chirps-per-minute value x1, just substitute the x1 value into this model. Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows: y' = b + w1x1 + w2x2 + w3x3.
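In code, the one-feature model is just a weighted sum plus a bias. This sketch uses invented values for b and w1 rather than values fitted to real cricket data.

    # y' = b + w1*x1 for the cricket example, with made-up parameters.
    b = 3.0    # bias (the y-intercept), sometimes written w0
    w1 = 0.2   # weight of feature 1 (the "slope" m in y = mx + b)

    def predict_temperature(chirps_per_minute):
        """Infer the temperature y' for a new chirps-per-minute value x1."""
        return b + w1 * chirps_per_minute

    print(predict_temperature(80))   # y' when x1 = 80

    # With three features, the model just adds more weighted terms:
    # y' = b + w1*x1 + w2*x2 + w3*x3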

Training and Loss

Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.

Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. For example, Figure 3 shows a high-loss model on the left and a low-loss model on the right. Note the following about the figure:

  • The blue lines represent predictions.

  • The arrows represent loss.



Figure 3. High loss in the left model; low loss in the right model.

Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.

You might be wondering whether you could create a mathematical function—a loss function—that would aggregate the individual losses in a meaningful fashion.

Squared loss: a popular loss function

The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is:

    squared loss = the square of the difference between the label and the prediction
                 = (observation - prediction(x))^2
                 = (y - y')^2

Mean squared error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for the individual examples and then divide by the number of examples:

    MSE = (1/N) Σ(x,y)∈D (y - prediction(x))^2

where:

  • (x, y) is an example in which x is the set of features (for example, chirps per minute, age, gender) that the model uses to make predictions, and y is the example's label (for example, temperature).

  • prediction(x) is a function of the weights and bias in combination with the set of features x.

  • D is a dataset containing many labeled examples, which are (x, y) pairs.

  • N is the number of examples in D.

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
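Here is a small sketch of the MSE calculation; the toy model and the cricket readings below are invented, and NumPy is used only for the vectorized variant.

    # MSE = (1/N) * sum over (x, y) in D of (y - prediction(x))^2
    import numpy as np

    def prediction(x, b=3.0, w1=0.2):
        return b + w1 * x                 # the model y' = b + w1*x1

    # D: labeled examples as (chirps per minute, observed temperature) pairs.
    D = [(60, 15.0), (80, 19.5), (100, 22.8), (120, 27.1)]
    N = len(D)

    mse = sum((y - prediction(x)) ** 2 for x, y in D) / N
    print(mse)

    # Equivalent vectorized form:
    xs = np.array([x for x, _ in D])
    ys = np.array([y for _, y in D])
    print(np.mean((ys - prediction(xs)) ** 2))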

