Intro to Machine Learning, Pt. 2 (continued)
The previous Machine Learning article finished with a discussion of loss. In this article, you'll learn how a machine learning model iteratively reduces loss. Iterative learning is a bit like the game "Hot and Cold," in which you have to find a hidden object. Here, the "hidden object" is the best possible model. You start with a wild guess ("The value of w1 is 0.") and wait for the system to tell you what the loss is. Then you try another guess ("The value of w1 is 0.5.") and see what the loss is. If you play this game right, you'll usually be getting warmer. The real trick is to find the best possible model as efficiently as possible. The following figure shows the iterative trial-and-error process that machine learning algorithms use to train a model:
data:image/s3,"s3://crabby-images/4130a/4130a19fd821674ef4823c35e8eedf65873b26de" alt=""
Figure 1. An iterative approach to training a model.
We'll use this same iterative approach throughout the Machine Learning articles. Iterative strategies are prevalent in machine learning, primarily because they scale so well to large data sets.
The "model" takes one or more features as input and returns one prediction (y′) as output. To simplify, consider a model that takes one feature and returns one prediction: y′ = b + w1x1. What initial values should we set for b and w1? For linear regression problems, it turns out that the starting values aren't important. We could pick random values, but we'll just take the following trivial values instead: b = 0, w1 = 0. Suppose that the first feature value is 10. Plugging that feature value into the prediction function yields: y′ = 0 + 0⋅10 = 0.
The "Compute Loss" part of the diagram is the loss function that the model will use. Suppose we use the squared loss function. The loss function takes in two input values:
y′: The model's prediction for features x.
y: The correct label corresponding to features x.
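As a minimal sketch of the prediction and squared-loss steps just described, the snippet below plugs in the trivial starting values and the feature value 10 from the text. The function names and the label value `y_true = 4.0` are assumptions made up for illustration; the article does not specify a label.

```python
def predict(b, w1, x1):
    """One-feature linear model: y' = b + w1 * x1."""
    return b + w1 * x1


def squared_loss(y_pred, y_true):
    """Squared loss for a single example: (y - y')^2."""
    return (y_true - y_pred) ** 2


b, w1 = 0.0, 0.0   # trivial starting values from the text
x1 = 10.0          # first feature value from the text
y_true = 4.0       # hypothetical label, not given in the article

y_pred = predict(b, w1, x1)
loss = squared_loss(y_pred, y_true)
print(f"prediction = {y_pred}, squared loss = {loss}")
```

With these starting values the prediction is 0 regardless of the feature, so the loss is simply the square of the label.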
At last, we've reached the "Compute parameter updates" part of the diagram. It is here that the machine learning system examines the value of the loss function and generates new values for b and w1. For now, just assume that this mysterious box devises new values, and then the machine learning system re-evaluates all those features against all those labels, yielding a new value for the loss function, which in turn yields new parameter values. Learning continues to iterate until the algorithm discovers the model parameters with the lowest possible loss. Usually, you iterate until the overall loss stops changing, or at least changes extremely slowly. When that happens, we say that the model has converged.
The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.
Suppose we had the time and the computing resources to calculate the loss for all possible values of w1. For the kind of regression problems we've been examining, the resulting plot of loss vs. w1 will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:
data:image/s3,"s3://crabby-images/3dd6d/3dd6d147e2ff91255227b930a44e388490e6b464" alt=""
Figure 2. Regression problems yield convex loss vs. weight plots.
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
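To make the bowl shape concrete, here is a rough sketch that computes the mean squared loss for a grid of candidate w1 values. The toy data set (generated from y = 2x), the fixed bias of 0, and the grid range are all assumptions chosen for illustration.

```python
def mean_squared_loss(w1, xs, ys, b=0.0):
    """Average squared loss of the model y' = b + w1 * x over a data set."""
    return sum((y - (b + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data generated from y = 2x, so the best w1 is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Candidate weights from -2.0 to 6.0 in steps of 0.1.
candidates = [i / 10 for i in range(-20, 61)]
losses = [mean_squared_loss(w, xs, ys) for w in candidates]

best_index = losses.index(min(losses))
best_w1 = candidates[best_index]
print(f"lowest loss {losses[best_index]} at w1 = {best_w1}")
```

On this data the loss falls steadily as w1 approaches the minimum and rises steadily after it, which is the single-minimum bowl the figure describes.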
Calculating the loss function for every conceivable value of w1 over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in machine learning—called gradient descent. The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:
Figure 3. A starting point for gradient descent.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights. Note that a gradient is a vector, so it has both of the following characteristics:
a direction
a magnitude
The gradient always points in the direction of the steepest increase in the loss function. The gradient descent algorithm therefore takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible. The algorithm then repeats this process, edging ever closer to the minimum.
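The direction information carried by the gradient can be sketched for the single-weight case as follows. The example assumes a mean squared loss, a bias fixed at 0, and a made-up data set generated from y = 2x; the function name is hypothetical.

```python
def loss_gradient(w1, xs, ys, b=0.0):
    """Derivative of the mean squared loss with respect to w1.

    d/dw1 mean((y - (b + w1*x))^2) = mean(-2 * x * (y - (b + w1*x)))
    """
    return sum(-2 * x * (y - (b + w1 * x)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data generated from y = 2x (minimum at w1 = 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

grad_left = loss_gradient(0.5, xs, ys)    # starting point left of the minimum
grad_right = loss_gradient(3.0, xs, ys)   # starting point right of the minimum
grad_at_min = loss_gradient(2.0, xs, ys)  # at the minimum the slope is 0

for name, grad in [("left", grad_left), ("right", grad_right), ("minimum", grad_at_min)]:
    direction = "increase w1" if grad < 0 else "decrease w1" if grad > 0 else "stay put"
    print(f"{name}: gradient = {grad}, negative-gradient step says: {direction}")
```

A negative gradient means the loss decreases as w1 grows, so stepping against the gradient moves w1 to the right, and the reverse holds for a positive gradient.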
As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
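The update rule can be sketched as a short loop that keeps stepping until the loss barely changes. The data set, the stopping tolerance, the step cap, and all function names below are illustrative assumptions, and the bias is held at 0 to keep the example one-dimensional.

```python
def mean_squared_loss(w1, xs, ys):
    """Mean squared loss of the model y' = w1 * x (bias fixed at 0)."""
    return sum((y - w1 * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def loss_gradient(w1, xs, ys):
    """Derivative of the mean squared loss with respect to w1."""
    return sum(-2 * x * (y - w1 * x) for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w1=0.0, learning_rate=0.01,
                     tolerance=1e-12, max_steps=10_000):
    """Step against the gradient until the loss changes by less than tolerance."""
    loss = mean_squared_loss(w1, xs, ys)
    for step in range(1, max_steps + 1):
        # Next point = current point - learning rate * gradient.
        w1 = w1 - learning_rate * loss_gradient(w1, xs, ys)
        new_loss = mean_squared_loss(w1, xs, ys)
        if abs(loss - new_loss) < tolerance:
            return w1, step
        loss = new_loss
    return w1, max_steps


# The arithmetic from the text: gradient magnitude 2.5, learning rate 0.01.
step_size = 0.01 * 2.5

# Hypothetical toy data generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
final_w1, steps_taken = gradient_descent(xs, ys)
print(f"step size = {step_size:.3f}, w1 = {final_w1:.4f} after {steps_taken} steps")
```

Stopping on a small change in loss mirrors the convergence idea described earlier; a real training loop would typically also update b and might use a different stopping rule.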
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long.
Figure 6. The learning rate is too small.
Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:
Figure 7. The learning rate is too large.
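The effect of the learning rate can be seen in a small experiment. The sketch below assumes a made-up one-dimensional loss, f(w1) = (w1 - 2)^2, whose gradient is 2(w1 - 2); the three learning rates are arbitrary picks meant to show the too-small, too-large, and reasonable cases on this particular function.

```python
def run_gradient_descent(learning_rate, w1=0.0, steps=20):
    """Return the sequence of w1 values visited on f(w1) = (w1 - 2)^2."""
    path = [w1]
    for _ in range(steps):
        gradient = 2 * (w1 - 2.0)          # derivative of (w1 - 2)^2
        w1 = w1 - learning_rate * gradient
        path.append(w1)
    return path


too_small = run_gradient_descent(0.01)   # creeps toward the minimum
too_large = run_gradient_descent(1.0)    # jumps back and forth across it
reasonable = run_gradient_descent(0.3)   # settles near the minimum quickly

print("too small :", [round(w, 3) for w in too_small[-3:]])
print("too large :", [round(w, 3) for w in too_large[:5]])
print("reasonable:", [round(w, 3) for w in reasonable[-3:]])
```

On this function, the small rate is still far from the minimum at w1 = 2 after 20 steps, while a rate of 1.0 bounces between the same two points forever and never reduces the loss.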
There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
Figure 8. The learning rate is just right.
The ideal learning rate in one dimension is 1/f″(x) (the inverse of the second derivative of f(x) at x). The ideal learning rate for two or more dimensions is the inverse of the Hessian (the matrix of second partial derivatives). The story for general convex functions is more complex.
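For a quadratic loss, the one-dimensional claim is easy to check by hand. The sketch below assumes the made-up loss f(w1) = (w1 - 2)^2, whose second derivative is the constant 2, so the ideal learning rate under this rule is 1/2. The starting point is arbitrary.

```python
def gradient(w1):
    """First derivative of f(w1) = (w1 - 2)^2."""
    return 2 * (w1 - 2.0)


SECOND_DERIVATIVE = 2.0                    # f''(w1) is constant for this quadratic
ideal_learning_rate = 1 / SECOND_DERIVATIVE

w1_start = 0.0                             # arbitrary starting point
w1_next = w1_start - ideal_learning_rate * gradient(w1_start)
print(f"after one step: w1 = {w1_next}, gradient = {gradient(w1_next)}")
```

Because the loss here is exactly quadratic, a single step with this learning rate lands on the minimum. For losses that are not quadratic the second derivative varies with position, so the same rule gives only a local estimate.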
data:image/s3,"s3://crabby-images/33896/338961045223129074984a8d88798844061b778d" alt=""