Hey! This is part 4 of the Intro To Machine Learning series, so if you haven't read the first three articles yet, go check those out first. The first thing we're going to cover is Training and Test Sets: Splitting Data. A training set is a subset of the data used to train a model. A test set is a subset used to test the trained model. You could imagine slicing the single data set as follows:

Figure 1. Slicing a single data set into a training set and test set.
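If you're working in Python, scikit-learn's train_test_split is one common way to do this slicing. Here's a minimal sketch on made-up toy data; the arrays are placeholders for your own features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 1,000 examples with 5 features and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Hold out 20% of the examples as a test set. stratify=y keeps the label
# distribution roughly the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```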
Make sure that your test set meets the following two conditions: it is large enough to yield statistically meaningful results, and it is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set. Assuming that your test set meets these two conditions, your goal is to create a model that generalizes well to new data. The test set serves as a proxy for new data. For example, consider the following figure. Notice that the model learned for the training data is very simple. This model doesn't do a perfect job; a few predictions are wrong. However, it does about as well on the test data as it does on the training data. In other words, this simple model does not overfit the training data.

Figure 2. Validating the trained model against test data.
Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set; for example, high accuracy might indicate that test data has leaked into the training set. Consider a model that predicts whether an email is spam, using the subject line, email body, and sender's email address as features. We apportion the data into training and test sets with an 80-20 split. After training, the model achieves 99% precision on both the training set and the test set. We'd expect a lower precision on the test set, so we take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set (we neglected to scrub duplicate entries for the same spam email from our input database before splitting the data). We've inadvertently trained on some of our test data, and as a result we're no longer accurately measuring how well our model generalizes to new data.
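One simple guard against this kind of leakage is to scrub exact duplicates before splitting. Here's a small sketch using pandas; the column names and rows are hypothetical, just to illustrate the idea:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical spam data with the columns described above.
emails = pd.DataFrame({
    "subject": ["Win big now!!!", "Meeting at 3pm", "Win big now!!!"],
    "body":    ["Click here.", "Agenda attached.", "Click here."],
    "sender":  ["spam@example.com", "boss@example.com", "spam@example.com"],
    "is_spam": [1, 0, 1],
})

# Drop duplicate emails *before* splitting, so the same example can never
# land in both the training set and the test set.
emails = emails.drop_duplicates(subset=["subject", "body", "sender"])

train_df, test_df = train_test_split(emails, test_size=0.2, random_state=42)
```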
Validation Set
So now we have this powerful test and training set methodology. Let's imagine we're using it in practice. We've got our test set, we've got our training set, and we did a good job separating them out. Now we're gonna do some iterations: I'm going to train a model on my training data, test it on my test data, and observe its metrics. Then I'm gonna tweak things to try to improve my accuracy and do that again. Maybe I add some features, maybe I take some out. I keep iterating and iterating until I get the best possible model I can, based on my test set metrics. Can you spot any problems here? Well, one thing I could imagine is that I'm starting to overfit to the quirks of my test data. Welp, that's too bad. So here's another way to handle that: I can create a third data set out of my partitions, which I'm gonna call my "validation data". And I'm gonna use a new, slightly augmented, iterative approach, where I do my iterations by training on my training data and then evaluating only on my validation data, keeping my test data way off to the side and completely unused.
I'm gonna iterate and iterate, tweaking whatever parameters or making whatever changes I want to my model until I get very good results on my validation data. Then, and only then, am I going to test my model on the final test data, and I'm gonna make sure that the results I'm getting on the test data basically match what I'm getting on my validation data. If they don't, that's a pretty good signal that I was probably overfitting to the validation set.
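One possible way to carve out that third partition is to apply train_test_split twice. Here's a minimal sketch, again on toy stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve off the test set, then split the remainder into training and
# validation sets. This yields roughly a 60/20/20 split overall.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```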
The previous paragraphs introduced partitioning a data set into a training set and a test set. This partitioning enabled you to train on one set of examples and then to test the model against a different set of examples. With two partitions, the workflow could look as follows:

Figure 3. A possible workflow?
In the figure, "Tweak model" means adjusting anything about the model you can dream up, from changing the learning rate, to adding or removing features, to designing a completely new model from scratch. At the end of this workflow, you pick the model that does best on the test set. Dividing the data set into two sets is a good idea, but not a panacea. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:

Figure 4. Slicing a single data set into three subsets.
Use the validation set to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. The following figure shows this new workflow:

Figure 5. A better workflow.
In this improved workflow, pick the model that does best on the validation set, then double-check that model against the test set.
This is a better workflow because it creates fewer exposures to the test set.
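To make the loop concrete, here's a hedged sketch that continues the toy three-way split from earlier. It tunes one hypothetical knob (the regularization strength of a logistic regression) against the validation set, and touches the test set exactly once at the end:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical knob to iterate on; in practice "tweak model" could also mean
# changing features, the learning rate, or the model architecture itself.
best_model, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Only after model selection is finished do we touch the test set, once.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"validation accuracy: {best_val_acc:.3f}, test accuracy: {test_acc:.3f}")
```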
Representation: Feature Engineering
Normally in programming, the focus is on the code. In machine learning, however, the focus shifts to representation. A machine learning model can't hear, see, or sense input examples directly; instead, you create a representation of the data that gives the model a useful vantage point into the data's key qualities. To train a model, you must choose the set of features that best represent the data. So let's talk about mapping raw data to features. Raw data, also known as primary data, is data collected directly from a source (in the context of an exam, for example, the raw data would be the raw scores); input data is simply data that serves as input to a program. The left side of the figure below illustrates raw data from an input data source; the right side illustrates a feature vector, which is the set of floating-point values comprising the examples in your data set. Feature engineering means transforming raw data into a feature vector, and you should expect to spend significant time doing it. Many machine learning models must represent the features as real-numbered vectors, because the feature values must be multiplied by the model weights.
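As a toy illustration (the field names here are hypothetical, not from any particular data set), mapping numeric raw values into a feature vector can be as simple as copying them over as floats:

```python
# A hypothetical raw example for a housing data set, and a tiny helper that
# turns it into a feature vector of floating-point values.
raw_example = {
    "num_rooms": 6,
    "num_bedrooms": 3,
    "street_name": "Shorebird Way",  # strings need extra work; see below
}

def to_feature_vector(example):
    # Numeric raw values can often be copied over to the feature vector
    # directly, as floats. The string feature is handled in the next part.
    return [float(example["num_rooms"]), float(example["num_bedrooms"])]

print(to_feature_vector(raw_example))  # [6.0, 3.0]
```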

Next we will be talking about mapping categorical values. Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include: {‘Charleston Road’, ‘North Shoreline Boulevard’, ‘Shorebird Way’, ‘Rengstorff Avenue’}. Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values. We can accomplish this by defining a mapping from the feature values, which we'll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.
Using this approach, here's how we can map our street names to numbers: map Charleston Road to 0, map North Shoreline Boulevard to 1, map Shorebird Way to 2, map Rengstorff Avenue to 3, and map everything else (OOV) to 4. However, if we incorporate these index numbers directly into our model, it will impose some constraints that might be problematic. We'll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on. Consider a model that predicts house prices using street_name as a feature. It is unlikely that there is a linear adjustment of price based on the street name, and furthermore this would assume you have ordered the streets based on their average house price.
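For concreteness, here's a small sketch of that integer mapping with an OOV bucket. This is the naive encoding whose drawbacks we just discussed, and which the next paragraph improves on:

```python
# Building the street_name vocabulary described above, plus an OOV bucket.
vocabulary = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]
street_to_index = {name: i for i, name in enumerate(vocabulary)}
OOV_INDEX = len(vocabulary)  # 4: every street not in the vocabulary

def street_index(name):
    return street_to_index.get(name, OOV_INDEX)

print(street_index("Shorebird Way"))  # 2
print(street_index("Main Street"))    # 4 (out of vocabulary)
```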
Our model needs the flexibility to learn a different weight for each street, which is then added to the price estimated using the other features. A single index also can't account for cases where street_name takes multiple values. For example, many houses are located at the corner of two streets, and there's no way to encode that information in the street_name value if it contains a single index. To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows: for values that apply to the example, set the corresponding vector elements to 1, and set all other elements to 0. The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.
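Building on the street_index helper from the previous sketch, the binary-vector encoding might look like this:

```python
def multi_hot(street_names, vocab_size=5):
    # Return a binary vector with a 1 for every street that applies.
    # With a single street this is a one-hot encoding; with several
    # streets (e.g. a corner lot) it becomes a multi-hot encoding.
    vector = [0] * vocab_size
    for name in street_names:
        vector[street_index(name)] = 1
    return vector

print(multi_hot(["Shorebird Way"]))                       # [0, 0, 1, 0, 0]
print(multi_hot(["Shorebird Way", "Rengstorff Avenue"]))  # [0, 0, 1, 1, 0]
```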