This is part 5 of the Intro To Machine Learning series; if you haven't read the first four articles, start there for the necessary background. We pick up where the last article left off: it covered training sets and introduced Representation: Feature Engineering. Intro To Machine Learning #4 ended with this paragraph: Our model needs the flexibility to learn a different weight for each street, which is added to the price estimated from the other features. This still doesn't account for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there's no way to encode that information in the street_name value if it contains a single index. To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows: for values that apply to the example, set the corresponding vector elements to 1, and set all other elements to 0. The length of this vector equals the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1. Figure 3 shows a one-hot encoding of a particular street, Shorebird Way: the element in the binary vector for Shorebird Way has a value of 1, while the elements for all other streets have values of 0.

Figure 3. Mapping street address via one-hot encoding.
This approach effectively creates a Boolean variable for every feature value (e.g., every street name). If a house is on Shorebird Way, the binary value is 1 only for Shorebird Way, so the model uses only the weight for Shorebird Way. Similarly, if a house is at the corner of two streets, two binary values are set to 1, and the model uses both their respective weights.
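The vector construction above can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder; the vocabulary of street names here is hypothetical, with Shorebird Way included so the two cases from the text can be shown:

```python
# Hypothetical street vocabulary; the vector length equals its size.
vocabulary = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]

def multi_hot(streets, vocab):
    """Return a binary vector with a 1 for every street that applies."""
    return [1 if name in streets else 0 for name in vocab]

# One-hot: the house is on a single street.
print(multi_hot(["Shorebird Way"], vocabulary))
# -> [0, 0, 1, 0]

# Multi-hot: the house sits at the corner of two streets,
# so both corresponding elements are 1 and both weights are used.
print(multi_hot(["Charleston Road", "Rengstorff Avenue"], vocabulary))
# -> [1, 0, 0, 1]
```

In a real pipeline you would typically use a library encoder (for example, scikit-learn's OneHotEncoder or pandas' get_dummies) rather than hand-rolling the vectors, but the underlying representation is the same.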
We've explored ways to map raw data into suitable feature vectors, but that's only part of the work. We must now explore what kinds of values actually make good features within those feature vectors. First, avoid rarely used discrete feature values. Good feature values should appear more than 5 or so times in a data set; this gives the model enough examples to learn how the feature value relates to the label. That is, having many examples with the same discrete value lets the model see the feature in different settings and, in turn, determine when it's a good predictor for the label. For example, a house_type feature would likely contain many examples in which its value was victorian: house_type: victorian. Conversely, if a feature's value appears only once or very rarely, the model can't make predictions based on that feature. For example, unique_house_id is a bad feature because each value would be used only once, so the model couldn't learn anything from it: unique_house_id: 8KSHFSIJFSL134ZZZ.
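One simple way to apply this guideline is to count how often each discrete value occurs and keep only those above a threshold. The sketch below uses a hypothetical house_type column and an assumed cutoff of 5 occurrences, matching the rule of thumb above:

```python
from collections import Counter

# Rule of thumb from the text: a value should appear more than ~5 times.
MIN_COUNT = 5

# Hypothetical house_type column.
house_types = ["victorian"] * 12 + ["ranch"] * 7 + ["geodesic_dome"] * 1

counts = Counter(house_types)
usable = {value for value, n in counts.items() if n >= MIN_COUNT}

print(usable)  # geodesic_dome appears only once, so it is excluded
# -> {'victorian', 'ranch'}
```

Rare values filtered out this way are often lumped into a single catch-all category (e.g., "other") rather than dropped entirely, so the examples themselves aren't lost.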
Second, prefer clear and obvious meanings: each feature should have a meaning that is clear to anyone on the project. For example, the following good feature is clearly named, and the value makes sense with respect to the name: house_age_years: 27. Conversely, the meaning of the following feature value is pretty much indecipherable to anyone but the engineer who created it: house_age: 8514782000. In some cases, noisy data (rather than bad engineering choices) causes unclear values. For example, the following user_age_years came from a source that didn't check for appropriate values: user_age_years: 277.
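A cheap defense against noisy values like user_age_years: 277 is a plausibility check at ingestion time. This is a minimal sketch; the bounds (0 to 120 years) and the sample values are assumptions for illustration:

```python
# Flag values outside an assumed plausible range for a human age.
def is_plausible_age(years, lo=0, hi=120):
    """Return True if the age falls within an assumed human range."""
    return lo <= years <= hi

# Hypothetical values from an unchecked source.
ages = [27, 42, 277]
suspect = [a for a in ages if not is_plausible_age(a)]

print(suspect)
# -> [277]
```

Flagged values can then be logged, dropped, or routed to the missing-value handling described in the next section, depending on how much you trust the source.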
Finally, don't mix "magic" values with actual data. Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values. For example, suppose a feature holds a floating-point value between 0 and 1, so values like the following are fine: quality_rating: 0.82 | quality_rating: 0.37. However, if a user didn't enter a quality_rating, the data set may have represented its absence with a magic value like the following: quality_rating: -1. To explicitly mark magic values, create a Boolean feature that indicates whether a quality_rating was supplied, with a name like quality_rating_defined. Then, in the original feature, replace the magic values as follows. For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing. For continuous variables, ensure missing values do not affect the model by substituting the mean value of the feature's data.
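For the continuous case, the replacement described above can be sketched as follows. The magic value -1 and the sample ratings are taken from the text; the small data set is hypothetical:

```python
# The data set marks missing ratings with the magic value -1.
MAGIC = -1.0
ratings = [0.82, 0.37, MAGIC, 0.55, MAGIC]

# Boolean feature marking whether a rating was actually supplied.
quality_rating_defined = [r != MAGIC for r in ratings]

# Replace each magic value with the mean of the real ratings,
# so the missing entries don't pull the model off in one direction.
real = [r for r in ratings if r != MAGIC]
mean = sum(real) / len(real)
quality_rating = [r if r != MAGIC else mean for r in ratings]

print(quality_rating_defined)
# -> [True, True, False, True, False]
print(quality_rating)  # missing entries replaced by the mean of 0.82, 0.37, 0.55
```

Because the model also sees quality_rating_defined, it can learn a separate adjustment for houses whose rating was never entered, instead of treating the imputed mean as a genuine observation.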