Feature Selection in Machine Learning
Selecting the right features for your model is both an art and a science. It is an art because good domain knowledge can save a lot of pain: you often know intuitively which features matter, even if you do not know how much influence each one carries.
It is, of course, a science because of the mathematical rigour involved in deciding whether a feature earns its place. Today, we will first understand why feature selection is an important aspect of Machine Learning, and then how we go about selecting the right features.
Why Feature Selection?
Suppose we have 100 variables in our data as potential features. We want to know which set of features gives the best accuracy, precision or whatever metric we are optimising for, and which of them even contribute towards predicting the target variable. Finding that out can involve a lot of trial and error.
One way would be a brute-force method: try every available combination of variables and check which one predicts best. That means trying one variable at a time for all hundred variables, then every pair, then every triple, and so on, which leads to 2 to the power of 100 combinations.
Even with just 10 independent variables, that is 1,024 combinations, and with 20 variables it snowballs to 1,048,576. Hence brute force does not sound like an option at all.
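(A throwaway snippet, just to see how quickly the count grows; it is not part of the modelling code.)
# The number of possible feature subsets grows as 2^n
for n in (10, 20, 100):
    print(n, 2 ** n)   # 10 -> 1024, 20 -> 1048576, 100 -> a 31-digit number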
Then, how do we go about selecting the features? There are two ways of dealing with it - Manual or automated feature elimination.
As you can guess, manual feature elimination is practical only when the number of variables is small (say 10, to a maximum of 15); it becomes prohibitive when the number of features grows large. Beyond that, you have no choice but to go for automated feature elimination.
Let us look at both of them.
Manual Feature Elimination
As already mentioned, this is possible only when you have fewer variables.
The steps involved are:
Build a model
Check for redundant or insignificant variables
Remove the most insignificant one and go back to step 1
Right. You build the model and then try to drop the features that are least helpful in the prediction. How do you know that a variable is least helpful? Two factors can be looked at: the p-value of each variable, or its VIF (Variance Inflation Factor).
The p-value is a concept from hypothesis testing and statistical inference, which I do not plan to explain today; you will have come across it if you have done any regression modelling. In a nutshell, this is what you need to know about the p-value:
P-value is a measure of the probability that an observed difference could have occurred just by random chance.
The lower the p-value, the greater the statistical significance of the observed difference
Therefore, if a variable exhibits a high p-value, typically greater than 0.05, you can remove that feature.
To know more about VIF, please refer to my article on Multicollinearity. To summarise here: VIF measures how well one predictor can be predicted from one or more of the other predictors. If the VIF is high, there is a strong association between that predictor and the rest, which means the predictor is redundant and can be removed.
As a rule of thumb, if a feature has a VIF greater than 5 (just a heuristic), it can be eliminated, as it is strongly collinear with some of the other features and hence redundant.
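For intuition, the VIF of a predictor equals 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the other predictors. A minimal sketch, assuming a pandas DataFrame X of predictors and using 'bedrooms' purely as an illustrative column name:
import statsmodels.api as sm

# Regress one predictor on all the others and convert its R-squared into a VIF
others = sm.add_constant(X.drop(columns='bedrooms'))
r_squared = sm.OLS(X['bedrooms'], others).fit().rsquared
vif_bedrooms = 1 / (1 - r_squared)   # a VIF above 5 suggests the predictor is redundant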
This process is repeated one variable at a time: the model is rebuilt, the same checks are made, and any other insignificant or redundant variables are removed one by one, until only significant variables remain in the model.
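In code, one round of this manual loop might look roughly like the following sketch (X_train and y_train are assumed to be an already prepared pandas DataFrame and Series):
import statsmodels.api as sm

# Fit the model on the current set of features (constant added explicitly)
X = sm.add_constant(X_train)
results = sm.OLS(y_train, X).fit()

# Inspect the p-values, find the weakest feature, and drop it if insignificant
pvalues = results.pvalues.drop('const')
worst = pvalues.idxmax()
if pvalues[worst] > 0.05:
    X_train = X_train.drop(columns=worst)
# ...then rebuild the model and repeat until every remaining feature is significant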
As you can see, this is a tedious process. Let us see an example before we move on to automated feature elimination.
This example is for predicting house prices from 13 features. A heatmap of the feature correlations is shown here:
I start with one feature that seems highly correlated with the price, i.e. 'area'.
When I create the linear regression model with just this variable, I get the summary as this:
(This is marked as Step 1 in the code provided in the Jupyter notebook later)
OLS Regression Results
===================================================================
Dep. Variable: price R-squared: 0.283
Model: OLS Adj. R-squared: 0.281
Method: Least Squares F-statistic: 149.6
Date: Mon, 25 May 2020 Prob (F-statistic): 3.15e-29
Time: 09:43:04 Log-Likelihood: 227.23
No. Observations: 381 AIC: -450.5
Df Residuals: 379 BIC: -442.6
Df Model: 1
Covariance Type: nonrobust
===================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------
const 0.1269 0.013 9.853 0.000 0.102 0.152
area 0.4622 0.038 12.232 0.000 0.388 0.536
===================================================================
Omnibus: 67.313 Durbin-Watson: 2.018
Prob(Omnibus): 0.000 Jarque-Bera (JB): 143.063
Skew: 0.925 Prob(JB): 8.59e-32
Kurtosis: 5.365 Cond. No. 5.99
===================================================================
The R-squared value obtained is 0.283. We should certainly improve the value. So we add the second most highly correlated variable, i.e. 'bathrooms'. (This is Step 2 in the code)
The R-squared value then improves to 0.480. Adding a third variable, 'bedrooms' (Step 3), improves it to 0.505. When I add all 13 variables (Step 4), the R-squared rises to 0.681. So, clearly, not all variables are contributing in a big way.
Then, I use VIF to check for redundant variables:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of every column in the training features
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by="VIF", ascending=False)
vif
The result I get is:
Clearly, 'bedrooms' has a high VIF, implying that it is largely explainable by the other variables. However, I also check the p-values.
Among the p-values, I see that 'semi-furnished' has a very high p-value of 0.938. I drop this variable and rebuild the model (Step 5).
When I check the p-values and VIFs again, I notice that 'bedrooms' has both a high VIF of 6.6 and a high p-value of 0.206, so I drop it next (Step 6).
Finally, in Step 7, I note that all VIFs are below 5, but 'basement' still has the highest p-value (0.03) among the remaining features. This is dropped and the model is rebuilt.
This leads us to a point where all remaining features show a significant p-value of < 0.05 and VIFs of < 5.
These remaining 10 features are taken as the selected features for model development.
Here is the Jupyter notebook showing all the steps explained above.
Now, let us see how we can improve on this with the help of automated feature elimination.
Automated Feature Elimination
There are multiple ways of automating the feature selection or elimination. Some of the often used methods are:
Recursive Feature Elimination (RFE) - Top n features
Forward, Backward or Stepwise selection - based on selection criteria like AIC, BIC
Lasso Regularization
Here, AIC is the Akaike Information Criterion and BIC is the Bayesian Information Criterion, two criteria commonly used for model comparison.
We will look at each of these, with small illustrative sketches, before I share a full code-based example for one of these methods.
Recursive Feature Elimination
This is where we ask the algorithm to select the top 'n' features, where n is based on your experience of the domain. It could be the top 15 or 20, depending on how many features you think truly influence your problem; the number is admittedly somewhat arbitrary.
Given the features and the value of 'n', the RFE module repeatedly fits the model and prunes the least significant features until it is left with the top n features that have the maximum influence.
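A minimal sketch of how this looks with scikit-learn's RFE, assuming a prepared X_train and y_train (the full code example later follows the same pattern):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Ask RFE to keep the 10 most influential features (the choice of 10 is ours)
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # selected features are given rank 1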
Forward, Backward, Stepwise feature selection
Forward selection is where you start with a single variable and build a model. You then add one variable at a time and, based on a criterion such as AIC, keep adding until you see no further benefit (see the sketch after this list).
Backward selection is when you start with all the features and remove one variable at a time until the metric no longer improves.
Stepwise selection is where you alternate between adding and removing variables until you arrive at a good subset of features that contribute to your metric.
In practice, stepwise selection is the popular choice, though backward and stepwise selection tend to give very similar results. All of this is done automatically by libraries that already implement these methods.
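As referenced above, here is a rough sketch of forward selection driven by AIC, assuming a prepared X_train and y_train (statsmodels exposes the AIC of a fitted OLS model as .aic):
import statsmodels.api as sm

# Greedily add the feature that lowers AIC the most; stop when nothing helps
selected = []
remaining = list(X_train.columns)
best_aic = float('inf')
while remaining:
    aic_scores = {}
    for col in remaining:
        X = sm.add_constant(X_train[selected + [col]])
        aic_scores[col] = sm.OLS(y_train, X).fit().aic
    best_col = min(aic_scores, key=aic_scores.get)
    if aic_scores[best_col] >= best_aic:   # no further improvement: stop
        break
    best_aic = aic_scores[best_col]
    selected.append(best_col)
    remaining.remove(best_col)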
Lasso Regularization
This form of regularization shrinks the coefficients of redundant features all the way to zero, effectively removing them from the model. Regularization itself is a topic worth an in-depth look in another article.
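A minimal sketch with scikit-learn's Lasso (the alpha value is an arbitrary choice for illustration; the features are standardised first so the penalty treats them comparably):
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Lasso drives some coefficients exactly to zero; the non-zero ones are the selected features
model = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
model.fit(X_train, y_train)
coefs = model.named_steps['lasso'].coef_
print(X_train.columns[coefs != 0])   # the features that survive the penalty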
Here is the Jupyter Notebook with the same house price prediction example with recursive feature elimination:
Recursive Feature Elimination through Code - Example
A brief explanation here:
I am using the RFE module provided by scikit-learn, so I also use the LinearRegression module from the same library, since RFE needs an estimator to work with. Follow along from Step 1; everything before that is preliminary data preparation.
In Step 1, RFE() is passed the model already created as 'lm' and 10, to say I want the top 10 features. It immediately marks the top 10 features as rank 1.
This line lets us see which features made the top 10:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
I take only these features to start building my model now.
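The selected columns can then be pulled out, for example like this (the variable name X_train_rfe is my own, not from the notebook):
# Keep only the columns that RFE marked as selected
selected_cols = X_train.columns[rfe.support_]
X_train_rfe = X_train[selected_cols]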
The rest of the steps, from Step 2 onwards, show how to apply the manual elimination method after RFE, which is what I call the Balanced Approach. Let me explain this approach before we go back to the code.
Balanced Approach
This is the most pragmatic approach: a combination of automated and manual feature elimination. When you have a large number of features, say 50, 100 or more, you use automated elimination to cut them down to the top 15 or 20, and then use manual elimination to narrow that down to the truly important features.
The automated method helps in coarse tuning while the manual method helps in fine-tuning the features selected.
Continuing the code from Step 2:
In the code, I used RFE to arrive at the top 10 features. With those in hand, I go back to building the linear regression model, checking p-values and VIFs, and deciding what else needs to be eliminated.
For that, I build the linear regression model using the statsmodels library, since its model summary reports the p-values. (The LinearRegression module of scikit-learn does not offer this.)
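In outline, that refit might look like this (carrying over the illustrative X_train_rfe variable from the earlier sketch):
import statsmodels.api as sm

# Refit on the RFE-selected columns; the statsmodels summary reports a p-value per coefficient
X_train_sm = sm.add_constant(X_train_rfe)
lr = sm.OLS(y_train, X_train_sm).fit()
print(lr.summary())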
The R-squared value is a decent 0.669. However, 'bedrooms' is still insignificant, so I drop that variable in Step 3.
Upon rebuilding that model, I see that there are no high p-values. I check VIF and notice all are below 5. Hence these 9 features are shortlisted as the final set of features for the model.
Conclusion
It is important to use only the features that contribute towards predicting the target variable, which is why feature selection (or elimination) matters. There are many ways of doing it, and recursive feature elimination is one of the automated ways.
Manual feature elimination was discussed to help appreciate the concept, but in practice it is rarely used on its own; it is useful only when there are very few features.
There are more advanced techniques of feature elimination like Lasso regularization too.
References:
P-Value definition: https://www.investopedia.com/terms/p/p-value.asp