Feature Scaling and its Importance
Feature Scaling is a very important aspect of data preparation for many Machine Learning algorithms. Let us understand what feature scaling is, why it is important, and when it should be used.
NOTE: For those who are just getting initiated into ML jargon, all the data or variables that are prepared and used as inputs to an ML algorithm are called features.
Why Feature Scaling?
Feature scaling is all about making things comparable: ensuring that one feature does not numerically dominate another.
To explain with an analogy: if I were to mix students from grade 1 to grade 10 for a basketball game, the taller children from the senior classes would always dominate the game. The junior children would be completely lost and might eventually be removed from the game as non-contributors. However, if I separate them by height, each child can contribute meaningfully to the game. Then I can see the potential of each child, how well they can play, and hone their skills further. This somewhat explains the need for feature scaling.
A similar thing can happen when we have features of varying units or sizes that cannot be compared on the same scale, like weight in kilograms and distance in kilometres. Typically, human height averages around 5' 6" (it varies across populations and genders), while weight is measured in kilograms and can vary from, say, 35 kg to 100 kg or more. So, if I were to use these two features in an algorithm, there is a good chance that weight would contribute far more than height, just like the taller kids always dominated the basketball game. If I continued with this, at some point I might eliminate height altogether, calling it insignificant. But height too is an important feature for any health parameter I want to predict, not just weight. It is just that its range is much smaller and gets ignored in the presence of another feature with numerically large values.
A machine deals with numbers as just numbers; it cannot tell that 5.6 is a height and 65 is a weight. It just knows that 65 is much bigger than 5.6 and gives it that much more importance. We need to bring both to comparable "sizes" so that we can view each contribution without bias. That is why we scale all the features to similar magnitudes.
In other words, we need scaling so that one number/feature does not dominate the algorithm just because of its magnitude.
What is Feature Scaling?
From the above, you would have realised that we are trying to bring the magnitudes of all the features to similar scales so that no single feature dominates the algorithm. This is feature scaling.
Types of Feature Scaling
There are many types of feature scaling like
Min Max Scaler
Standard Scaler
Power Transformer Scaler
Unit Vector Scaler
and many more. However, today I will take you through the first two, which are the most popularly used ones.
Min Max Scaler
The Min-Max Scaling is also known as Normalization.
This scaling is represented by the formula:
x_scaled = (x - x_min) / (x_max - x_min)
If you observe what is happening here, you will notice that when x is at its minimum, the numerator is 0, and when x is at its maximum, the numerator and denominator are equal, making the value 1. Thus, this converts the data into a range between 0 and 1.
This is very sensitive to outliers and can cause heavy compression of the data due to the presence of a few rogue outliers, as the small sketch below illustrates.
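To see both the 0-to-1 behaviour and the outlier problem with actual numbers, here is a tiny sketch with made-up values (not from the country dataset used later):
import numpy as np

# Made-up feature values with one rogue outlier (200)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 200.0])

# Min-Max scaling: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
# The outlier maps to 1.0 while the remaining values get squeezed into roughly
# 0.00 to 0.04, illustrating the compression caused by a single outlier.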
Standardisation
This scales the data in such a way that the mean (mu) becomes 0 and the standard deviation (sigma) becomes 1. It is represented by the formula:
z = (x - mu) / sigma
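As a quick illustration with made-up weight values, applying this formula by hand gives data with a mean of 0 and a standard deviation of 1:
import numpy as np

# Made-up weights in kilograms
w = np.array([45.0, 55.0, 65.0, 75.0, 85.0])

# Standardisation: subtract the mean, divide by the standard deviation
w_std = (w - w.mean()) / w.std()
print(w_std)                       # values now centred around 0
print(w_std.mean(), w_std.std())   # ~0.0 and 1.0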
An Example
Let us take an example and see how the data looks. Here we have data on health versus imports for 160+ countries. When plotted, the data spread is as shown. The scales of the two features are different: while health ranges from 2 to 18, imports range from 15 to 180.
So, we would certainly want to scale the two to give equal importance to both by making their magnitudes comparable. I have used the MinMaxScaler and the StandardScaler and plotted both scaled versions of the data in the graph below. You can see there is a linear translation as well as a scaling down of the magnitudes to similar ranges.
The orange dots have undergone Min-Max scaling and the grey dots standard scaling. The MinMaxScaler has scaled the data to between 0 and 1 for both features. The StandardScaler has scaled both features to have a mean of 0 and a standard deviation of 1. In both cases, the shape of the data spread remains the same as in the original data; only the location and scale change. This way, both health and imports are brought to comparable ranges and can contribute equally to the Machine Learning algorithm.
To quickly look at some code, here is how the same is done using Python libraries:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
country = pd.read_csv("CountryHealthImports.csv")
mm_scaler = MinMaxScaler()
# fit_transform learns the min and max of each column and scales the data in one step
country_mm_scaled = pd.DataFrame(mm_scaler.fit_transform(country),
                                 columns=['mm_health', 'mm_imports'])
# Plot the original (unscaled) data
plt.scatter(x=country.health, y=country.imports)
plt.xlabel("Health Spend Per Capita (as % of GDP)")
plt.ylabel("Imports Spend Per Capita (as % of GDP)")
plt.show()
Here, you instantiate MinMaxScaler and call its fit_transform method, which fits the data (finding its min and max) and then scales it using the formula explained earlier, all in one step. The next few lines plot the data.
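One practical point worth adding (not part of the snippet above): in a real modelling workflow, the scaler should learn its min and max from the training data only and then reuse them on the test data. A minimal sketch of that pattern, assuming the same country DataFrame and an illustrative 70/30 split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split of the country data into train and test portions
train, test = train_test_split(country, test_size=0.3, random_state=42)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from the training data only
test_scaled = scaler.transform(test)        # reuses the same min and max, no refitting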
Similar code works for the StandardScaler too (note the extra import):
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
country_ss_scaled = pd.DataFrame(std_scaler.fit_transform(country),
                                 columns=['ss_health', 'ss_imports'])
Just these few lines do the scaling of the data.
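To double-check that the standard scaling behaved as described, a quick sanity check on the scaled DataFrame (country_ss_scaled from above) could look like this:
# Each scaled column should now have a mean of ~0 and a standard deviation of ~1.
# ddof=0 matches the population standard deviation that StandardScaler uses.
print(country_ss_scaled.mean().round(3))
print(country_ss_scaled.std(ddof=0).round(3))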
You can view the full code here:
So what are the advantages of scaling, apart from what was already discussed above? Two important ones are:
Interpretability of the coefficients of the algorithm is greatly improved. The coefficients can vary widely when the data magnitudes are of different orders, but upon scaling, all the coefficients are on the same scale and become comparable (see the sketch after these two points).
The optimization behind the scenes runs much faster. Gradient descent minimizes a cost function, and finding that minimum is much quicker when all the features are on the same scale.
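To make the first point concrete, here is a small sketch using the country data with a made-up target variable (purely for illustration, not part of the original dataset), showing how scaling puts regression coefficients on a comparable footing:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up target: a linear mix of the two unscaled features plus some noise
rng = np.random.default_rng(42)
y = 2.0 * country['health'] + 0.05 * country['imports'] + rng.normal(size=len(country))

raw_coefs = LinearRegression().fit(country, y).coef_
scaled_coefs = LinearRegression().fit(StandardScaler().fit_transform(country), y).coef_

print('raw:   ', raw_coefs)      # magnitudes reflect the units of each feature
print('scaled:', scaled_coefs)   # magnitudes are directly comparable across features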
When to Use Scaling?
Most distance-based algorithms are very sensitive to magnitudes, as they affect the distances calculated. Examples of algorithms that benefit from scaling are K Nearest Neighbours (KNN), K-means and SVM.
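For instance, with scikit-learn a common pattern is to bundle the scaler and a distance-based model into a pipeline, so scaling is always applied before distances are computed. A minimal sketch (X and y here are placeholders for whatever features and labels you are working with):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Every fit/predict call scales the features first, then KNN measures distances
# on the scaled values, so no single feature dominates the distance calculation.
knn_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn_model.fit(X, y)          # X, y: placeholder feature matrix and labels
# knn_model.predict(X_new)     # X_new: placeholder new data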
However, tree-based algorithms are fairly insensitive to the scale of the features, since each node is split on a single feature at a time and the split is not influenced by the magnitudes of the other features.
References and Further Reading:
Blog by Sebastian Raschka: About Feature Scaling and normalization