Feature Scaling in Machine Learning

Rahull Trehan
4 min read · Jun 7, 2021

In this article, I would like to cover one of the most important but most often forgotten parts of Machine Learning: Feature Scaling.

While working on data preparation or EDA (Exploratory Data Analysis), we focus a lot on the variables/features we have: we treat them for outliers and missing values, and we may even combine multiple features to create new ones as part of feature extraction. But one aspect we almost always leave out of our EDA is Feature Scaling.

Feature Scaling, in very simple words, is adjusting features that are measured on different scales so that no feature biases the result simply because of the scale it happens to be measured on.

Let’s take an example: we have three friends, two of whom know their shirt size, and we want to predict the shirt size of the third. The features/variables we have for these three friends are height (in feet) and weight (in kilograms).

To find the shirt size of the third friend, let's assume that a person's shirt size depends on the sum of their height and weight, and that the third friend gets the size of whichever friend has the closer sum. So, in our case, the sums of height and weight of the three friends are:
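Here is a minimal sketch of that calculation in Python. The heights and weights below are hypothetical numbers chosen for illustration; only their relative scales matter for the argument.

```python
# Hypothetical heights (feet) and weights (kg) for the three friends.
friends = {
    "Friend #1 (L)": {"height": 6.1, "weight": 78.0},
    "Friend #2 (S)": {"height": 5.4, "weight": 55.0},
    "Friend #3 (?)": {"height": 6.0, "weight": 60.0},
}

# Sum of height and weight for each friend.
for name, f in friends.items():
    print(name, f["height"] + f["weight"])
# Friend #1 (L) 84.1
# Friend #2 (S) 60.4
# Friend #3 (?) 66.0  -> closest to Friend #2, so the naive prediction is S
```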

Now, looking at the above outcome, we can easily conclude that the shirt size of the third friend would be Small (S), since their sum is much closer to that of Friend #2 than to Friend #1. But if you think about it logically, you may have a different opinion about this outcome.

The outcome seems incorrect simply because we assumed that shirt size depends on the sum of height and weight, yet we did not consider the scale and range of the two features.

If we look at the feature weight, it ranges from 55 to 78, whereas height ranges only from 5.4 to 6.1. So no matter what, if we just add them up, weight will always dominate the outcome.

Now, this problem can easily be tackled with feature scaling, where we adjust these features/variables so that differences in scale do not bias the result.

The two most common techniques for feature scaling are:

  • Normalization: rescales the data to the range 0 to 1, using the minimum and maximum values of the feature.
  • Standardization: transforms the data to have zero mean and unit standard deviation; unlike normalization, the result is not confined to a fixed range such as 0 to 1. A short sketch of both follows.
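Here is a minimal sketch of both transformations in plain NumPy (scikit-learn's MinMaxScaler and StandardScaler implement the same ideas):

```python
import numpy as np

x = np.array([55.0, 78.0, 60.0])  # e.g. the hypothetical weights from above

# Normalization (min-max scaling): (x - min) / (max - min) -> values in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): (x - mean) / std -> zero mean, unit std
x_std = (x - x.mean()) / x.std()

print(x_norm)  # [0.    1.    0.217...]
print(x_std)   # zero mean, unit std, but not confined to [0, 1]
```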

In our example, let’s use the normalization technique for feature scaling; the new normalized height and weight of each friend would be:
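Continuing the sketch with the same hypothetical numbers:

```python
heights = {"Friend #1": 6.1, "Friend #2": 5.4, "Friend #3": 6.0}
weights = {"Friend #1": 78.0, "Friend #2": 55.0, "Friend #3": 60.0}

def min_max(values):
    # Normalize a dict of values to the range [0, 1].
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

h_norm, w_norm = min_max(heights), min_max(weights)
for name in heights:
    print(name, round(h_norm[name] + w_norm[name], 3))
# Friend #1 2.0
# Friend #2 0.0
# Friend #3 1.074  -> now closer to Friend #1, so the prediction flips to L
```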

And now, applying the same assumption, we get a different outcome: the sum for Friend #3 is much closer to that of Friend #1, and hence the predicted shirt size for Friend #3 would be Large (L).

Now that we have seen how normalizing the data, i.e., feature scaling, can give us better results, let us understand how feature scaling impacts various machine learning algorithms.

The two types of machine learning algorithms directly impacted by feature scaling are distance-based algorithms and gradient descent-based algorithms.

Gradient descent-based algorithms, such as linear regression, logistic regression, and neural networks, use gradient descent to optimize their objective and need the data to be scaled. Since features on different scales produce gradient components of very different magnitudes, each feature effectively gets a different step size, and for gradient descent to move smoothly towards the minimum we need to scale the data.
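A rough sketch of why this matters, using a tiny hand-rolled gradient descent on two features with very different scales (the data here is synthetic and the target coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.uniform(5.4, 6.1, 100)  # small-scale feature (feet)
weight = rng.uniform(55, 78, 100)    # large-scale feature (kg)
y = 2 * height + 0.5 * weight        # target as an arbitrary linear combination

def final_loss(X, y, lr, steps=200):
    # Plain batch gradient descent on mean squared error.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return np.mean((X @ w - y) ** 2)

X_raw = np.column_stack([height, weight])
X_scl = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_c = y - y.mean()  # center the target so no intercept term is needed

print(final_loss(X_scl, y_c, lr=0.1))  # converges to ~0
print(final_loss(X_raw, y_c, lr=0.1))  # overflows: this step size is far too big for raw scales
```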

Similarly, distance-based algorithms like KNN (K-Nearest Neighbours), K-means, and SVM (Support Vector Machines), which use the distances between data points to determine the output, require the data to be scaled. If we feed unscaled data to a distance-based algorithm, one variable can dominate the distance, just like weight dominated the sum in our example above.
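A small sketch of the same effect with Euclidean distances, reusing the hypothetical friends:

```python
import numpy as np

# (height in feet, weight in kg) for the three hypothetical friends
f1 = np.array([6.1, 78.0])  # wears Large
f2 = np.array([5.4, 55.0])  # wears Small
f3 = np.array([6.0, 60.0])  # size unknown

def dist(a, b):
    return np.linalg.norm(a - b)  # Euclidean distance

# Raw features: weight dominates, so f3 looks much closer to f2.
print(dist(f3, f1), dist(f3, f2))  # ~18.0 vs ~5.0

# Min-max scale each feature across the three friends, then re-measure.
X = np.stack([f1, f2, f3])
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(dist(X[2], X[0]), dist(X[2], X[1]))  # ~0.80 vs ~0.88: f3 is now closer to f1
```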

Tree-based algorithms like decision trees are insensitive to feature scaling, since a node is split on one feature at a time: the split threshold simply adapts to whatever scale the feature is on, so scaling has no impact on the resulting tree.
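A quick way to convince yourself, sketched with scikit-learn on synthetic data: fit the same decision tree on raw and min-max scaled features and compare the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(5.4, 6.1, 200),  # height-like feature
                     rng.uniform(55, 78, 200)])   # weight-like feature
y = (X[:, 0] + 0.1 * X[:, 1] > 12.3).astype(int)  # arbitrary labels

X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scl = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Split thresholds differ with the scale, but the tree structure and the
# predictions are the same.
print((tree_raw.predict(X) == tree_scl.predict(X_scaled)).all())  # True
```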

Hope you liked this article. Thank you for reading.
