Prepare Data for Machine Learning using Scikit-Learn



Machine learning algorithms need properly formatted data to make good predictions. Preparing data for a machine learning project can be difficult and often takes most of the time (sometimes up to 80% of it). But it need not be so. In this article, you will learn a practical approach to preparing data for machine learning using Scikit-Learn.

Note: In one of my previous articles I discussed the theoretical aspects of preparing data for machine learning. Be sure to give it a look before you move further. It can help a lot.

Why Prepare Data for Machine Learning?

Data preprocessing (preparation) is one of the inevitable steps in machine learning. Seldom will you get the data already in the required format. That’s why it is really important to prepare the data properly before feeding it into your machine learning algorithm. In this regard, Scikit-Learn can help a lot.

Let’s check out the different ways.

Standardizing the Data

When you standardize data in machine learning, you preprocess it to have a mean of 0 (zero) and unit standard deviation (a standard deviation of 1).

Basically, each feature is shifted to zero mean and scaled to unit standard deviation. Note that this does not change the shape of the distribution into a Gaussian; it only centers and rescales the values, which is exactly what many algorithms that assume zero-centered features expect.

Scikit-Learn’s StandardScaler class can help to achieve this.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# define the data
data = [[12, 21, 22], 
        [5, 7, 11], 
        [19, 15, 12]]

# fit the scaler to the data (fit returns the scaler itself)
fit_data = scaler.fit(data)
print(fit_data)

# transform the data
transform_data = scaler.transform(data)
print(transform_data)
StandardScaler(copy=True, with_mean=True, with_std=True)
[[ 0.          1.16247639  1.40942772]
 [-1.22474487 -1.27872403 -0.80538727]
 [ 1.22474487  0.11624764 -0.60404045]]
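As a quick sanity check (a small sketch, not part of the original pipeline), we can verify with NumPy that each column of the standardized data really has a mean of approximately 0 and a standard deviation of 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[12, 21, 22],
        [5, 7, 11],
        [19, 15, 12]]

# fit_transform combines fit and transform in one call
scaled = StandardScaler().fit_transform(data)

# column-wise mean should be ~0 and standard deviation ~1
print(np.allclose(scaled.mean(axis=0), 0))  # True
print(np.allclose(scaled.std(axis=0), 1))   # True
```

The fit_transform shorthand is equivalent to calling fit followed by transform on the same data.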

You can see that the with_mean and with_std parameters are set to True by default. For a sparse matrix, you must set with_mean=False, because subtracting the mean would turn all the zero entries into non-zero values and destroy the sparsity.
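To illustrate the sparse-matrix case, here is a minimal sketch (using SciPy, which Scikit-Learn already depends on) that scales a sparse matrix without centering it:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# a small sparse matrix (mostly zeros)
sparse_data = csr_matrix(np.array([[0., 4., 0.],
                                   [2., 0., 0.],
                                   [0., 0., 6.]]))

# with_mean=False skips centering, so the zero entries stay zero
scaler = StandardScaler(with_mean=False)
scaled = scaler.fit_transform(sparse_data)
print(scaled.toarray())
```

The result is still sparse: only the non-zero entries were divided by each column's standard deviation.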

Rescaling the Data

Rescaling of data is another important preprocessing step in machine learning. In most cases, when you collect data, the attributes will have varying scales. Such data may give poor results when used directly.

We should therefore rescale the data so that all values range from 0 to 1.

We can use the MinMaxScaler class from Scikit-Learn for this.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# define the data
data = [[12, -21, 22], 
        [5, 7, 11], 
        [-19, 15, 12]]

# fit the scaler to the data (fit returns the scaler itself)
fit_data = scaler.fit(data)
print(fit_data)

# transform the data
transform_data = scaler.transform(data)
print(transform_data)
MinMaxScaler(copy=True, feature_range=(0, 1))
[[1.         0.         1.        ]
 [0.77419355 0.77777778 0.        ]
 [0.         1.         0.09090909]]

After rescaling, all the values range between 0 and 1.
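The range does not have to be 0 to 1. MinMaxScaler accepts a feature_range parameter; here is a short sketch mapping the same data to the range -1 to 1 instead:

```python
from sklearn.preprocessing import MinMaxScaler

data = [[12, -21, 22],
        [5, 7, 11],
        [-19, 15, 12]]

# rescale each column to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)
print(scaled)
```

Each column's minimum now maps to -1 and its maximum to 1, which some models (for example, those with tanh activations) prefer.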

Normalizing the Data

When normalizing data using Scikit-Learn, each sample (each row) is rescaled independently so that its l1 or l2 norm equals 1. This is a very common method when dealing with text classification or clustering problems.

Let’s see a simple example.

from sklearn.preprocessing import Normalizer

import numpy as np

normalizer = Normalizer()

# define the data
data = [[3, 1, 6, 6], 
        [2, 6, 5, 2], 
        [12, 12, 7, 7]]
        
# fit the data (Normalizer is stateless, so fit learns nothing)
fit_data = normalizer.fit(data)
print(fit_data)

# normalize the data
normalize_data = normalizer.transform(data)
np.set_printoptions(precision=1)
print(normalize_data)
Normalizer(copy=True, norm='l2')
[[0.3 0.1 0.7 0.7]
 [0.2 0.7 0.6 0.2]
 [0.6 0.6 0.4 0.4]]

Binarizing the Data

When you binarize data, every value gets converted to either 0 or 1 depending on the threshold that you give.

Values greater than the threshold are converted to 1, and values less than or equal to the threshold are converted to 0.

Using Scikit-Learn’s Binarizer class, this process becomes really simple.

from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0)

# define the data
data = [[-1, 1, 4, -9], 
        [2, 3, -3, 7], 
        [0, 1, -2, -1]]

# fit the data (Binarizer is stateless; fit only validates the input)
fit_data = binarizer.fit(data)
print(fit_data)

# binarize the data
binary = binarizer.transform(data)
print(binary)
Binarizer(copy=True, threshold=0.0)
[[0 1 1 0]
 [1 1 0 1]
 [0 1 0 0]]

Now all the values are either 0 or 1 based on the threshold.
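The threshold does not have to be 0. Here is a short sketch using the same data with a threshold of 2, so that only values strictly greater than 2 become 1:

```python
from sklearn.preprocessing import Binarizer

data = [[-1, 1, 4, -9],
        [2, 3, -3, 7],
        [0, 1, -2, -1]]

# values greater than 2 map to 1; values <= 2 (including 2 itself) map to 0
binarizer = Binarizer(threshold=2)
binary = binarizer.fit_transform(data)
print(binary)
# [[0 0 1 0]
#  [0 1 0 1]
#  [0 0 0 0]]
```

Notice that the value 2 itself maps to 0, since only values strictly above the threshold become 1.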

Conclusion

In this article, you learned four simple yet effective methods to prepare data for machine learning algorithms. I hope there was at least something to take away from this post. Share, like and subscribe to the newsletter.
You can follow me on Twitter and Facebook to get regular updates about future posts. I am on LinkedIn as well.
