Machine learning algorithms need properly formatted data to give good predictions. Preparing data for a machine learning project can be difficult and often takes most of the project time (sometimes up to 80%). But it need not be so. In this article, you will learn a practical approach to preparing data for machine learning using Scikit-Learn.
Note: In one of my previous articles, I discussed the theoretical aspects of preparing data for machine learning. Be sure to give it a look before you move further. It can help a lot.
Why Prepare Data for Machine Learning?
Data preprocessing (preparation) is one of the unavoidable steps in machine learning. You will seldom get data that is already in the required format. That’s why it is important to prepare the data properly before feeding it into your machine learning algorithm. In this regard, Scikit-Learn can help a lot.
Let’s check out the different ways.
Standardizing the Data
When you standardize data in machine learning, you preprocess each feature (each column) so that it has a mean of 0 (zero) and unit standard deviation (standard deviation of 1).
Note that standardization only shifts and scales the values. It does not change the shape of the distribution, so the data does not become Gaussian unless it was Gaussian to begin with.
Scikit-Learn’s StandardScaler class can help to achieve this.
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # define the data
    data = [[12, 21, 22], [5, 7, 11], [19, 15, 12]]

    # fit the scaler to the data
    fit_data = scaler.fit(data)
    print(fit_data)

    # transform the data
    transform_data = scaler.transform(data)
    print(transform_data)
    StandardScaler(copy=True, with_mean=True, with_std=True)
    [[ 0.          1.16247639  1.40942772]
     [-1.22474487 -1.27872403 -0.80538727]
     [ 1.22474487  0.11624764 -0.60404045]]
You can see that the with_mean and with_std parameters are set to True by default. For a sparse matrix, you need to set with_mean to False, because centering the data would destroy its sparsity and StandardScaler raises an error if asked to do so.
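To make the arithmetic behind StandardScaler concrete, here is a minimal sketch that reproduces the result by hand with NumPy and then scales a sparse copy of the same data without centering. It assumes NumPy and SciPy are installed (both are Scikit-Learn dependencies); the variable names are only illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# same data as in the example above
data = np.array([[12, 21, 22], [5, 7, 11], [19, 15, 12]], dtype=float)

# standardize by hand: subtract each column's mean, divide by its standard deviation
# (np.std uses the population standard deviation by default, as StandardScaler does)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

scaler = StandardScaler().fit(data)
scaled = scaler.transform(data)

# the hand-computed values match the scaler's output
print(np.allclose(manual, scaled))

# the statistics learned by fit() are stored on the scaler
print(scaler.mean_)   # per-column means
print(scaler.scale_)  # per-column standard deviations

# sparse input: skip centering so that zero entries stay zero
sparse_data = sparse.csr_matrix(data)
sparse_scaled = StandardScaler(with_mean=False).fit_transform(sparse_data)
print(sparse.issparse(sparse_scaled))
```

With with_mean=False the values are only divided by the column standard deviation, so the columns end up with unit variance but not zero mean.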
Rescaling the Data
Rescaling data is another important preprocessing step in machine learning. In most cases, the attributes of the data you collect will have different scales. Such data may give poor results when used directly, because attributes with large values can dominate the others.
We should therefore rescale the data so that the values of every attribute range from 0 to 1.
We can use the MinMaxScaler class from Scikit-Learn for this.
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()

    # define the data
    data = [[12, -21, 22], [5, 7, 11], [-19, 15, 12]]

    # fit the scaler to the data
    fit_data = scaler.fit(data)
    print(fit_data)

    # transform the data
    transform_data = scaler.transform(data)
    print(transform_data)
    MinMaxScaler(copy=True, feature_range=(0, 1))
    [[1.         0.         1.        ]
     [0.77419355 0.77777778 0.        ]
     [0.         1.         0.09090909]]
After rescaling, all the values lie between 0 and 1. In each column, the smallest value becomes 0 and the largest becomes 1.
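The rescaling itself is simple arithmetic: subtract the column minimum and divide by the column range. The sketch below checks that against MinMaxScaler, tries a different feature_range, and transforms a new sample. The new_sample values are hypothetical and chosen only to show that data not seen during fit can land outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# same data as in the example above
data = np.array([[12, -21, 22], [5, 7, 11], [-19, 15, 12]], dtype=float)

# rescale by hand: (x - column minimum) / (column maximum - column minimum)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

scaler = MinMaxScaler().fit(data)
print(np.allclose(manual, scaler.transform(data)))

# feature_range selects a different target interval, here -1 to 1
wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)
print(wide.min(axis=0), wide.max(axis=0))

# a hypothetical new sample that was not part of the fitted data
new_sample = np.array([[20, 0, 5]], dtype=float)
new_scaled = scaler.transform(new_sample)
print(new_scaled)  # first value is above 1, last value is below 0
```

The scaler reuses the minimum and maximum it learned during fit, so it is usually fitted on the training data only and then applied unchanged to any later data.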
Normalizing the Data
When you normalize data with Scikit-Learn, each sample (each row) is rescaled independently so that its l1 or l2 norm equals 1. This is a very common method when dealing with text classification or clustering problems.
Let’s see a simple example.
    from sklearn.preprocessing import Normalizer
    import numpy as np

    normalizer = Normalizer()

    # define the data
    data = [[3, 1, 6, 6], [2, 6, 5, 2], [12, 12, 7, 7]]

    # fit the data
    fit_data = normalizer.fit(data)
    print(fit_data)

    # normalize the data
    normalize_data = normalizer.transform(data)
    np.set_printoptions(precision=1)
    print(normalize_data)
    Normalizer(copy=True, norm='l2')
    [[0.3 0.1 0.7 0.7]
     [0.2 0.7 0.6 0.2]
     [0.6 0.6 0.4 0.4]]
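As a quick check of what "norm is 1" means, the sketch below normalizes the same rows with both the l2 and the l1 norm and measures the row norms afterwards. It is only an illustration; the manual_l2 line shows the equivalent NumPy arithmetic for the l2 case.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# same data as in the example above
data = np.array([[3, 1, 6, 6], [2, 6, 5, 2], [12, 12, 7, 7]], dtype=float)

# l2: divide each row by its Euclidean length
l2_rows = Normalizer(norm="l2").fit_transform(data)
# l1: divide each row by the sum of its absolute values
l1_rows = Normalizer(norm="l1").fit_transform(data)

# every row now has norm 1 under the chosen norm
print(np.linalg.norm(l2_rows, axis=1))
print(np.abs(l1_rows).sum(axis=1))

# the l2 case by hand
manual_l2 = data / np.linalg.norm(data, axis=1, keepdims=True)
print(np.allclose(manual_l2, l2_rows))
```

Unlike StandardScaler and MinMaxScaler, Normalizer works row by row and learns nothing from the data, so its fit step does no computation.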
Binarizing the Data
When you binarize the data, every value is converted to either 0 or 1 depending on the threshold that you give.
Values less than or equal to the threshold are converted to 0, and values above the threshold are converted to 1.
With Scikit-Learn’s Binarizer class, this process becomes really simple.
    from sklearn.preprocessing import Binarizer

    binarizer = Binarizer(threshold=0.0)

    # define the data
    data = [[-1, 1, 4, -9], [2, 3, -3, 7], [0, 1, -2, -1]]

    # fit the data
    fit_data = binarizer.fit(data)
    print(fit_data)

    # binarize the data
    binary = binarizer.transform(data)
    print(binary)
    Binarizer(copy=True, threshold=0.0)
    [[0 1 1 0]
     [1 1 0 1]
     [0 1 0 0]]
Now all the values are either 0 or 1 based on the threshold.
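The threshold comparison is strict: a value equal to the threshold becomes 0. The following sketch uses a threshold of 1.0 (an arbitrary choice for illustration) and compares the result with the equivalent NumPy expression.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# same data as in the example above
data = np.array([[-1, 1, 4, -9], [2, 3, -3, 7], [0, 1, -2, -1]])

# with threshold=1.0, the values equal to 1 are mapped to 0
binary = Binarizer(threshold=1.0).fit_transform(data)
print(binary)

# the same rule written as a NumPy comparison: 1 where value > threshold, else 0
manual = (data > 1.0).astype(int)
print(np.array_equal(binary, manual))
```

Like Normalizer, Binarizer has nothing to learn from the data; the fit call is there so the class can be used as a step alongside other Scikit-Learn transformers.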
Conclusion
In this article, you learned four simple yet very effective methods to prepare the data for machine learning algorithms. I hope that there was at least something to take away from this post. Share, like and subscribe to the newsletter.