In machine learning, we seldom get datasets that are perfect from the beginning. It almost never happens that we can take the data and feed it straight into our machine learning algorithms. In fact, training is one of the final steps. Before we can use the data, we need to clean it, preprocess it, and handle any missing values. In this article, we will learn how to handle missing values in machine learning. This is one of the most important steps in any machine learning project pipeline.
Why Handle Missing Values?
First of all, missing values cause a lot of problems when using the dataset for any machine learning algorithm.
Even if we set the algorithm perspective aside, missing values are really undesirable. They hinder data analysis and data visualization, both of which are very important steps in a machine learning project pipeline.
Now, coming back to the machine learning algorithm part. Most of the time, we use libraries like Scikit-Learn, Keras, and TensorFlow, or some other popular library. The classic algorithms and neural networks in these libraries cannot handle missing values on their own. We are sure to get an error when feeding unclean data into them.
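To make this concrete, here is a minimal sketch (the tiny arrays are made up purely for illustration) of what happens when NaN values reach a Scikit-Learn estimator:

import numpy as np
from sklearn.linear_model import LinearRegression

# a tiny feature matrix with one missing value
X = np.array([[1.0], [2.0], [np.nan]])
y = np.array([10.0, 20.0, 30.0])

# raises ValueError because the input contains NaN
LinearRegression().fit(X, y)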
The above are some of the reasons for handling missing values. Now, let’s move further.
The Dataset
We will be using the Melbourne Housing Market data along the way. You can download the data from Kaggle.
In this dataset, there are two files. One is MELBOURNE_HOUSE_PRICES_LESS.csv and the other is Melbourne_housing_FULL.csv. We will use Melbourne_housing_FULL.csv in this article. This file has 34857 instances and 21 columns.
So, let’s start.
Load and Analyze the Data
In this section, we will load the Melbourne_housing_FULL.csv data and analyze it a bit in its original form.
import pandas as pd

data = pd.read_csv('Melbourne_housing_FULL.csv')
print(data.shape)
(34857, 21)
We have 34857 instances in the original dataset. There are 21 features.
The following are the feature columns of the dataset that we get after printing the columns using data.columns.
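For reference, this is the one-line call that produces the listing below:

print(data.columns)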
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
The following is a short head of the data, showing some of its columns and values.
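This is simply the output of Pandas’ head() method, which shows the first five rows by default:

print(data.head())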
       Suburb             Address  Rooms Type      Price Method SellerG  \
0  Abbotsford       68 Studley St      2    h        NaN     SS  Jellis
1  Abbotsford        85 Turner St      2    h  1480000.0      S  Biggin
2  Abbotsford     25 Bloomburg St      2    h  1035000.0      S  Biggin
3  Abbotsford  18/659 Victoria St      3    u        NaN     VB  Rounds
4  Abbotsford        5 Charles St      3    h  1465000.0     SP  Biggin

        Date  Distance  Postcode  ...  Bathroom  Car  Landsize  BuildingArea  \
0  3/09/2016       2.5    3067.0  ...       1.0  1.0     126.0           NaN
1  3/12/2016       2.5    3067.0  ...       1.0  1.0     202.0           NaN
2  4/02/2016       2.5    3067.0  ...       1.0  0.0     156.0          79.0
3  4/02/2016       2.5    3067.0  ...       2.0  1.0       0.0           NaN
4  4/03/2017       2.5    3067.0  ...       2.0  0.0     134.0         150.0
...
This gives us a good amount of perspective on what type of values we can expect.
Now, let’s take a look at the DataFrame description of the data. This gives us a statistical summary of the numerical columns in the DataFrame. The description will help us understand the typical values of different columns, which can also provide vital clues on how to handle the missing values in the data.
print(data.describe())
              Rooms         Price      Distance      Postcode      Bedroom2  \
count  34857.000000  2.724700e+04  34856.000000  34856.000000  26640.000000
mean       3.031012  1.050173e+06     11.184929   3116.062859      3.084647
std        0.969933  6.414671e+05      6.788892    109.023903      0.980690
min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000
25%        2.000000  6.350000e+05      6.400000   3051.000000      2.000000
50%        3.000000  8.700000e+05     10.300000   3103.000000      3.000000
75%        4.000000  1.295000e+06     14.000000   3156.000000      4.000000
max       16.000000  1.120000e+07     48.100000   3978.000000     30.000000

           Bathroom           Car      Landsize  BuildingArea     YearBuilt  \
count  26631.000000  26129.000000  23047.000000   13742.00000  15551.000000
mean       1.624798      1.728845    593.598993     160.25640   1965.289885
std        0.724212      1.010771   3398.841946     401.26706     37.328178
min        0.000000      0.000000      0.000000       0.00000   1196.000000
25%        1.000000      1.000000    224.000000     102.00000   1940.000000
50%        2.000000      2.000000    521.000000     136.00000   1970.000000
75%        2.000000      2.000000    670.000000     188.00000   2000.000000
max       12.000000     26.000000 433014.000000   44515.00000   2106.000000

          Lattitude    Longtitude  Propertycount
count  26881.000000  26881.000000   34854.000000
mean     -37.810634    145.001851    7572.888306
std        0.090279      0.120169    4428.090313
min      -38.190430    144.423790      83.000000
25%      -37.862950    144.933500    4385.000000
50%      -37.807600    145.007800    6763.000000
75%      -37.754100    145.071900   10412.000000
max      -37.390200    145.526350   21650.000000
We can see that many columns, like Bathroom, Landsize, BuildingArea, and a few others, have a min of zero. But if a row really describes a house, those columns should not all be zero. More tellingly, the count values differ from column to column: Price, for example, has only 27247 non-missing entries out of 34857 rows. This points to NaN values. NaNs indicate that a value is missing. But how can we confirm which columns contain them?
In our dataset, we have two types of data: numerical and object. Object columns can hold string or categorical values.
Let’s segregate the missing values on the basis of the column data type. First, let’s print the names of all the columns with missing values where the column type is object. We can use pandas’ dtypes to do so.
# segregate missing values by object data type
for col in data.columns:
    if data[col].dtypes == 'object' and data[col].isna().any():
        print(col)
CouncilArea
Regionname
In the above block of code, we have used a simple if condition to check for the object data type along with columns having NaN values. It gives us two column names.
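If we also want the counts rather than just the names, we can sum the isna() mask over those columns; a small addition, assuming the same data DataFrame:

# count the missing values in the two object columns
print(data[['CouncilArea', 'Regionname']].isna().sum())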
Now, we can do the same to find columns with missing values that have the float64 data type. The following block of code accomplishes that.
# segregate missing values by float64 data type
for col in data.columns:
    if data[col].dtypes == 'float64' and data[col].isna().any():
        print(col)
Price
Distance
Postcode
Bedroom2
Bathroom
Car
Landsize
BuildingArea
YearBuilt
Lattitude
Longtitude
Propertycount
Looks like our suspicion was correct. Many numerical columns, including the ones we analyzed in the DataFrame description, have missing values.
Ok, we are ready to start working on the missing values. Keep in mind that there are many ways to handle missing values in data. Sometimes, we have to choose an approach that is specific to the dataset at hand.
Handling Missing Values
Deleting Rows
Deleting rows is the easiest step that we can take while handling missing values. In this approach, we simply drop all the rows which contain a NaN value in any of the columns.
There are some obvious problems associated with this method. Before discussing those, let’s write the code for dropping the rows with missing values.
First, we will create a copy of the data and keep the original DataFrame as it is. This is good practice, as we would always like to have an unaltered original with us.
# creating a copy
data_copy1 = data.copy()

# drop rows with missing values
data_copy1 = data_copy1.dropna(axis=0)
We simply drop all the rows with missing values using the dropna() method. We have passed axis=0 as the argument, which tells Pandas to act on the rows rather than the columns.
Now, coming back to the problem associated with this method. If the count of missing values in the data is very high, then deleting those rows can lead to a significant decrease in instances. Let’s take a look at the data shape after dropping the rows.
print(data_copy1.shape)
(8887, 21)
In the original data, we had around 35000 instances. After dropping the rows with missing values, we have only about 9000 instances left. Most of the rows had at least one missing value. This can cause serious issues further on.
We have very little data left to train and test our model, and we may even have dropped rows that contained vital information in other columns. With less data, our machine learning model may not be able to learn all the important patterns. Therefore, although easy, this method is rarely followed in practice.
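As a side note, dropna() also offers a middle ground through its thresh argument, which keeps only the rows that have at least a given number of non-missing values. The threshold of 18 below is an arbitrary choice for illustration:

# keep rows with at least 18 of the 21 values present
data_thresh = data.copy()
data_thresh = data_thresh.dropna(axis=0, thresh=18)
print(data_thresh.shape)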
Missing Value Imputation
One of the most common and better approaches is imputation of missing values. Here, we fill the NaN values with either the mean, median, or mode of the particular column.
But we need to follow different approaches for numerical and categorical columns. Usually, we use the mean or median to fill numerical columns and the mode to fill categorical columns.
Numerical Value Imputation
We will handle the imputation of numerical values first and then move on to categorical data. To fill the missing values in a column, we will use the median value of that particular column.
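Why the median rather than the mean? Looking back at the describe() output, Landsize has a median of 521 but a maximum of 433014; such extreme outliers drag the mean far above typical values, while the median stays robust. A quick check, assuming the same data DataFrame:

# the mean is inflated by outliers, the median is not
print(data['Landsize'].mean())
print(data['Landsize'].median())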
Before moving further, let’s create a new copy of the data that we can work upon.
data_copy2 = data.copy()
We will create a list containing all the column names that have missing values in them. We will then iterate through the list and replace each of the missing values in the columns with the median value of that column.
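Instead of typing the names by hand, we could also build this list by reusing the float64 check from earlier; a small sketch, assuming data_copy2 (it yields the same list written out in the next block):

# collect float64 columns that still contain missing values
impute_cols = [col for col in data_copy2.columns
               if data_copy2[col].dtypes == 'float64' and data_copy2[col].isna().any()]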
The following block of code fills the numerical column missing values with their median.
impute_cols = ['Price', 'Distance', 'Postcode', 'Bedroom2',
               'Bathroom', 'Car', 'Landsize', 'BuildingArea',
               'YearBuilt', 'Lattitude', 'Longtitude', 'Propertycount']

for col in impute_cols:
    data_copy2[col].fillna(data_copy2[col].median(), inplace=True)
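One caveat: newer versions of Pandas discourage calling fillna() with inplace=True on a single column selection, and under copy-on-write it may warn or fail to modify the parent DataFrame. If you run into that, the assignment form is a safe equivalent:

# assignment-based form that avoids inplace=True
for col in impute_cols:
    data_copy2[col] = data_copy2[col].fillna(data_copy2[col].median())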
Categorical Value Imputation
Next up, we have only two categorical columns with missing values. To fill the missing values in those categorical columns, we will use their mode values.
for col in ['CouncilArea', 'Regionname']:
    data_copy2[col].fillna(data_copy2[col].mode()[0], inplace=True)
It is always better to see whether we have actually filled up all the NaN values or not.
print(data_copy2.isna().sum())
Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
The sum of the missing values in all the columns is zero. That is a good sign as it means that there are no missing values in any of the columns.
As of now, we have handled all the missing values in the dataset, both numerical and categorical.
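As a final sanity check, it can be worth comparing a heavily imputed column before and after filling. BuildingArea had more than 21000 missing values, so replacing them all with a single median noticeably compresses the distribution; something to keep in mind when judging imputation quality. A quick comparison, assuming both DataFrames are still in memory:

# compare the distribution before and after imputation
print(data['BuildingArea'].describe())
print(data_copy2['BuildingArea'].describe())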
Summary and Conclusion
In this article, you learned how to handle and impute missing values in a dataset for machine learning. We used only Python and Pandas to carry out all the imputation operations, so our approach was more of a manual one.
More methods for handling missing values in machine learning:
There are many other ways to impute missing values when cleaning data in machine learning. You can use SimpleImputer from Scikit-Learn, which provides a lot of flexibility. And if you want to know the theoretical aspects of missing data, you can visit this: Missing Data by Wikipedia.
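As a taste of that, here is a rough sketch of the same median and mode imputation done with SimpleImputer; the column lists are the ones we used earlier, and this is one possible setup rather than a drop-in replacement for the code above:

import pandas as pd
from sklearn.impute import SimpleImputer

data_copy3 = pd.read_csv('Melbourne_housing_FULL.csv')

num_cols = ['Price', 'Distance', 'Postcode', 'Bedroom2',
            'Bathroom', 'Car', 'Landsize', 'BuildingArea',
            'YearBuilt', 'Lattitude', 'Longtitude', 'Propertycount']
cat_cols = ['CouncilArea', 'Regionname']

# median for numerical columns, mode for categorical columns
data_copy3[num_cols] = SimpleImputer(strategy='median').fit_transform(data_copy3[num_cols])
data_copy3[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(data_copy3[cat_cols])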
I hope that you liked this article. Subscribe to the website to get more content each week. You can also find me on LinkedIn and Twitter.