Hyperparameter Tuning in Random Forests


Random Forests are powerful ensemble machine learning algorithms that can perform both classification and regression. They work quite well on large and complex datasets and can achieve high accuracy scores. But we can improve these results even further. Therefore, in this article, we will learn how to perform hyperparameter tuning in random forests.

What will you learn in this article?
1. How to use the Random Forest Regressor in Scikit-Learn.
2. How to predict the chance of graduate admission using the Graduate Admission dataset from Kaggle.
3. How to perform Random Search to get the best parameters for random forests.

Note: If you want to get a bit more familiar with how Random Forests work, then you can visit one of my previous articles. That can help you get started with Random Forests.

Why do We Need Hyperparameter Tuning?

In random forests, there are a number of hyperparameters available. Although we can get good results without changing these parameters, some of them have a great impact on the output of our classifier or regressor. But we do not want to search for and test the optimal values of these hyperparameters manually. Such a trial-and-error method can take a lot of time.

Therefore, we have methods like RandomizedSearchCV and GridSearchCV, which fine-tune the hyperparameters for us by finding the best values. On top of that, both are implemented in Scikit-Learn.

Now, let’s get started. Open your Python notebook and be sure to download the dataset from here first. If you want, you can also work within the Kaggle notebook itself.

Loading the Dataset

First, we have to import some general modules and load the dataset as well.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score
np.random.seed(42)

We also define a random seed in the above code snippet for reproducibility.

The next line of code loads the dataset.

# on local machine
train = pd.read_csv('Admission_Predict_Ver1.1.csv')

If you are using the Kaggle kernel, then you may need to load the dataset by using the following path instead.

train = pd.read_csv('../input/graduate-admissions/Admission_Predict_Ver1.1.csv')
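
As a quick sanity check, we can confirm that the data loaded correctly. The 400 + 100 split we obtain later implies 500 rows, and the dtype listing below shows 9 columns.

# quick sanity check: expect (500, 9) for this dataset
print(train.shape)
print(train.head())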

Analyze and Prepare the Data

In this section, we will check whether there are any missing values in the data that we need to take care of. We will also prepare the training and testing sets.

First, let’s start by checking the data types. That will tell us whether we need to deal with any categorical data or not.

# check data types
print(train.dtypes)
Serial No.             int64
GRE Score              int64
TOEFL Score            int64
University Rating      int64
SOP                  float64
LOR                  float64
CGPA                 float64
Research               int64
Chance of Admit      float64
dtype: object

All the features are either in integer format or in float format. So, we do not need to worry about any categorical conversion here.

Next, we can check whether any of the features contain missing values. We can use the isna() method for that.

# check for missing values
print(train.isna().sum())
Serial No.           0 
GRE Score            0 
TOEFL Score          0 
University Rating    0 
SOP                  0 
LOR                  0 
CGPA                 0 
Research             0 
Chance of Admit      0 
dtype: int64

As we can see, there are no missing values in any of the features. We can proceed further and convert each of the numerical values into float64 format. Currently, the numerical values are either float64 or int64, but it is cleaner to use a single data type for all the features.

# convert to float type
train = train.astype(float)
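
To verify that the conversion worked, we can check the remaining dtypes; every column should now report float64.

# verify the conversion: every column should now be float64
print(train.dtypes.unique())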

This covers almost all of the data analysis and preprocessing part.

Before we apply the Random Forest algorithm to our data, we need to split the data into training and testing sets. We will need the test set for evaluation purposes.

To split the data into train and test sets, we will use the train_test_split() function from Scikit-Learn. We will use 80% of the data for training and 20% for testing our model. Let’s write the code for obtaining the train and test sets.

# train-test split
# note: the 'Chance of Admit ' column name ends with a space in this dataset
x_train_new = train.drop(['Serial No.', 'Chance of Admit '], axis=1)
y_train_new = train['Chance of Admit ']

x_train, x_test, y_train, y_test = train_test_split(x_train_new, y_train_new, train_size=0.8)
print(x_train.shape)
print(x_test.shape)
(400, 7)
(100, 7)

After splitting, we have 400 samples in the training set and 100 samples in the testing set.

Random Forest Regressor

As we need to predict the chance of graduate admission, we will use RandomForestRegressor(). This is because the chance is a real value, so a regression algorithm needs to be applied.

First, let’s import and initialize the RandomForestRegressor() from Scikit-Learn.

from sklearn.ensemble import RandomForestRegressor
# n_jobs=-1 uses all available CPU cores; random_state makes the results reproducible
rf = RandomForestRegressor(random_state=42, n_jobs=-1)

Baseline Algorithm

Before applying Randomized Search to our data, we can first check how a baseline model performs without any parameter tuning. This will give us a good idea of whether our model performs better or worse with parameter tuning.

We need to fit our algorithm to the data. The next line of code does that.

rf = rf.fit(x_train, y_train)

When we do not apply any hyperparameter tuning, random forest uses its default parameters for fitting the data. We can check those parameter values by printing get_params.

print(rf.get_params)
<bound method BaseEstimator.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1, oob_score=False, random_state=42, verbose=0, warm_start=False)>

The above are the parameters of the base estimator. (These defaults are from the Scikit-Learn version used here; newer versions have changed some of them, for example n_estimators now defaults to 100.) If we predict on the test set now using rf, then these hyperparameter values will be used. It is a good idea to obtain a baseline prediction first so that we can monitor the difference when using RandomizedSearchCV.

The following code predicts the new values on the test set.

predictions = rf.predict(x_test)

We will use explained_variance_score to score our predicted values. For explained_variance_score, 0.0 is the worst value and 1.0 is the best, so we want the score to be as close to 1.0 as possible.

score = explained_variance_score(y_test, predictions)
print(score)
0.7772544938875305
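
Under the hood, the explained variance is 1 - Var(y_true - y_pred) / Var(y_true). As a minimal check, we can reproduce the sklearn score by hand with NumPy:

# explained variance computed manually; should match the sklearn score above
manual_score = 1 - np.var(y_test - predictions) / np.var(y_test)
print(manual_score)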

The score is around 0.77, which can be considered reasonably good. Still, let’s try RandomizedSearchCV and see whether we can improve the score.

Randomized Search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

In the above code block, we import RandomizedSearchCV from Scikit-Learn and randint from SciPy. The randint distribution lets RandomizedSearchCV sample random integer values for the parameters that we specify. It will be clearer after we define the parameter distribution that we want to use in our randomized search.

Let’s write the code to create the parameter distribution.

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, x_train.shape[1]),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "n_estimators": sp_randint(100, 500)}

In the above code block, we have the following parameters (see the sampling sketch after this list):
max_depth: the maximum depth of the trees that will be formed.
max_features: the number of features to consider at each split. This will be a random integer between 1 and 6, because randint's upper bound is exclusive and x_train.shape[1] is 7.
min_samples_split: the minimum number of samples required to split a node. It will be a random integer between 2 and 10, as the upper bound of 11 is again exclusive.
bootstrap: whether or not to use bootstrap samples when building trees. If this is False, then the whole dataset will be used to build each tree.
n_estimators: the number of trees to build in the random forest, sampled between 100 and 499.
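
To see what these distributions actually draw, here is a minimal sketch that samples a few values from each (the random_state values are only for illustration):

# randint's upper bound is exclusive: randint(2, 11) draws from 2..10
print(sp_randint(2, 11).rvs(size=5, random_state=42))
# randint(1, 7) draws from 1..6, never 7
print(sp_randint(1, x_train.shape[1]).rvs(size=5, random_state=42))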

Now, as we are done with the parameter distribution, we are ready to initialize and use RandomizedSearchCV on the training set.

random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=10, cv=5, iid=False, random_state=42)
random_search.fit(x_train, y_train)

First, we initialize RandomizedSearchCV, which will try 10 random parameter combinations (n_iter=10) with 5-fold cross-validation, so 50 model fits in total. Then we fit it on the training dataset. (If you are on Scikit-Learn 0.24 or newer, drop the iid argument, as it has been removed.)

Printing the best parameters, we get,

print(random_search.best_params_)
{'bootstrap': True, 'max_depth': 3, 'max_features': 6, 'min_samples_split': 6, 'n_estimators': 357}
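
If you also want the mean cross-validated score that this combination achieved on the training folds, it is stored in best_score_ (by default, this is the regressor's R² score):

# mean cross-validated score (R² by default) of the best parameter combination
print(random_search.best_score_)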

When we predict values using random_search, the above parameters will be used automatically. Let’s see whether we get a better score than the baseline prediction.

y_preds = random_search.predict(x_test)
print(explained_variance_score(y_test, y_preds))
0.8112329468704592

This time the score is 0.81, which is a big improvement over the baseline prediction. Looks like our random search cross-validation worked really well.

If you still want improvement, you can go for Grid Search next. A common approach is to search a small grid around the best values found by the random search; most of the time, it helps to improve the prediction scores even further. A sketch of this follows.
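
Here is a minimal sketch of what that could look like. The exact grid values below are an assumption, chosen as a narrow neighborhood around the random search results above:

from sklearn.model_selection import GridSearchCV

# an illustrative grid around the values found by the random search
param_grid = {"max_depth": [3, 5, None],
              "max_features": [5, 6, 7],
              "min_samples_split": [4, 6, 8],
              "n_estimators": [300, 350, 400]}

grid_search = GridSearchCV(rf, param_grid=param_grid, cv=5)
grid_search.fit(x_train, y_train)
print(grid_search.best_params_)
print(explained_variance_score(y_test, grid_search.predict(x_test)))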

Summary and Conclusion

I hope that you learned how to perform a random search cross-validation in this article. If so, then consider sharing this with others as well.

If you liked this article, then subscribe to the website to get timely updates about new articles. You can follow me on LinkedIn and Twitter as well.


