One Hot Encode in Machine Learning


Data preparation or processing is one of the most important steps when working with real-world data on a machine learning project.

One of the major pain points in such situations is working with categorical data. Most machine learning algorithms cannot work with categorical data directly, so it needs to be converted to numerical data first. One-hot encoding is a very good way to handle categorical data.

In this article, we will see what one-hot encoding is and where to use it.

Categorical Data

So, what is categorical data actually?

Simply speaking, in categorical data the values are labels instead of numbers.

Take the following case for example.

When you want to categorize ‘salary’ in a data set, you may label it as ‘low’, ‘medium’, or ‘high’.

Table showing salary category

Similarly, if you want to label the ‘shape’ of objects, you could use something like ‘round’, ‘square’, ‘triangular’.

The Problem with Categorical Data

You may be thinking that labeling values this way seems fair and reasonable, and you are right. But when machine learning algorithms come into the picture, the situation changes.

Most machine learning algorithms cannot handle categorical labels in a data set directly. Whether the task is classification or regression, these algorithms need numerical data to make predictions.
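To see why numbers are needed, here is a minimal sketch, assuming a simple linear model that scores a sample with a weighted sum. The function `linear_score` and the weights are hypothetical and only serve as an illustration.

```python
def linear_score(features, weights):
    """Weighted sum of the features, the core computation of a linear model."""
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical weights, chosen only for this illustration.
weights = [0.5, 2.0]

# Numerical features work as expected.
print(linear_score([3.0, 1.0], weights))  # 3.5

# A categorical label cannot take part in the arithmetic.
try:
    linear_score(["high", 1.0], weights)
    failed = False
except TypeError:
    failed = True

print(failed)  # True
```

The second call fails because a string such as "high" cannot be multiplied by a floating-point weight.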

So, we need to convert the categorical labels into numerical values. In the machine learning world, this is often termed data transformation.

Now, let us see the different ways in which categorical labels can be handled.

Handling Categorical Labels

We will focus mainly on two methods here.

  • Label Encoding
  • One-Hot Encoding

1. Label Encoding

Label encoding is really simple. You assign an integer to each categorical label.

If we again consider the salary example, then you can encode low as 1, medium as 2, and high as 3.

Example of label encoding

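This works well when the labels have a natural order, as with salary, and when there are only a few of them. For labels with no inherent order, the integers suggest a ranking that does not exist (for example, that one shape is ‘greater’ than another), which can mislead an algorithm, and the problem becomes harder to manage as the number of labels grows.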
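Label encoding for the salary example can be sketched in plain Python as follows. The sample data and the names `salaries` and `salary_to_int` are assumptions made for illustration.

```python
# Hypothetical salary column, reusing the labels from the salary example.
salaries = ["low", "high", "medium", "low", "high"]

# Explicit mapping from each label to an integer.
salary_to_int = {"low": 1, "medium": 2, "high": 3}

# Replace every label with its integer code.
encoded = [salary_to_int[label] for label in salaries]

print(encoded)  # [1, 3, 2, 1, 3]
```

Writing the mapping by hand keeps control over which integer each label receives.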

This brings us to the second technique, One-Hot Encoding.

2. One-Hot Encoding

In one-hot encoding, each categorical variable is replaced by a set of binary variables, one for each category.

So, each category becomes a column whose value is either 0 or 1. Again, take the shape example into account. If the shape is, say, ‘triangular’, then the ‘triangular’ column is set to 1 and the columns for all other shapes are set to 0.

One-hot encoding is particularly useful in those cases where there is no ordinal relationship between the labels.

The following image may clear some things up.

Example of One-Hot Encoding

This technique really helps when the category labels have no natural order, and it still applies when there are numerous labels, although each additional label adds one more binary column.
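The shape example can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation; the sample data and the helper `one_hot` are hypothetical.

```python
# Hypothetical shape column, reusing the labels from the shape example.
shapes = ["round", "square", "triangular", "square"]

# One binary column per distinct label; sorting keeps the column order stable.
categories = sorted(set(shapes))


def one_hot(label, categories):
    """Return a list with 1 at the position of `label` and 0 everywhere else."""
    return [1 if label == category else 0 for category in categories]


encoded = [one_hot(shape, categories) for shape in shapes]

for shape, row in zip(shapes, encoded):
    print(shape, row)
# round [1, 0, 0]
# square [0, 1, 0]
# triangular [0, 0, 1]
# square [0, 1, 0]
```

In practice, libraries such as pandas (`get_dummies`) and scikit-learn (`OneHotEncoder`) provide this transformation, so writing it by hand is rarely necessary.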

Conclusion

In this article, you learned about handling categorical labels in machine learning using label encoding and one-hot encoding. I hope that you found it useful. If you have any thoughts, leave them in the comment section. Follow me on Twitter to get updates on articles.
