Data preparation or processing is one of the most important steps when working with real-world data on a machine learning project.
One of the major pain points in such situations is working with categorical data, because most machine learning algorithms cannot work with it directly. The data needs to be converted to numerical form first. One-hot encoding is a very good way to handle categorical data.
In this article, we will see what one-hot encoding is and where to use it.
Categorical Data
So, what is categorical data actually?
Simply speaking, in categorical data the values are labels instead of numbers.
Take the following case for example.
When you want to categorize ‘salary’ in a data set, you may label it as ‘low’, ‘medium’, or ‘high’.
Similarly, if you want to label the ‘shape’ of objects, you could use something like ‘round’, ‘square’, or ‘triangular’.
The Problem with Categorical Data
You may be thinking that categorizing values as labels like this seems fair and reasonable. You are right. But once machine learning algorithms come into the picture, the situation changes.
Most machine learning algorithms cannot handle categorical labels in a data set directly. Whether the task is classification or regression, the algorithms need numerical data to make their predictions.
So, we need to convert the categorical labels into numerical values. In the machine learning world, this is often called data transformation.
Now, let us see the different ways in which categorical labels can be handled.
Handling Categorical Labels
We will focus mainly on two methods here.
- Label Encoding
- One-Hot Encoding
1. Label Encoding
Label encoding is really simple. You assign an integer to each categorical label.
If we again consider the salary example, then you will be able to encode low as 1, medium as 2 and high as 3.
This process works fine as long as the number of labels is small. When the number of labels increases, this solution may not work very well. The integers also imply an order (low < medium < high), which may not exist for labels such as shapes.
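As a rough illustration, the salary mapping described above could be sketched in plain Python as follows. The variable names and the sample column are made up for this example, and the integer values are simply the ones used in the text.

```python
# Hypothetical mapping for the salary example; the integers are an arbitrary choice.
salary_codes = {"low": 1, "medium": 2, "high": 3}

# A made-up sample column of salary labels.
salaries = ["medium", "low", "high", "low"]

# Replace each label with its integer code.
encoded = [salary_codes[label] for label in salaries]
print(encoded)  # [2, 1, 3, 1]
```

In practice a library encoder would usually do this for you, but the idea is the same: one integer per label.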
This brings us to the second technique, One-Hot Encoding.
2. One-Hot Encoding
In one-hot encoding, each categorical variable is replaced by a set of binary variables, one per category.
So, the value for each category is either 0 or 1. Again, take the shape example into account. If the shape is, say, ‘triangular’, then the ‘triangular’ variable is set to 1 and the variables for all other shapes are set to 0.
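The shape example can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper `one_hot`, the column order, and the sample data are assumptions made for this example, not part of any library.

```python
def one_hot(values, categories):
    """Return one binary row per value, with one column per category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]

# Column order is an assumption; any fixed order works.
categories = ["round", "square", "triangular"]

# A made-up sample column of shape labels.
shapes = ["triangular", "round", "square"]

rows = one_hot(shapes, categories)
for shape, row in zip(shapes, rows):
    print(shape, row)
# triangular [0, 0, 1]
# round [1, 0, 0]
# square [0, 1, 0]
```

Each row contains exactly one 1, in the column of the matching shape.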
One-hot encoding is particularly used in those cases where there is no ordinal relationship between the labels.
The following image may clear some things up.
This technique really helps when the category labels are not related to each other, even if there are numerous labels. Keep in mind that each label adds one binary column to the data set.
Conclusion
In this article, you learned about handling categorical labels in machine learning. I hope that you gained some knowledge from it. If you have any thoughts, leave them in the comment section. Follow me on Twitter to get updates on articles.