Monday, April 29, 2019

One Hot Encoding

One-hot encoding is a process that converts categorical variables into a numeric form that ML algorithms can consume, which often helps them do a better job in prediction.


Let's take a dataset of food names in which each name is assigned a categorical value. If another food name were added to this dataset, it would be assigned the categorical value 4. As the number of unique values increases, the categorical values increase as well.

What is Categorical Data?
  • Categorical data are variables that contain label values rather than numeric values.
  • The number of possible values is often limited to a fixed set.
  • Categorical variables are often called nominal.

Some categories may have a natural relationship to each other, such as a natural ordering.


Converting categorical data to numerical data involves two steps:
  1. Integer Encoding
  2. One-Hot Encoding

1. Integer Encoding
As a first step, each unique category value is assigned an integer value.

For example, “Apple” is 1, “Chicken” is 2, and “Broccoli” is 3.

This is called a label encoding or an integer encoding and is easily reversible.

For some variables, this may be enough.

The integer values have a natural ordered relationship with each other, and machine learning algorithms may be able to understand and harness this relationship.
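The integer encoding step can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the sample list `foods` and the names `category_to_int` and `int_to_category` are assumptions made for this example, and codes start at 1 to match the "Apple" is 1 convention used above.

```python
# Hypothetical sample data for illustration only.
foods = ["Apple", "Chicken", "Broccoli", "Apple", "Broccoli"]

# Assign each unique category an integer in order of first appearance,
# starting at 1 to match the example in the text.
category_to_int = {}
for food in foods:
    if food not in category_to_int:
        category_to_int[food] = len(category_to_int) + 1

# Replace every label with its integer code.
encoded = [category_to_int[food] for food in foods]

# The encoding is easily reversible: invert the mapping and decode.
int_to_category = {code: food for food, code in category_to_int.items()}
decoded = [int_to_category[code] for code in encoded]

print(category_to_int)
print(encoded)
print(decoded)
```

Keeping the mapping in a dictionary is what makes the encoding reversible: the same dictionary, inverted, turns the integers back into the original labels.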

2. One-Hot Encoding
For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
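A small arithmetic sketch shows how a "halfway" prediction can arise. It assumes the mapping Apple = 1, Chicken = 2, Broccoli = 3 from the example above, and it imagines a hypothetical model that is undecided between Apple and Broccoli and therefore outputs the average of their codes.

```python
# Integer codes from the example in the text.
category_to_int = {"Apple": 1, "Chicken": 2, "Broccoli": 3}
int_to_category = {code: food for food, code in category_to_int.items()}

# Hypothetical numeric output of a model torn between Apple and Broccoli:
# the midpoint of their integer codes.
midpoint = (category_to_int["Apple"] + category_to_int["Broccoli"]) / 2

# Decoding the midpoint lands on a third, unrelated category.
predicted = int_to_category[round(midpoint)]

print(midpoint)
print(predicted)
```

Here the midpoint of Apple and Broccoli decodes to Chicken, even though nothing about chicken lies "between" an apple and broccoli. The ordering was introduced by the encoding, not by the data.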

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
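The following minimal sketch applies one-hot encoding to an integer-encoded variable in plain Python. The sample data and the helper `one_hot` are assumptions made for this illustration; in practice a library routine would usually be used instead.

```python
# Hypothetical sample data; codes follow Apple = 1, Chicken = 2, Broccoli = 3.
categories = ["Apple", "Chicken", "Broccoli"]
foods = ["Apple", "Chicken", "Broccoli", "Apple"]
integer_encoded = [categories.index(food) + 1 for food in foods]


def one_hot(code, num_categories):
    """Return a binary vector with a single 1 at the position for `code`."""
    vector = [0] * num_categories
    vector[code - 1] = 1  # codes start at 1, list indices start at 0
    return vector


# Each integer is replaced by one binary variable per unique category.
one_hot_encoded = [one_hot(code, len(categories)) for code in integer_encoded]

for food, row in zip(foods, one_hot_encoded):
    print(food, row)
```

Each row contains exactly one 1, so no category is numerically larger than or closer to another, which removes the artificial ordering that the integer codes carried.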