One-hot encoding is a process by which categorical variables are converted into a numeric form that can be provided to machine learning algorithms, so that they can do a better job at prediction.

**CATEGORICAL DATA**

Let's take a dataset of food names. In this dataset, if there were another food name, it would be assigned the categorical value 4. As the number of unique values increases, the number of categorical values increases as well.

**What is Categorical Data?**

- Categorical data are variables that contain label values rather than numeric values.
- The number of possible values is often limited to a fixed set.
- Categorical variables are often called nominal.

Some categories may have a natural relationship to each other, such as a natural ordering.

**CONVERT CATEGORICAL DATA INTO NUMERICAL DATA**

This involves two steps:

- Integer Encoding
- One-Hot Encoding

**1. Integer Encoding**

As a first step, each unique category value is assigned an integer value.

For example, “Apple” is 1, “Chicken” is 2, and “Broccoli” is 3.

This is called a **label encoding** or an **integer encoding** and is easily reversible. For some variables, this may be enough.

The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.
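
To make the integer encoding step concrete, here is a minimal sketch in plain Python. The sample list `foods` and the names `category_to_int` and `int_to_category` are made up for illustration; integers are assigned in order of first appearance, starting at 1 to match the example above. The last lines show why the encoding is easily reversible.

```python
# Hypothetical sample data for illustration only.
foods = ["Apple", "Chicken", "Broccoli", "Apple", "Broccoli"]

# Assign each unique category an integer, in order of first appearance.
category_to_int = {}
for food in foods:
    if food not in category_to_int:
        category_to_int[food] = len(category_to_int) + 1

# Integer (label) encode the whole column.
encoded = [category_to_int[food] for food in foods]

# The mapping is one-to-one, so it can be inverted to recover the labels.
int_to_category = {value: name for name, value in category_to_int.items()}
decoded = [int_to_category[value] for value in encoded]

print(category_to_int)  # {'Apple': 1, 'Chicken': 2, 'Broccoli': 3}
print(encoded)          # [1, 2, 3, 1, 3]
print(decoded == foods) # True
```

Note that a new food name would simply receive the next integer, 4, which is why the integer values grow with the number of unique categories.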

**2. One-Hot Encoding**

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
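
As a rough sketch of this step, the snippet below turns an integer encoded column into one binary variable per unique integer. It uses plain Python only; the sample values (Apple=1, Chicken=2, Broccoli=3) and the names `encoded`, `unique_values`, and `one_hot` are assumptions made for illustration.

```python
# Hypothetical integer encoded column: Apple=1, Chicken=2, Broccoli=3.
encoded = [1, 2, 3, 1, 3]

# One new binary variable (column) per unique integer value.
unique_values = sorted(set(encoded))

# Each row has a 1 in the column of its own category and 0 everywhere else.
one_hot = [
    [1 if value == unique else 0 for unique in unique_values]
    for value in encoded
]

for value, row in zip(encoded, one_hot):
    print(value, row)
# 1 [1, 0, 0]
# 2 [0, 1, 0]
# 3 [0, 0, 1]
# 1 [1, 0, 0]
# 3 [0, 0, 1]
```

Every row contains exactly one 1, so no category appears "larger" or "smaller" than another. In practice this is usually done with a library helper such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than by hand.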
