[Avg. reading time: 13 minutes]

Data Encoding

Data Encoding is the process of converting categorical data (like colors, countries, product types) into a numeric format that ML models can understand.

Unlike numerical data, categorical data is not directly usable because models operate on numbers, not labels.

Encoding ensures categorical values are represented in a way that preserves meaning and avoids misleading the model.

Encoding is typically rule-based: a fixed, deterministic mapping is applied rather than something the model learns.


Example: Products

| ID | Product |
|----|---------|
| 1  | Laptop  |
| 2  | Phone   |
| 3  | Tablet  |
| 4  | Phone   |

Label Encoding

Assigns each category a unique integer.

| ID | Product (Encoded) |
|----|-------------------|
| 1  | 0 |
| 2  | 1 |
| 3  | 2 |
| 4  | 1 |

Pros:

  • Very simple, minimal storage.
  • Works well for tree-based models.

Cons:

  • Implies an order between categories (Laptop < Phone < Tablet).
  • Can mislead linear models, which treat the integers as if they were magnitudes.
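
A minimal sketch of label encoding, assuming pandas and the toy DataFrame from the example above:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Product": ["Laptop", "Phone", "Tablet", "Phone"]})

# pandas assigns codes in sorted category order: Laptop=0, Phone=1,
# Tablet=2, which matches the encoded table above.
df["Product_Encoded"] = df["Product"].astype("category").cat.codes
print(df)
```

scikit-learn's LabelEncoder produces the same kind of mapping, though it is intended for target labels rather than input features.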

One-Hot Encoding

Creates a binary column for each category.

| ID | Laptop | Phone | Tablet |
|----|--------|-------|--------|
| 1  | 1 | 0 | 0 |
| 2  | 0 | 1 | 0 |
| 3  | 0 | 0 | 1 |
| 4  | 0 | 1 | 0 |

Pros:

  • No ordinal assumption.
  • Easy to interpret.

Cons:

  • High dimensionality for many products (e.g., thousands of SKUs).
  • Sparse data, more memory needed.
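
A minimal sketch with pandas (`pd.get_dummies`), again assuming the same toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Product": ["Laptop", "Phone", "Tablet", "Phone"]})

# One binary column per product; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(df, columns=["Product"], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` drops one redundant column and avoids perfectly correlated dummies.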

Ordinal Encoding

Encodes categories when they have a natural order.

Works for things like product size or version level.

Example (Product Tier):

| ID | Product Tier |
|----|--------------|
| 1  | Basic    |
| 2  | Standard |
| 3  | Premium  |
| 4  | Standard |

After Ordinal Encoding:

| ID | Product Tier (Encoded) |
|----|------------------------|
| 1  | 1 |
| 2  | 2 |
| 3  | 3 |
| 4  | 2 |

Pros:

  • Preserves rank/order.
  • Efficient storage.

Cons:

  • Only valid if order is real (Basic < Standard < Premium).
  • Wrong if categories are unordered (Laptop vs Phone).
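
A sketch of ordinal encoding via an explicit mapping (pandas assumed; the 1/2/3 codes mirror the table above):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Product Tier": ["Basic", "Standard", "Premium", "Standard"]})

# The order is stated explicitly, so Basic < Standard < Premium is preserved.
tier_order = {"Basic": 1, "Standard": 2, "Premium": 3}
df["Tier_Encoded"] = df["Product Tier"].map(tier_order)
print(df)
```

scikit-learn's `OrdinalEncoder(categories=[["Basic", "Standard", "Premium"]])` achieves the same thing, but numbers the tiers 0/1/2.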

Target Encoding (Mean Encoding)

Replaces each category with the mean of the target variable.

Target: “Purchased” (Yes = 1, No = 0).

| ID | Product | Purchased |
|----|---------|-----------|
| 1  | Laptop  | 1 |
| 2  | Phone   | 0 |
| 3  | Tablet  | 1 |
| 4  | Phone   | 1 |

Compute the mean purchase rate per product:

  • Laptop = 1.0
  • Phone = 0.5 (one purchase out of two)
  • Tablet = 1.0

After Target Encoding:

| ID | Product (Encoded) | Purchased |
|----|-------------------|-----------|
| 1  | 1.0 | 1 |
| 2  | 0.5 | 0 |
| 3  | 1.0 | 1 |
| 4  | 0.5 | 1 |

Pros:

  • Great for high-cardinality features (e.g., hundreds of product SKUs).
  • Often improves accuracy.
  • Keeps dataset compact (just 1 numeric column).
  • Often boosts performance in models like Logistic Regression or Gradient Boosted Trees.

Cons:

  • Risk of data leakage if target encoding is done on the whole dataset.
  • Must use cross-validation to avoid leakage.
  • Compute intensive.
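
A minimal, leakage-prone sketch on the toy data above (pandas assumed); the closing comment notes how it would be made leakage-safe:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Purchased": [1, 0, 1, 1]})

# Mean purchase rate per product: Laptop=1.0, Phone=0.5, Tablet=1.0.
means = df.groupby("Product")["Purchased"].mean()
df["Product_Encoded"] = df["Product"].map(means)
print(df)

# To avoid leakage, compute `means` on the training fold only (or inside a
# cross-validation loop) and map them onto the held-out rows.
```

Libraries such as scikit-learn (TargetEncoder, from roughly version 1.3) handle the cross-fitting internally.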

| Encoding Type | Best For | Avoid When |
|---------------|----------|------------|
| Label Encoding | Tree-based models, low-cardinality products | Linear models, unordered categories |
| One-Hot Encoding | General ML, few product categories | Very high-cardinality features |
| Ordinal Encoding | Ordered categories (tiers, sizes, versions) | Unordered categories (Phone vs Laptop) |
| Target Encoding | High-cardinality products, with proper CV | Without CV (leakage risk) |

Multiple Categorical Columns

| ID | Product | Product Tier | Category    | Purchased |
|----|---------|--------------|-------------|-----------|
| 1  | Laptop  | Premium      | PC          | 1 |
| 2  | Phone   | Basic        | Mobile      | 0 |
| 3  | Tablet  | Standard     | Electronics | 1 |
| 4  | Phone   | Premium      | Mobile      | 1 |

  • Product: Laptop, Phone, Tablet (unordered)
  • Product Tier: Basic < Standard < Premium (ordered)
  • Category: PC, Mobile, Electronics (unordered)

Label Encoding (all columns)

Replace each category with an integer.

| ID | Product | Product Tier | Category |
|----|---------|--------------|----------|
| 1  | 0 | 2 | 0 |
| 2  | 1 | 0 | 1 |
| 3  | 2 | 1 | 2 |
| 4  | 1 | 2 | 1 |

An artificial order is created (e.g., PC=0, Mobile=1, Electronics=2) even though these categories have no ranking.
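
A sketch that integer-codes every column at once (pandas assumed); the exact integer assignment depends on the tool, so it will not match the table above column for column:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Product Tier": ["Premium", "Basic", "Standard", "Premium"],
                   "Category": ["PC", "Mobile", "Electronics", "Mobile"]})

# Codes follow alphabetical order within each column (e.g. Electronics=0,
# Mobile=1, PC=2), which is just as artificial as any other assignment.
encoded = df.apply(lambda col: col.astype("category").cat.codes)
print(encoded)
```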

One-Hot Encoding (all columns)

| ID | Laptop | Phone | Tablet | Tier_Basic | Tier_Standard | Tier_Premium | Cat_PC | Cat_Mobile | Cat_Electronics |
|----|--------|-------|--------|------------|---------------|--------------|--------|------------|-----------------|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |

Very interpretable, but the column count explodes once you have 50+ products or 100+ categories.
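
The same expansion with scikit-learn's OneHotEncoder (assuming scikit-learn 1.2+, where the dense-output flag is named `sparse_output`); with thousands of SKUs you would keep the default sparse output instead:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Product Tier": ["Premium", "Basic", "Standard", "Premium"],
                   "Category": ["PC", "Mobile", "Electronics", "Mobile"]})

# Every categorical column becomes a block of 0/1 indicator columns.
enc = OneHotEncoder(sparse_output=False)
matrix = enc.fit_transform(df)
print(pd.DataFrame(matrix, columns=enc.get_feature_names_out()))
```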

Mixed Encoding (best practice)

  • Product → One-Hot (few categories).
  • Product Tier → Ordinal (Basic=1, Standard=2, Premium=3).
  • Category → One-Hot (PC, Mobile, Electronics).

| ID | Laptop | Phone | Tablet | Tier (Ordinal) | Cat_PC | Cat_Mobile | Cat_Electronics |
|----|--------|-------|--------|----------------|--------|------------|-----------------|
| 1 | 1 | 0 | 0 | 3 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
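
One way to implement the mixed scheme is a ColumnTransformer (scikit-learn assumed; note that OrdinalEncoder numbers the tiers 0/1/2 rather than the 1/2/3 shown above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Product Tier": ["Premium", "Basic", "Standard", "Premium"],
                   "Category": ["PC", "Mobile", "Electronics", "Mobile"]})

# One-hot for the unordered columns, ordinal with an explicit order for the tier.
mixed = ColumnTransformer([
    ("onehot", OneHotEncoder(sparse_output=False), ["Product", "Category"]),
    ("ordinal", OrdinalEncoder(categories=[["Basic", "Standard", "Premium"]]),
     ["Product Tier"]),
])
encoded = pd.DataFrame(mixed.fit_transform(df),
                       columns=mixed.get_feature_names_out())
print(encoded)
```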

#onehot_encoding #target_encoding #label_encoding