Data Encoding
Data Encoding is the process of converting categorical data (like colors, countries, product types) into a numeric format that ML models can understand.
Unlike numerical data, categorical data is not directly usable because models operate on numbers, not labels.
Encoding ensures categorical values are represented in a way that preserves meaning and avoids misleading the model.
Encoding is typically rule-based: each category is mapped to a number by a fixed rule rather than by a learned model.
Example: Products
| ID | Product |
|---|---|
| 1 | Laptop |
| 2 | Phone |
| 3 | Tablet |
| 4 | Phone |
Label Encoding
Assigns each category a unique integer.
| ID | Product (Encoded) |
|---|---|
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |
| 4 | 1 |
Pros:
- Very simple, minimal storage.
- Works well for tree-based models.
Cons:
- Implies an order between categories (Laptop < Phone < Tablet).
- Can mislead linear models, which treat the integers as magnitudes.
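The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `label_encode` is hypothetical, and assigning integers in sorted (alphabetical) order is an assumption chosen because it reproduces the table above.

```python
# Sample data from the Products table.
products = ["Laptop", "Phone", "Tablet", "Phone"]


def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {category: index for index, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


encoded, mapping = label_encode(products)
print(mapping)  # {'Laptop': 0, 'Phone': 1, 'Tablet': 2}
print(encoded)  # [0, 1, 2, 1]
```

The integers carry no meaning beyond identity here; any other assignment order would be an equally valid label encoding.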
One-Hot Encoding
Creates a binary column for each category.
| ID | Laptop | Phone | Tablet |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |
Pros:
- No ordinal assumption.
- Easy to interpret.
Cons:
- High dimensionality for many products (e.g., thousands of SKUs).
- Sparse data, more memory needed.
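As a rough sketch of the idea, the one-hot table above can be built in plain Python. The helper name `one_hot_encode` is hypothetical, and ordering the columns alphabetically is an assumption made only to match the table.

```python
# Sample data from the Products table.
products = ["Laptop", "Phone", "Tablet", "Phone"]


def one_hot_encode(values):
    """Return the column names and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


columns, rows = one_hot_encode(products)
print(columns)  # ['Laptop', 'Phone', 'Tablet']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

Each row contains exactly one 1, so the number of columns grows with the number of distinct categories, which is where the dimensionality cost comes from.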
Ordinal Encoding
Encodes categories when they have a natural order.
Works for things like product size or version level.
Example (Product Tier):
| ID | Product Tier |
|---|---|
| 1 | Basic |
| 2 | Standard |
| 3 | Premium |
| 4 | Standard |
After Ordinal Encoding:
| ID | Product Tier (Encoded) |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 2 |
Pros:
- Preserves rank/order.
- Efficient storage.
Cons:
- Only valid if order is real (Basic < Standard < Premium).
- Wrong if categories are unordered (Laptop vs Phone).
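Unlike label encoding, the order here has to be supplied explicitly. A minimal sketch, assuming the tier order Basic < Standard < Premium from the example; the names `TIER_ORDER` and `ordinal_encode` are illustrative only.

```python
# The order is domain knowledge and must be stated, not inferred from the data.
TIER_ORDER = ["Basic", "Standard", "Premium"]

# Sample data from the Product Tier table.
tiers = ["Basic", "Standard", "Premium", "Standard"]


def ordinal_encode(values, order, start=1):
    """Replace each value with its rank in `order`, counting from `start`."""
    rank = {category: position for position, category in enumerate(order, start=start)}
    # An unknown category raises KeyError, which surfaces bad data early.
    return [rank[value] for value in values]


encoded = ordinal_encode(tiers, TIER_ORDER)
print(encoded)  # [1, 2, 3, 2]
```

Sorting the categories alphabetically instead would give Basic < Premium < Standard, which is why the order is passed in rather than derived.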
Target Encoding (Mean Encoding)
Replaces each category with the mean of the target variable.
Target: “Purchased” (Yes = 1, No = 0)
| ID | Product | Purchased |
|---|---|---|
| 1 | Laptop | 1 |
| 2 | Phone | 0 |
| 3 | Tablet | 1 |
| 4 | Phone | 1 |
Compute the mean purchase rate for each product:
- Laptop = 1 / 1 = 1.0
- Phone = (0 + 1) / 2 = 0.5
- Tablet = 1 / 1 = 1.0
Then replace each product with its mean:
| ID | Product (Encoded) | Purchased |
|---|---|---|
| 1 | 1.0 | 1 |
| 2 | 0.5 | 0 |
| 3 | 1.0 | 1 |
| 4 | 0.5 | 1 |
Pros:
- Great for high-cardinality features (e.g., hundreds of product SKUs).
- Often improves accuracy in models like Logistic Regression or Gradient Boosted Trees.
- Keeps dataset compact (just 1 numeric column).
Cons:
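- Risk of data leakage if the encoding is computed on the whole dataset, because each row's own target value then contributes to its encoded feature.
- To avoid leakage, compute the means on the training folds only (e.g., with cross-validation) and apply them to the held-out rows.
- More compute-intensive than the other encodings.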
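The difference between the naive encoding and a leakage-aware one can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, folds are assigned by row index modulo `n_folds`, and a category that does not appear in the training fold falls back to that fold's overall mean.

```python
from collections import defaultdict

# Sample data from the Purchased table.
products = ["Laptop", "Phone", "Tablet", "Phone"]
purchased = [1, 0, 1, 1]


def category_means(categories, targets):
    """Mean target value per category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


def target_encode_naive(categories, targets):
    """Encode using means from the whole dataset (leaks the target)."""
    means = category_means(categories, targets)
    return [means[category] for category in categories]


def target_encode_out_of_fold(categories, targets, n_folds=2):
    """Encode each row using means computed without that row's fold."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        training = [i for i in range(n) if i % n_folds != fold]
        train_targets = [targets[i] for i in training]
        # Assumed fallback for categories unseen in the training rows.
        fallback = sum(train_targets) / len(train_targets)
        means = category_means([categories[i] for i in training], train_targets)
        for i in held_out:
            encoded[i] = means.get(categories[i], fallback)
    return encoded


print(target_encode_naive(products, purchased))        # [1.0, 0.5, 1.0, 0.5]
print(target_encode_out_of_fold(products, purchased))  # [0.5, 1.0, 0.5, 1.0]
```

With only four rows, every held-out category is missing from its training fold, so the out-of-fold values are all fallbacks; on a realistically sized dataset most rows would receive a per-category mean instead.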
| Encoding Type | Best For | Avoid When |
|---|---|---|
| Label Encoding | Tree-based models, low-cardinality products | Linear models, unordered categories |
| One-Hot Encoding | General ML, few product categories | Very high-cardinality features |
| Ordinal Encoding | Ordered categories (tiers, sizes, versions) | Unordered categories (Phone vs Laptop) |
| Target Encoding | High-cardinality products, with proper CV | Without CV (leakage risk) |
Multiple Categorical Columns
| ID | Product | Product Tier | Category | Purchased |
|---|---|---|---|---|
| 1 | Laptop | Premium | PC | 1 |
| 2 | Phone | Basic | Mobile | 0 |
| 3 | Tablet | Standard | Electronics | 1 |
| 4 | Phone | Premium | Mobile | 1 |
- Product: Laptop, Phone, Tablet
- Product Tier: Basic < Standard < Premium (ordered)
- Category: PC, Mobile, Electronics (unordered)
Label Encoding (all columns)
Replace each category with an integer.
| ID | Product | Product Tier | Category |
|---|---|---|---|
| 1 | 0 | 2 | 0 |
| 2 | 1 | 0 | 1 |
| 3 | 2 | 1 | 2 |
| 4 | 1 | 2 | 1 |
This creates an artificial order in the unordered columns (e.g., PC=0, Mobile=1, Electronics=2 implies PC < Mobile < Electronics).
One-Hot Encoding (all columns)
| ID | Laptop | Phone | Tablet | Tier_Basic | Tier_Standard | Tier_Premium | Cat_PC | Cat_Mobile | Cat_Electronics |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
This is very interpretable, but the number of columns explodes if you have 50+ products or 100+ categories.
Mixed Encoding (best practice)
- Product → One-Hot (few categories).
- Product Tier → Ordinal (Basic=1, Standard=2, Premium=3).
- Category → One-Hot (PC, Mobile, Electronics).
| ID | Laptop | Phone | Tablet | Tier (Ordinal) | Cat_PC | Cat_Mobile | Cat_Electronics |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 3 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
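The mixed scheme amounts to choosing one encoder per column and concatenating the results. A minimal sketch in plain Python, assuming the column orders shown in the table above; the names `TIER_RANK`, `one_hot`, and `encode_row` are illustrative.

```python
# Rows from the multi-column example table (ID and target omitted).
rows = [
    {"Product": "Laptop", "Tier": "Premium", "Category": "PC"},
    {"Product": "Phone", "Tier": "Basic", "Category": "Mobile"},
    {"Product": "Tablet", "Tier": "Standard", "Category": "Electronics"},
    {"Product": "Phone", "Tier": "Premium", "Category": "Mobile"},
]

PRODUCTS = ["Laptop", "Phone", "Tablet"]                 # unordered -> one-hot
TIER_RANK = {"Basic": 1, "Standard": 2, "Premium": 3}    # ordered -> ordinal
CATEGORIES = ["PC", "Mobile", "Electronics"]             # unordered -> one-hot


def one_hot(value, categories):
    """Binary indicator list with a 1 in the position of `value`."""
    return [1 if value == category else 0 for category in categories]


def encode_row(row):
    """Concatenate the per-column encodings into one feature vector."""
    return (
        one_hot(row["Product"], PRODUCTS)
        + [TIER_RANK[row["Tier"]]]
        + one_hot(row["Category"], CATEGORIES)
    )


encoded = [encode_row(row) for row in rows]
for features in encoded:
    print(features)
# [1, 0, 0, 3, 1, 0, 0]
# [0, 1, 0, 1, 0, 1, 0]
# [0, 0, 1, 2, 0, 0, 1]
# [0, 1, 0, 3, 0, 1, 0]
```

Keeping the tier as a single ordinal column saves two columns compared with one-hot encoding everything, while the unordered columns avoid any implied ranking.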