[Avg. reading time: 13 minutes]

Data Encoding

Data Encoding is the process of converting categorical data (like colors, countries, product types) into a numeric format that ML models can understand.

Unlike numerical data, categorical data is not directly usable because models operate on numbers, not labels.

Encoding ensures categorical values are represented in a way that preserves meaning and avoids misleading the model.

Encoding is typically rule-based: a fixed, deterministic mapping is applied rather than something the model learns.


Example: Products

| ID | Product |
|----|---------|
| 1  | Laptop  |
| 2  | Phone   |
| 3  | Tablet  |
| 4  | Phone   |

Label Encoding

Assigns each category a unique integer.

| ID | Product (Encoded) |
|----|-------------------|
| 1  | 0 |
| 2  | 1 |
| 3  | 2 |
| 4  | 1 |

Pros:

  • Very simple, minimal storage.
  • Works well for tree-based models.

Cons:

  • Implies an order between categories (Laptop < Phone < Tablet).
  • Can mislead linear models, which treat the integers as if they were magnitudes.
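
A minimal sketch of label encoding, assuming pandas and the toy DataFrame from the example above:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Product": ["Laptop", "Phone", "Tablet", "Phone"]})

# pandas assigns codes in sorted category order: Laptop=0, Phone=1,
# Tablet=2, which matches the encoded table above.
df["Product_Encoded"] = df["Product"].astype("category").cat.codes
print(df)
```

scikit-learn's LabelEncoder produces the same kind of mapping, though it is intended for target labels rather than input features.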

One-Hot Encoding

Creates a binary column for each category.

| ID | Laptop | Phone | Tablet |
|----|--------|-------|--------|
| 1  | 1 | 0 | 0 |
| 2  | 0 | 1 | 0 |
| 3  | 0 | 0 | 1 |
| 4  | 0 | 1 | 0 |

Pros:

  • No ordinal assumption.
  • Easy to interpret.

Cons:

  • High dimensionality for many products (e.g., thousands of SKUs).
  • Sparse data, more memory needed.
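
A minimal sketch with pandas (`pd.get_dummies`), again assuming the same toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Product": ["Laptop", "Phone", "Tablet", "Phone"]})

# One binary column per product; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(df, columns=["Product"], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` drops one redundant column and avoids perfectly correlated dummies.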

Ordinal Encoding

Encodes categories when they have a natural order.

Works for things like product size or version level.

Example (Product Tier):

| ID | Product Tier |
|----|--------------|
| 1  | Basic    |
| 2  | Standard |
| 3  | Premium  |
| 4  | Standard |

After Ordinal Encoding:

| ID | Product Tier (Encoded) |
|----|------------------------|
| 1  | 1 |
| 2  | 2 |
| 3  | 3 |
| 4  | 2 |

Pros:

  • Preserves rank/order.
  • Efficient storage.

Cons:

  • Only valid if order is real (Basic < Standard < Premium).
  • Wrong if categories are unordered (Laptop vs Phone).
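
A sketch of ordinal encoding via an explicit mapping (pandas assumed; the 1/2/3 codes mirror the table above):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Product Tier": ["Basic", "Standard", "Premium", "Standard"]})

# The order is stated explicitly, so Basic < Standard < Premium is preserved.
tier_order = {"Basic": 1, "Standard": 2, "Premium": 3}
df["Tier_Encoded"] = df["Product Tier"].map(tier_order)
print(df)
```

scikit-learn's `OrdinalEncoder(categories=[["Basic", "Standard", "Premium"]])` achieves the same thing, but numbers the tiers 0/1/2.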

Target Encoding (Mean Encoding)

Replaces each category with the mean of the target variable.

Target: “Purchased” (Yes = 1, No = 0).

| ID | Product | Purchased |
|----|---------|-----------|
| 1  | Laptop  | 1 |
| 2  | Phone   | 0 |
| 3  | Tablet  | 1 |
| 4  | Phone   | 1 |

Compute the mean purchase rate per product:

  • Laptop = 1.0
  • Phone = 0.5 (one purchase out of two)
  • Tablet = 1.0

After Target Encoding:

| ID | Product (Encoded) | Purchased |
|----|-------------------|-----------|
| 1  | 1.0 | 1 |
| 2  | 0.5 | 0 |
| 3  | 1.0 | 1 |
| 4  | 0.5 | 1 |

Pros:

  • Great for high-cardinality features (e.g., hundreds of product SKUs).
  • Often improves accuracy.
  • Keeps dataset compact (just 1 numeric column).
  • Often boosts performance in models like Logistic Regression or Gradient Boosted Trees.

Cons:

  • Risk of data leakage if target encoding is done on the whole dataset.
  • Must use cross-validation to avoid leakage.
  • Compute intensive.
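
A minimal, leakage-prone sketch on the toy data above (pandas assumed); the closing comment notes how it would be made leakage-safe:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Purchased": [1, 0, 1, 1]})

# Mean purchase rate per product: Laptop=1.0, Phone=0.5, Tablet=1.0.
means = df.groupby("Product")["Purchased"].mean()
df["Product_Encoded"] = df["Product"].map(means)
print(df)

# To avoid leakage, compute `means` on the training fold only (or inside a
# cross-validation loop) and map them onto the held-out rows.
```

Libraries such as scikit-learn (TargetEncoder, from roughly version 1.3) handle the cross-fitting internally.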

| Encoding Type | Best For | Avoid When |
|---------------|----------|------------|
| Label Encoding | Tree-based models, low-cardinality products | Linear models, unordered categories |
| One-Hot Encoding | General ML, few product categories | Very high-cardinality features |
| Ordinal Encoding | Ordered categories (tiers, sizes, versions) | Unordered categories (Phone vs Laptop) |
| Target Encoding | High-cardinality products, with proper CV | Without CV (leakage risk) |

Multiple Categorical Columns

| ID | Product | Product Tier | Category    | Purchased |
|----|---------|--------------|-------------|-----------|
| 1  | Laptop  | Premium      | PC          | 1 |
| 2  | Phone   | Basic        | Mobile      | 0 |
| 3  | Tablet  | Standard     | Electronics | 1 |
| 4  | Phone   | Premium      | Mobile      | 1 |

  • Product: Laptop, Phone, Tablet (unordered)
  • Product Tier: Basic < Standard < Premium (ordered)
  • Category: PC, Mobile, Electronics (unordered)

Label Encoding (all columns)

Replace each category with an integer.

| ID | Product | Product Tier | Category |
|----|---------|--------------|----------|
| 1  | 0 | 2 | 0 |
| 2  | 1 | 0 | 1 |
| 3  | 2 | 1 | 2 |
| 4  | 1 | 2 | 1 |

An artificial order is created (e.g., PC=0, Mobile=1, Electronics=2) even though these categories have no ranking.
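
A sketch that integer-codes every column at once (pandas assumed); the exact integer assignment depends on the tool, so it will not match the table above column for column:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Product Tier": ["Premium", "Basic", "Standard", "Premium"],
                   "Category": ["PC", "Mobile", "Electronics", "Mobile"]})

# Codes follow alphabetical order within each column (e.g. Electronics=0,
# Mobile=1, PC=2), which is just as artificial as any other assignment.
encoded = df.apply(lambda col: col.astype("category").cat.codes)
print(encoded)
```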

One-Hot Encoding (all columns)

| ID | Laptop | Phone | Tablet | Tier_Basic | Tier_Standard | Tier_Premium | Cat_PC | Cat_Mobile | Cat_Electronics |
|----|--------|-------|--------|------------|---------------|--------------|--------|------------|-----------------|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |

Very interpretable, but the column count explodes once you have 50+ products or 100+ categories.
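
The same expansion with scikit-learn's OneHotEncoder (assuming scikit-learn 1.2+, where the dense-output flag is named `sparse_output`); with thousands of SKUs you would keep the default sparse output instead:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Product Tier": ["Premium", "Basic", "Standard", "Premium"],
                   "Category": ["PC", "Mobile", "Electronics", "Mobile"]})

# Every categorical column becomes a block of 0/1 indicator columns.
enc = OneHotEncoder(sparse_output=False)
matrix = enc.fit_transform(df)
print(pd.DataFrame(matrix, columns=enc.get_feature_names_out()))
```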

Mixed Encoding (best practice)

  • Product → One-Hot (few categories).
  • Product Tier → Ordinal (Basic=1, Standard=2, Premium=3).
  • Category → One-Hot (PC, Mobile, Electronics).

| ID | Laptop | Phone | Tablet | Tier (Ordinal) | Cat_PC | Cat_Mobile | Cat_Electronics |
|----|--------|-------|--------|----------------|--------|------------|-----------------|
| 1 | 1 | 0 | 0 | 3 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
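
One way to implement the mixed scheme is a ColumnTransformer (scikit-learn assumed; note that OrdinalEncoder numbers the tiers 0/1/2 rather than the 1/2/3 shown above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"Product": ["Laptop", "Phone", "Tablet", "Phone"],
                   "Product Tier": ["Premium", "Basic", "Standard", "Premium"],
                   "Category": ["PC", "Mobile", "Electronics", "Mobile"]})

# One-hot for the unordered columns, ordinal with an explicit order for the tier.
mixed = ColumnTransformer([
    ("onehot", OneHotEncoder(sparse_output=False), ["Product", "Category"]),
    ("ordinal", OrdinalEncoder(categories=[["Basic", "Standard", "Premium"]]),
     ["Product Tier"]),
])
encoded = pd.DataFrame(mixed.fit_transform(df),
                       columns=mixed.get_feature_names_out())
print(encoded)
```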

#onehot_encoding #target_encoding #label_encoding