Feature Engineering
The process of transforming raw data into more informative inputs (features) for ML models.
Goes beyond encoding: you can create new features/metrics (like derived columns in the DB world) that pure encoding does not offer.
The goal of FE is to improve model accuracy, interpretability, and generalization.
Example (Laptop Sales):
Purchase Date = 2025-09-02
Derived Features:
- Month = 09
- DayOfWeek = Tuesday
- IsHolidaySeason = No
- IsWeekend = No
- IsLeapYear = No
- Quarter = Q3
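
The derived features above can be computed from the date itself. The following is a minimal sketch using only the Python standard library; the function name `derive_date_features` and the rule that "holiday season" means November or December are assumptions made for illustration, not definitions from these notes.

```python
import calendar
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def derive_date_features(d: date) -> dict:
    """Derive simple calendar features from a purchase date."""
    return {
        "Month": d.month,
        "DayOfWeek": DAY_NAMES[d.weekday()],      # weekday(): Monday == 0
        # Assumption: holiday season is November or December.
        "IsHolidaySeason": d.month in (11, 12),
        "IsWeekend": d.weekday() >= 5,            # Saturday or Sunday
        "IsLeapYear": calendar.isleap(d.year),
        "Quarter": f"Q{(d.month - 1) // 3 + 1}",
    }

features = derive_date_features(date(2025, 9, 2))
print(features)
```

Adjust the holiday rule to whatever the business actually treats as its peak season.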
Encoding (One-Hot, Label, Target) only turns categories into numbers.
But real-world data often hides useful patterns in dates, interactions, domain knowledge, or semantics.
| ID | Product | Purchase Date | Price | PurchasedAgain |
|---|---|---|---|---|
| 1 | Laptop | 2023-12-01 | 1200 | 1 |
| 2 | Laptop | 2024-07-15 | 1100 | 0 |
| 3 | Phone | 2024-05-20 | 800 | 1 |
| 4 | Tablet | 2024-08-05 | 600 | 1 |
- Encoding alone only handles Product (One-Hot or Target encoding).
Feature Engineering adds new insights:
- From Purchase Date: extract Month, DayOfWeek, IsHolidaySeason.
- From Price: create a Discounted? flag (true if Price < average price for that product).
- Combine features: Price / AvgCategoryPrice.
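
As a rough sketch of the two price-based features, the snippet below uses the four rows from the table. It assumes the "category" in AvgCategoryPrice is simply the Product column, since the table has no separate category field; the key names `Discounted` and `PriceToAvg` are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the table above (only the columns needed here).
rows = [
    {"ID": 1, "Product": "Laptop", "Price": 1200},
    {"ID": 2, "Product": "Laptop", "Price": 1100},
    {"ID": 3, "Product": "Phone", "Price": 800},
    {"ID": 4, "Product": "Tablet", "Price": 600},
]

# Average price per product (assumption: product doubles as category).
prices_by_product = defaultdict(list)
for row in rows:
    prices_by_product[row["Product"]].append(row["Price"])
avg_price = {product: mean(prices) for product, prices in prices_by_product.items()}

for row in rows:
    avg = avg_price[row["Product"]]
    row["Discounted"] = row["Price"] < avg    # below the product's average price
    row["PriceToAvg"] = row["Price"] / avg    # Price / AvgCategoryPrice

for row in rows:
    print(row)
```

With only one Phone and one Tablet row, their ratio is exactly 1.0, so these features only become informative once each product has several purchases.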
Basic Feature Engineering
Improve signals/patterns without domain-specific knowledge.
Scaling/Normalization: Price → (Price - mean) / std
Date/Time Features: Purchase Date 2023-12-01 → Month=12, DayOfWeek=Friday
Polynomial/Interaction: Price × Tier (Tier here is an assumed product-tier column, not shown in the table above)
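
A minimal sketch of standardization and an interaction feature, using the Price column from the earlier table. The `tiers` values are hypothetical (the table has no Tier column), and the population standard deviation is one reasonable choice among several.

```python
from statistics import mean, pstdev

prices = [1200, 1100, 800, 600]   # Price column from the table above
tiers = [3, 3, 2, 1]              # hypothetical Tier values, not in the original data

mu = mean(prices)
sigma = pstdev(prices)            # population standard deviation

# Standardization (z-score): (Price - mean) / std
scaled_prices = [(p - mu) / sigma for p in prices]

# Interaction feature: Price × Tier
price_x_tier = [p * t for p, t in zip(prices, tiers)]

print(scaled_prices)
print(price_x_tier)
```

In practice, the mean and standard deviation are computed on the training split only and then reused on validation and test data, so that no information leaks from the held-out rows.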
Pros:
- Easy to implement.
- Often gives a quick boost to many models, especially linear models and neural networks.
Cons:
- Risk of adding noise if done blindly.
- Limited unless combined with domain insights.
Domain-Specific Feature Engineering
Apply business/field knowledge.
Examples:
Finance: Debt-to-Income Ratio, Credit Utilization %
Healthcare: BMI = Weight (kg) / Height (m)², risk score categories
IoT: Rolling averages, peak detection in sensor data.
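
Two of these examples can be sketched in a few lines of plain Python. The helper names (`bmi`, `rolling_mean`, `find_peaks`) and the sensor readings are made up for illustration, and the peak rule (a value strictly greater than both neighbors) is a deliberately simple assumption; real sensor pipelines usually add smoothing or thresholds.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

def rolling_mean(values: list[float], window: int) -> list[float]:
    """Trailing mean over every full window of the given size."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def find_peaks(values: list[float]) -> list[int]:
    """Indices of values strictly greater than both neighbors."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]

readings = [20.0, 21.0, 23.0, 22.0, 30.0, 24.0]   # made-up sensor readings

patient_bmi = bmi(70.0, 1.75)
smoothed = rolling_mean(readings, window=3)
peaks = find_peaks(readings)

print(round(patient_bmi, 2), smoothed, peaks)
```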
Pros:
- Captures real-world meaning, which can yield large performance gains.
- Makes models easier to explain to stakeholders.
Cons:
- Requires domain expertise.
- Not always transferable between datasets.