[Avg. reading time: 4 minutes]
Reinforcement Learning
RLHF (Reinforcement Learning from Human Feedback)
It's like humans learning what to do and what not to do.
A learning paradigm where an agent interacts with an environment, takes actions, and learns from reward signals.
Instead of labeled data, it uses trial-and-error feedback.
Complements supervised/unsupervised learning.
Strongly linked to decision-making and control tasks.
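To make the trial-and-error loop concrete, here is a minimal sketch using tabular Q-learning on a made-up five-state chain environment; the environment, the reward of 1 for reaching the last state, and all hyperparameters are illustrative assumptions rather than anything from the text above.

```python
import random
from collections import defaultdict

# Toy "chain" environment (illustrative): states 0..4, reaching state 4 gives reward 1.
class ChainEnv:
    N_STATES, N_ACTIONS = 5, 2   # actions: 0 = move left, 1 = move right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Move right on action 1, left on action 0 (clamped at the ends).
        if action == 1:
            self.state = min(self.state + 1, self.N_STATES - 1)
        else:
            self.state = max(self.state - 1, 0)
        done = self.state == self.N_STATES - 1
        reward = 1.0 if done else 0.0            # a reward signal, not a labeled answer
        return self.state, reward, done

env = ChainEnv()
Q = defaultdict(float)                           # Q[(state, action)] -> estimated return
alpha, gamma, epsilon = 0.1, 0.9, 0.2            # learning rate, discount, exploration rate

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        # Trial and error: sometimes explore randomly, otherwise exploit current estimates.
        if random.random() < epsilon:
            action = random.randrange(env.N_ACTIONS)
        else:
            action = max(range(env.N_ACTIONS), key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)
        # Q-learning update: learn only from the reward signal.
        best_next = max(Q[(next_state, a)] for a in range(env.N_ACTIONS))
        Q[(state, action)] += alpha * (reward + gamma * best_next * (not done) - Q[(state, action)])
        state = next_state

# Greedy action per state after training (1 = move right toward the reward).
print({s: max(range(env.N_ACTIONS), key=lambda a: Q[(s, a)]) for s in range(env.N_STATES)})
```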
Example: YouTube recommends a video; if you watch it, the system registers that, and if you choose "Don't show me this", the system reacts to that as well.
Here the agent is the YouTube recommendation engine, the action is recommending a video, and the reward comes from the user's response: watching, liking, or sharing is positive feedback, while ignoring the video or clicking "Not interested" is negative.
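As a toy illustration of that feedback loop (not how YouTube actually works), the sketch below models each candidate video as an arm of a multi-armed bandit with a hidden watch probability; the video names, probabilities, and exploration rate are all invented for the example.

```python
import random

# Hypothetical setup: each candidate video is an "arm"; its hidden watch probability
# stands in for how the user would respond (watch/like = reward 1, ignore/"Not interested" = 0).
true_watch_prob = {"video_a": 0.1, "video_b": 0.6, "video_c": 0.3}   # unknown to the agent

estimates = {v: 0.0 for v in true_watch_prob}   # agent's running reward estimate per video
counts = {v: 0 for v in true_watch_prob}
epsilon = 0.1                                   # exploration rate

for step in range(5000):
    # Action: the recommender picks which video to show.
    if random.random() < epsilon:
        video = random.choice(list(true_watch_prob))      # explore
    else:
        video = max(estimates, key=estimates.get)         # exploit the best estimate so far
    # Reward: simulated user feedback (1 = watched/liked, 0 = ignored/"Not interested").
    reward = 1 if random.random() < true_watch_prob[video] else 0
    counts[video] += 1
    estimates[video] += (reward - estimates[video]) / counts[video]  # incremental mean

print(estimates)   # estimates converge toward the hidden watch probabilities
```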
Pros
- Handles complex sequential decisions.
- Can learn optimal strategies without explicit rules.
- Mimics human/animal learning.
Cons
- Data and compute intensive.
- Reward design is tricky.
- Training can be unstable.
Use Cases
- Game AI: AlphaGo defeating world champions.
- Robotics: teaching robots to walk, grasp, or navigate.
- Finance: algorithmic trading strategies.
- Dynamic pricing in e-commerce.
The flowchart below shows how RLHF applies these ideas to align a base LLM into an assistant such as ChatGPT:

flowchart TD
A[Prompt] --> B[Base LLM generates multiple responses]
B --> C[Human labelers rank responses]
C --> D[Reward Model learns preferences]
D --> E[Fine-tune LLM with Reinforcement Learning]
E --> F[Aligned ChatGPT]
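The "Reward Model learns preferences" step of this pipeline can be sketched as training a scorer on pairwise human rankings. The PyTorch snippet below is a heavily simplified illustration: the embedding size, MLP scorer, and random "preference" tensors are all invented for the example, and the final RL fine-tuning step (e.g., PPO) is deliberately omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: in real RLHF this is an LLM backbone with a scalar head;
# here a small MLP over made-up fixed-size "response embeddings" stands in for it.
EMB_DIM = 16                                    # illustrative embedding size
reward_model = nn.Sequential(nn.Linear(EMB_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake preference data standing in for human rankings: each pair holds the
# embedding of a response the labeler preferred and one they rejected.
chosen = torch.randn(64, EMB_DIM)
rejected = torch.randn(64, EMB_DIM)

for _ in range(200):
    r_chosen = reward_model(chosen).squeeze(-1)       # scalar score per preferred response
    r_rejected = reward_model(rejected).squeeze(-1)   # scalar score per rejected response
    # Pairwise (Bradley-Terry style) loss: preferred responses should score higher.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then supplies the reward signal when the LLM is
# fine-tuned with an RL algorithm such as PPO (that step is not shown here).
```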