Alignment evaluation: RLHF, DPO, and recent alternatives


When OpenAI described RLHF as the technique behind InstructGPT in 2022, language-model alignment seemed conceptually settled. RLHF was expensive but worked, and for a time it was the default method for any serious fine-tuning with human preferences. Three years later, that monopoly has faded: DPO showed that similar results could be achieved with far less effort, and a stream of variants (KTO, ORPO, SimPO) has since promised to simplify things further.

This post is a practical review of the current state: when each method makes sense, what it costs, what results it produces in real contexts, and what criteria to use for choosing. Not a theoretical intro, but applied reading for those building an alignment pipeline who have to make decisions.

The problem they all solve

The common goal is the same: start from a pre-trained and instruction-tuned model, and adjust it to respond according to preferences. Preferences can be “answer helpfully”, “don’t produce harmful content”, “adopt this style”, “prioritize brevity”, whatever fits the case.

All methods start from the same input: pairs of responses labeled “better” and “worse” according to some criterion (human preference, oracle comparison, or programmatic evaluation). What changes between methods is how that dataset is used to update the model’s weights.
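As a concrete illustration, a single training record in each format might look like this (the field names follow the prompt/chosen/rejected convention used by libraries such as HuggingFace TRL, but the exact schema depends on your framework, and the example text is invented):

```python
# Pairwise format: what RLHF, DPO, ORPO, and SimPO consume.
preference_example = {
    "prompt": "Summarize the main risks of project X in two sentences.",
    "chosen": "The two main risks are schedule slippage and vendor lock-in.",
    "rejected": "Project X is a project. It has risks.",
}

# Binary-label format: what KTO consumes (no pairing, just a verdict).
binary_example = {
    "prompt": "Summarize the main risks of project X in two sentences.",
    "completion": "The two main risks are schedule slippage and vendor lock-in.",
    "label": True,  # desirable
}
```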

RLHF: the classic

RLHF first trains a reward model that predicts how much a human would prefer a given response. It then uses reinforcement learning (typically PPO) to optimize the language model against that reward, with a KL-divergence penalty that keeps the policy from drifting too far from the base model.
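The shape of the objective can be sketched in a few lines. This is a simplified per-sequence view with an illustrative `beta`; real implementations estimate the KL per token inside the PPO update loop:

```python
def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    """Per-sequence RLHF objective term: the reward-model score minus a
    KL penalty that keeps the policy close to the reference model.

    logprob_policy / logprob_ref are summed token log-probs of the sampled
    response under each model; beta is the KL coefficient (illustrative).
    """
    kl = logprob_policy - logprob_ref  # single-sample KL estimate
    return reward - beta * kl
```

Note the direction of the penalty: if the policy assigns its own sample a much higher log-prob than the reference does, the effective reward drops.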

Pros: the most studied method, produces high-quality results, and the reward model can evaluate new responses outside training.

Cons: computationally expensive (three active models during training: policy, reference, reward), unstable (hyperparameter tuning is notoriously hard), and requires a large, high-quality preference dataset to work well.

When to use it: when you have resources (ML research team, compute budget, time to iterate), when you want a reusable reward model for offline evaluation, or when the alignment criterion is complex and you need an explicit reward model representing it.

DPO: the simplification

DPO (Direct Preference Optimization) arrived in 2023 with a simple idea: you can reformulate the RLHF problem such that the implicit reward model is the language model itself, and then you can train directly against preference pairs without PPO, without a separate reward model, without the complexity.
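The whole method reduces to a pairwise loss over log-probabilities. A minimal pure-Python sketch (scalar summed log-probs and an illustrative `beta`; real implementations batch this over tokens in a framework like PyTorch):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss for one preference pair.

    Inputs are summed token log-probs of each response under the policy
    and the frozen reference model; beta scales the implicit reward.
    """
    # Implicit rewards: beta * log-ratio against the reference model.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): pushes the chosen response's implicit
    # reward above the rejected one's.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a zero margin the loss is log 2, and it falls as the policy separates the chosen response from the rejected one relative to the reference.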

In practice, DPO works surprisingly well. The resulting model’s quality is very close to RLHF on most benchmarks, and training cost is a fraction. This has made DPO the most common method in the open community today: practically all public fine-tunes of Llama, Mistral, or Gemma use it.

Pros: much simpler to implement, much faster to train, less sensitive to hyperparameters, works well with smaller datasets.

Cons: doesn’t produce a reusable reward model, is slightly worse than well-tuned RLHF on very complex tasks, and tends to produce longer responses than the base model (a well-documented bias).

When to use it: for most standard alignment cases, especially if you’re in an open-source environment, with moderate resources and a modest preference dataset.

Recent variants

Over the last year, several DPO alternatives have appeared trying to improve one or another aspect.

KTO (Kahneman-Tversky Optimization) borrows from behavioral economics reasoning: instead of preference pairs, it uses responses individually labeled as desirable or undesirable. The advantage is that datasets are easier to build (no comparisons needed, just labels), and result quality is comparable to DPO in many cases. If your data source is naturally binary (good vs bad responses without pairing), KTO fits better.
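A simplified sketch of the per-example loss, assuming the batch-level KL reference point `kl_estimate` is given (the actual method estimates it from unpaired data, and the lambda weights here are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(logp, ref_logp, desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO loss for a single labeled response.

    r is the implicit reward (policy/reference log-ratio), measured
    against a KL reference point; lambda_d / lambda_u weight desirable
    vs undesirable examples, mirroring loss aversion.
    """
    r = beta * (logp - ref_logp)
    if desirable:
        return lambda_d * (1.0 - sigmoid(r - kl_estimate))
    return lambda_u * (1.0 - sigmoid(kl_estimate - r))
```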

ORPO (Odds Ratio Preference Optimization) merges supervised fine-tuning and preference alignment into a single step, eliminating the separate SFT phase. This simplifies the pipeline and, in some papers, delivers results equal to or better than SFT + DPO. Interesting for anyone who wants a shorter pipeline.
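A simplified per-pair sketch of the objective: the usual SFT loss on the chosen response plus a log-odds-ratio penalty, with no reference model. Inputs are length-normalized log-probs, and the weighting `lam` is illustrative:

```python
import math

def orpo_loss(nll_chosen, avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """Simplified ORPO objective for one preference pair.

    nll_chosen is the standard SFT loss on the chosen response;
    avg_logp_* are length-normalized log-probs under the policy.
    """
    def log_odds(avg_logp):
        p = math.exp(avg_logp)  # length-normalized sequence probability
        return avg_logp - math.log(1.0 - p)

    # -log sigmoid of the odds-ratio margin between chosen and rejected.
    margin = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return nll_chosen + lam * l_or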

SimPO (Simple Preference Optimization) reformulates DPO so that the implicit reward is the policy's own length-normalized log-probability, eliminating the need for a reference model. This reduces memory cost during training. Results are promising, but community adoption is still modest.
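The key difference from DPO fits in one function: the implicit reward depends only on the policy. A sketch with illustrative `beta` and `gamma` values (the paper tunes both per model), where the loss would be -log sigmoid of the returned margin:

```python
def simpo_margin(logp_chosen, len_chosen, logp_rejected, len_rejected,
                 beta=2.0, gamma=0.5):
    """SimPO's implicit reward: length-normalized log-prob under the
    policy alone, so no reference model is kept in memory.

    gamma is a target reward margin the chosen response must clear.
    """
    r_chosen = beta * (logp_chosen / len_chosen)
    r_rejected = beta * (logp_rejected / len_rejected)
    return r_chosen - r_rejected - gamma
```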

What happens in practice

In teams I’ve worked with or whose publications I’ve followed, the repeated pattern is this: for standard alignment of a mid-sized model (7B to 70B) with a reasonable human-preference dataset, DPO is the default and there’s no strong reason to change.

KTO gets adopted when the dataset comes in binary format (for example, production user feedback with approved and rejected answers but no explicit pairs). ORPO gets tried when simplifying the pipeline is the goal, and it works particularly well when combined with instruction data of the same style as preferences. SimPO shows up more in papers than in production.

RLHF is reserved for very specific cases: frontier models where the last percentage points matter a lot, cases needing a reusable reward model for continuous evaluation, or when the alignment criterion is complex and multi-dimensional.

Concrete recommendations

If you’re starting an alignment project today, my pragmatic suggestion is:

Start with DPO. It has the most documentation, tutorials, and public implementations. Any decent framework (HuggingFace’s TRL, Axolotl, Unsloth) gets you to a working pipeline in a day.
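As a rough starting point, a minimal DPO run with TRL looks something like the sketch below. The model and dataset names are placeholders, and the DPOTrainer/DPOConfig signature has shifted between TRL versions, so treat this as orientation rather than a drop-in script:

```python
# Placeholder model/dataset names; check your installed TRL version's
# docs, since the trainer signature has changed across releases.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("your-sft-model")
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")
# Expects prompt / chosen / rejected columns.
dataset = load_dataset("your-preference-dataset", split="train")

config = DPOConfig(output_dir="dpo-output", beta=0.1,
                   per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=config,
                     train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```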

Use a preference dataset of at least a few thousand examples. If you have fewer, KTO can be more efficient because binary labeling is cheaper.

Evaluate with a held-out validation set, combining automated metrics (ROUGE, BLEU) with human evaluation on a random sample. Pure automated evaluation is misleading, especially with DPO, which tends to inflate response length.

Only move to RLHF if, after trying DPO, you have concrete evidence of quality shortfall for your specific case. In most projects that evidence doesn’t show up.

And try a recent variant (KTO, ORPO) if your dataset or pipeline fits its premises better. Don’t do it for fashion, but because the method genuinely fits your scenario.

My read

My takeaway after three years of evolution in this space is that the barrier to aligning an open model effectively has collapsed. What in 2022 was a research-lab project is today weekend fine-tuning on a moderate GPU budget. That doesn't mean the problem is conceptually solved (there are important open questions about safely aligning frontier models), but the routine practice of adjusting a model to the style or task you need is now accessible to any technical team.

The existence of new variants every few months is a sign the space is alive. Not all methods will survive, and probably within two years we’ll remember several fondly as instructive experiments rather than production techniques. But the general trend, toward less complexity and more accessibility, is clearly good for the ecosystem.
