
Reinforcement Learning from Human Feedback (RLHF) examples: Direct Preference Optimization (DPO) #513

Open
danilopeixoto opened this issue Mar 1, 2024 · 8 comments · May be fixed by #1279
Labels
enhancement New feature or request

Comments


danilopeixoto commented Mar 1, 2024

Introduce a Reinforcement Learning from Human Feedback (RLHF) example, such as the Direct Preference Optimization (DPO) method.

Paper

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Notes

Direct Preference Optimization (DPO): A Simplified Explanation by João Lages

Implementation examples

Possible MLX implementation

Policy and reference log probabilities:

import mlx.core as mx
import mlx.nn as nn

def get_batched_logps(model, inputs, targets):
    logits, _ = model(inputs)
    logits = logits.astype(mx.float32)

    # Mask out padding tokens (assumes 0 is the padding id).
    loss_mask = targets != 0
    # Log-probability of each target token under the model.
    per_token_logps = mx.take_along_axis(nn.log_softmax(logits), targets[..., None], axis=2).squeeze(2)

    # Sum over the sequence and split the batch into (chosen, rejected) halves.
    return tuple((per_token_logps * loss_mask).sum(-1).split(2))
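The split(2) above assumes the chosen and rejected sequences are packed into a single batch, chosen first and rejected second. A minimal packing sketch, assuming hypothetical pre-tokenized, zero-padded arrays chosen_tokens and rejected_tokens:

import mlx.core as mx

# Hypothetical zero-padded token arrays of shape (batch, seq_len).
chosen_tokens = mx.array([[1, 5, 9, 2, 0, 0], [1, 7, 3, 4, 2, 0]])
rejected_tokens = mx.array([[1, 5, 8, 2, 0, 0], [1, 7, 6, 6, 2, 0]])

# Stack chosen then rejected so split(2) in get_batched_logps separates them again.
tokens = mx.concatenate([chosen_tokens, rejected_tokens], axis=0)

# Standard next-token shift; padding (0) is masked out by loss_mask above.
inputs, targets = tokens[:, :-1], tokens[:, 1:]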

Loss:

def dpo_loss(model, beta, label_smoothing, reference_chosen_logps, reference_rejected_logps, inputs, targets):
    chosen_logps, rejected_logps = get_batched_logps(model, inputs, targets)

    # Log-ratios of chosen vs. rejected completions under the policy and the reference model.
    pi_logratios = chosen_logps - rejected_logps
    reference_logratios = reference_chosen_logps - reference_rejected_logps

    # DPO logistic loss, optionally label-smoothed (conservative DPO).
    logits = pi_logratios - reference_logratios
    losses = -nn.log_sigmoid(beta * logits) * (1.0 - label_smoothing) - nn.log_sigmoid(-beta * logits) * label_smoothing

    # Implicit rewards and diagnostics.
    chosen_rewards = beta * (chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (rejected_logps - reference_rejected_logps)
    reward_accuracies = (chosen_rewards > rejected_rewards).astype(mx.float32)
    reward_margins = chosen_rewards - rejected_rewards

    # Number of non-padding tokens in the batch (assumes 0 is the padding id).
    ntoks = (inputs != 0).sum()

    return (
        losses.mean(),
        chosen_rewards.mean(),
        rejected_rewards.mean(),
        reward_accuracies.mean(),
        reward_margins.mean(),
        ntoks,
    )
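
For reference, the losses above correspond to the label-smoothed (conservative) DPO objective from the paper, with ε = label_smoothing and z equal to logits in the code:

$$\mathcal{L}_{\text{DPO}} = -(1 - \epsilon)\,\log\sigma(\beta z) - \epsilon\,\log\sigma(-\beta z), \qquad z = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$$

Setting ε = 0 recovers the standard DPO loss.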

Beta: The temperature parameter for the DPO loss, typically set in the range 0.1 to 0.5. When beta equals 0, the reference model is ignored.

Label smoothing: This parameter controls the conservativeness of the DPO loss, assuming that preference labels are noisy and may be flipped with probability label_smoothing.

Note that label_smoothing > 0 defines the conservative DPO loss.
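
Putting these pieces together, a minimal training-step sketch (the names policy, reference, optimizer, and the inputs/targets batch are assumptions, not part of the proposal above):

import mlx.core as mx
import mlx.nn as nn

beta, label_smoothing = 0.1, 0.0

# Reference log-probabilities are computed once per batch with the frozen reference model.
reference_chosen_logps, reference_rejected_logps = map(
    mx.stop_gradient, get_batched_logps(reference, inputs, targets)
)

def loss_fn(model):
    # Only the mean loss is differentiated; the remaining values are diagnostics.
    loss, *_ = dpo_loss(
        model, beta, label_smoothing,
        reference_chosen_logps, reference_rejected_logps,
        inputs, targets,
    )
    return loss

loss_and_grad = nn.value_and_grad(policy, loss_fn)
loss, grads = loss_and_grad(policy)
optimizer.update(policy, grads)
mx.eval(policy.parameters(), optimizer.state)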


awni commented Mar 1, 2024

@danilopeixoto I've been thinking about having this in MLX LM recently. Any interest in sending a PR?

It might make sense to do it after we have a more manageable config (#503), but that should be landed soon!

awni added the enhancement label Mar 1, 2024

awni commented Mar 1, 2024

To be more concrete, I'm envisioning you just set the loss in the config, e.g. cross_entropy or dpo.
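
A hedged sketch of what that dispatch could look like (the loss config key, the get_loss_fn helper, and default_loss as a stand-in for the trainer's existing cross-entropy loss are assumptions, not existing mlx_lm API):

# Hypothetical mapping from the config's loss name to a loss function.
LOSSES = {
    "cross_entropy": default_loss,
    "dpo": dpo_loss,
}

def get_loss_fn(config):
    # config would be the parsed YAML/CLI training configuration.
    return LOSSES[config.get("loss", "cross_entropy")]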

@ivanfioravanti
Contributor

This would be an awesome addition to mlx_examples! 🔥

@N8python
Contributor

I'm very, very excited for this! I don't have the technical expertise to implement DPO directly but would love to help in other ways (config, code cleanup) if necessary!


lin72h commented Mar 27, 2024

That makes MLX really useful for production, not just a research tool!

awni mentioned this issue Apr 10, 2024
@kishoretvk

+500 waiting for this


developerlin commented May 16, 2024

Waiting for this. When will DPO training be supported?

anupamme added a commit to anupamme/mlx-examples that referenced this issue Feb 12, 2025
Fixes ml-explore#513

Implement the Direct Preference Optimization (DPO) method as a Reinforcement Learning from Human Feedback (RLHF) example.

* **Add DPO Functions**: Add `get_batched_logps` and `dpo_loss` functions to `llms/mlx_lm/utils.py` for DPO implementation.
* **Update Training Logic**: Update `llms/mlx_lm/tuner/trainer.py` to include DPO-specific training logic, including a new `dpo_loss` function and condition to check for DPO loss in the training loop.
* **Add Configuration Options**: Add configuration options for DPO in `llms/mlx_lm/examples/lora_config.yaml`.
* **Update Documentation**: Update `llms/mlx_lm/README.md` to include instructions for using DPO.
* **Add Unit Tests**: Add `llms/tests/test_dpo.py` with unit tests for `get_batched_logps`, `dpo_loss`, and DPO-specific training logic.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/ml-explore/mlx-examples/issues/513?shareId=XXXX-XXXX-XXXX-XXXX).
anupamme linked a pull request Feb 12, 2025 that will close this issue
@Goekdeniz-Guelmez
Contributor

#1233 #1210 #1209
