aelydens

Smol Course I - Fine Tuning and Alignment


For part of my time at the Recurse Center, I’ve been learning more about ML by working through the smol course from Hugging Face with a fellow Recurser, Sam. The smol course is a hands-on introduction to language model alignment. It’s been a great intro to fine tuning a model from scratch and learning more about alignment strategies; I highly recommend it.

At a high level, this is what Sam and I have done in the smol course so far:

Base model (raw text) → SFT (conversation/instruction following) → DPO (quality/alignment)

We’ve taken a base model, improved it via fine tuning, and then used DPO to further refine the quality of the model’s responses.

Chat templates

In section 1 of smol, we learned how to apply a chat template called ChatML to format our dataset, both to support supervised fine tuning and to help the model understand conversation structure. We first did this programmatically from scratch so we understood it, and then again with the Hugging Face trl library, which turned the work into a one-liner.

The chat template is useful in several ways: it makes the boundaries between speakers explicit and gives the model a consistent conversation structure to learn from.

Here's an example of what that might look like:

No chat template applied:

{
  "messages": [
    {
      "role": "user",
      "content": "How do I bake cookies?"
    },
    {
      "role": "assistant",
      "content": "Here's a basic recipe: Mix butter and sugar, add eggs and vanilla, combine flour and baking soda, then bake at 350°F for 10 minutes."
    }
  ]
}

ChatML chat template applied:

<|im_start|>user
How do I bake cookies?<|im_end|>
<|im_start|>assistant
Here's a basic recipe: Mix butter and sugar, add eggs and vanilla, combine flour and baking soda, then bake at 350°F for 10 minutes.<|im_end|>

With the chat template applied, the speakers and the flow of the conversation are clear(er).
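
For reference, here's a minimal sketch of the one-liner approach, using trl's setup_chat_format to attach a ChatML template to a base model and tokenizer, then rendering a conversation with transformers' apply_chat_template. The model name and the example messages are illustrative assumptions, not the exact setup we used:

# Sketch: attach a ChatML template to a base model/tokenizer, then render a conversation.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Adds the ChatML special tokens (<|im_start|>, <|im_end|>) and sets tokenizer.chat_template
model, tokenizer = setup_chat_format(model, tokenizer)

messages = [
    {"role": "user", "content": "How do I bake cookies?"},
    {"role": "assistant", "content": "Here's a basic recipe: ..."},
]

# The "one-liner": render the whole conversation as a ChatML-formatted string
print(tokenizer.apply_chat_template(messages, tokenize=False))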

Supervised fine tuning

With our dataset formatted, we then took a base model trained on raw text (SmolLM2-135M) and performed supervised fine tuning (SFT) on it. Via SFT, our base model could learn to be better at instruction following and conversational dialogue from the paired examples of inputs and expected outputs in our dataset.
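
Roughly, the training step looks something like the sketch below, using trl's SFTTrainer. The dataset, hyperparameters, and output path here are assumptions for illustration rather than the exact values we used:

# Sketch: supervised fine tuning of the base model on chat-formatted examples.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A chat-formatted dataset of paired inputs and expected outputs (assumed example)
dataset = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split="train")

training_args = SFTConfig(
    output_dir="./smollm2-135m-sft",     # assumed output path
    max_steps=1000,                      # assumed; tune for your setup
    per_device_train_batch_size=4,
    learning_rate=5e-5,
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",  # the base model trained on raw text
    args=training_args,
    train_dataset=dataset,
)
trainer.train()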

Direct preference optimization

Today, we took our previously fine tuned model and performed DPO (“direct preference optimization”) on it to improve the quality of the responses.

DPO is a form of alignment; it was introduced in the paper "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (Rafailov et al., 2023).

With SFT, we set the stage so that our model could generally handle conversation (alas, even with our interventions, it’s still a very bad conversationalist 🤖).

With DPO, we further tweaked it to match human preferences using a preference dataset made up of “chosen” and “rejected” response pairs. (Turns out this is the specific format DPO expects.)

The model gets trained on good and bad responses that were selected and graded (by a human, or perhaps another botty bot 🤖; the important thing is that the responses represent the preferences we want the model to learn).
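
Concretely, a single preference example pairs a prompt with a preferred and a rejected response. The field names below follow the common trl convention (prompt / chosen / rejected), and the responses are made up for illustration:

{
  "prompt": "How do I bake cookies?",
  "chosen": "Here's a basic recipe: Mix butter and sugar, add eggs and vanilla, combine flour and baking soda, then bake at 350°F for 10 minutes.",
  "rejected": "Just buy some."
}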

From smol:

DPO recasts preference alignment as a classification problem on human preference data.

DPO is apparently unique because it eliminates the need for a separate reward model and a whole reinforcement process, unlike regular old RLHF. I don’t know very much about RLHF, but I learned a bit more today. Corrections welcome!

This is my layperson attempt at describing it:

  1. In normal RLHF, we train a model to predict human preferences (this is the “reward model”).
  2. Then we use that model in our training process to teach our main model what’s good and bad.

RLHF is like training a judge to give feedback, and then having the main model gradually adjust through reinforcement learning based on the judge’s scores. DPO skips that entirely and optimizes the model directly using pairs of preferred vs. non-preferred responses: we are effectively telling the main model “this type of response is better than that one” by showing it examples. With DPO, we skip both an extra model (the reward model) and an extra training step (training that reward model).
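
Here's a rough sketch of what that looks like in practice with trl's DPOTrainer; note there is no reward model anywhere in the setup. The checkpoint path, dataset, and hyperparameters are assumptions for illustration (and depending on your trl version, the tokenizer argument may be named tokenizer instead of processing_class):

# Sketch: direct preference optimization on the SFT'd model, no reward model involved.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("./smollm2-135m-sft")  # SFT checkpoint (assumed path)
tokenizer = AutoTokenizer.from_pretrained("./smollm2-135m-sft")

# Preference dataset with "chosen"/"rejected" pairs (assumed example dataset)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="./smollm2-135m-dpo",  # assumed output path
    beta=0.1,                         # how strongly to stay close to the SFT model
    max_steps=1000,                   # assumed; tune for your setup
    per_device_train_batch_size=4,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()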

Other takeaways

In addition to the above, we learned:

Published on: 2024-12-14