Fine-Tuning an Open-Source LLM with Axolotl Using Direct Preference Optimization (DPO)

June 13, 2025


LLMs have unlocked countless new opportunities for AI applications. If you've ever wanted to fine-tune your own model, this guide will show you how to do it easily and without writing any code. Using tools like Axolotl and DPO, we'll walk through the process step by step.

What Is an LLM?

A Large Language Model (LLM) is a powerful AI model trained on huge amounts of text data (tens of trillions of characters) to predict the next set of words in a sequence. This has only become possible in the last 2-3 years thanks to advances in GPU compute, which allow such enormous models to be trained in a matter of weeks.

You've likely interacted with LLMs through products like ChatGPT or Claude and have experienced firsthand their ability to understand and generate human-like responses.

Why Fine-Tune an LLM?

Can't we just use GPT-4o for everything? Well, while it's the strongest model we have at the time of writing, it's not always the most practical choice. Fine-tuning a smaller model, ranging from 3 to 14 billion parameters, can yield comparable results at a small fraction of the cost. Moreover, fine-tuning lets you own your intellectual property and reduces your reliance on third parties.

Understanding Base, Instruct, and Chat Models

Before diving into fine-tuning, it's important to understand the different types of LLMs that exist:

  • Base Models: These are pretrained on massive amounts of unstructured text, such as books or web data. While they have an intrinsic understanding of language, they aren't optimized for inference and will produce incoherent outputs. Base models are developed to serve as a starting point for developing more specialized models.
  • Instruct Models: Built on top of base models, instruct models are fine-tuned using structured data like prompt-response pairs. They're designed to follow specific instructions or answer questions.
  • Chat Models: Also built on base models, but unlike instruct models, chat models are trained on conversational data, enabling them to engage in back-and-forth dialogue.

What Are Reinforcement Learning and DPO?

Reinforcement Learning (RL) is a technique where models learn by receiving feedback on their actions. It's applied to instruct or chat models in order to further refine the quality of their outputs. Typically, RL is not done on top of base models, as it uses a much lower learning rate, which would not move the needle enough.

DPO is a form of RL where the model is trained using pairs of good and bad answers for the same prompt/conversation. By presenting these pairs, the model learns to favor the good examples and avoid the bad ones.
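
To make that concrete, the objective from the original DPO paper (Rafailov et al., 2023) compares the model being trained, π_θ, against a frozen reference copy, π_ref, on triples of a prompt x, a chosen answer y_w, and a rejected answer y_l:

L_DPO = -E[ log σ( β·log(π_θ(y_w|x) / π_ref(y_w|x)) - β·log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

Here σ is the sigmoid function and β controls how far the trained model may drift from the reference (common implementations default to a value around 0.1). You won't need to compute any of this yourself, but it explains why DPO needs paired answers rather than single examples.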

When to Use DPO

DPO is particularly useful when you want to adjust the style or behavior of your model, for example:

  • Style Adjustments: Modify the length of responses, the level of detail, or the degree of confidence expressed by the model.
  • Safety Measures: Train the model to decline to answer potentially unsafe or inappropriate prompts.

However, DPO is not suitable for teaching the model new knowledge or facts. For that purpose, Supervised Fine-Tuning (SFT) or Retrieval-Augmented Generation (RAG) techniques are more appropriate.

Creating a DPO Dataset

In a production environment, you would typically generate a DPO dataset using feedback from your users, for example by:

  • User Feedback: Implementing a thumbs-up/thumbs-down mechanism on responses.
  • Comparative Choices: Presenting users with two different outputs and asking them to choose the better one.

If you lack user data, you can also create a synthetic dataset by leveraging larger, more capable LLMs. For example, you could generate bad answers using a smaller model and then use GPT-4o to correct them, as sketched below.
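
As a rough sketch of that approach using the openai Python client (the prompt, the hard-coded rejected answer, and the final record layout are illustrative assumptions; match the layout to whatever format your training tool expects):

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

prompt = "Explain what overfitting is in two sentences."  # hypothetical example prompt

# The "rejected" answer would normally come from your smaller, weaker model
rejected_answer = "Overfitting is when a model is too big."

# Ask a stronger model (GPT-4o here) for a better answer to use as the "chosen" example
chosen_answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# One DPO preference pair: a conversation plus a chosen and a rejected reply
pair = {
    "conversation": [{"role": "user", "content": prompt}],
    "chosen": {"role": "assistant", "content": chosen_answer},
    "rejected": {"role": "assistant", "content": rejected_answer},
}
print(pair)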

For simplicity, we'll use a ready-made dataset from HuggingFace: olivermolenschot/alpaca_messages_dpo_test. If you inspect the dataset, you'll notice it contains prompts with chosen and rejected answers; these are the good and bad examples. This data was created synthetically using GPT-3.5-turbo and GPT-4.
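
To peek at the data yourself, you can load it with the datasets library; this minimal sketch only downloads the dataset and prints one row (it assumes the default train split):

from datasets import load_dataset

# Pull the DPO dataset from the HuggingFace Hub
ds = load_dataset("olivermolenschot/alpaca_messages_dpo_test", split="train")

print(ds)      # row count and column names (conversation, chosen, rejected)
print(ds[0])   # one prompt together with its chosen and rejected answers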

You'll typically need at least 500 to 1,000 pairs of data to achieve effective training without overfitting. The largest DPO datasets contain as many as 15,000-20,000 pairs.

Fine-Tuning Qwen2.5 3B Instruct with Axolotl

We'll be using Axolotl to fine-tune the Qwen2.5 3B Instruct model, which currently ranks at the top of the OpenLLM Leaderboard for its size class. With Axolotl, you can fine-tune a model without writing a single line of code; all you need is a YAML configuration file. Below is the config.yml we'll use:

base_model: Qwen/Qwen2.5-3B-Instruct
strict: false

# Axolotl will automatically map the dataset from HuggingFace to the prompt template of Qwen 2.5
chat_template: qwen_25
rl: dpo
datasets:
  - path: olivermolenschot/alpaca_messages_dpo_test
    type: chat_template.default
    field_messages: conversation
    field_chosen: chosen
    field_rejected: rejected
    message_field_role: role
    message_field_content: content

# We pick a directory inside /workspace since that is typically where cloud hosts mount the volume
output_dir: /workspace/dpo-output

# Qwen 2.5 supports up to 32,768 tokens with a max generation of 8,192 tokens
sequence_len: 8192

# Sample packing does not currently work with DPO. Pad to sequence length is added to avoid a Torch bug
sample_packing: false
pad_to_sequence_len: true

# Add your WandB account if you want to get nice reporting on your training performance
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

# Can make training more efficient by batching multiple rows together
gradient_accumulation_steps: 1
micro_batch_size: 1

# Do one pass over the dataset. Can be set to a higher number like 2 or 3 for multiple passes
num_epochs: 1

# Optimizers don't make much of a difference when training LLMs. Adam is the standard
optimizer: adamw_torch

# DPO requires a smaller learning rate than regular SFT
lr_scheduler: constant
learning_rate: 0.00005

# Train in bf16 precision since the base model is also bf16
bf16: auto

# Reduces memory requirements
gradient_checkpointing: true

# Makes training faster (only supported on Ampere, Ada, or Hopper GPUs)
flash_attention: true

# Can save multiple times per epoch to get several checkpoint candidates to compare
saves_per_epoch: 1

logging_steps: 1
warmup_steps: 0

Setting Up the Cloud Environment

To run the training, we'll use a cloud hosting service like RunPod or Vultr. Here's what you'll need:

  • Docker Image: Pull the winglian/axolotl-cloud:main Docker image provided by the Axolotl team (a sample launch command is sketched after this list).
  • *Hardware Requirements: An 80GB VRAM GPU (like a 1×A100 PCIe node) will be more than sufficient for a model of this size.
  • Storage: 200GB of volume storage will accommodate all the files we need.
  • CUDA Version: Your CUDA version should be at least 12.1.

*This type of training is considered a full fine-tune of the LLM and is therefore very VRAM intensive. If you'd like to run the training locally without relying on cloud hosts, you could attempt to use QLoRA, which is a form of Supervised Fine-Tuning. Although it's theoretically possible to combine DPO and QLoRA, this is very seldom done.
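
If your host doesn't launch the container for you, starting the image by hand would look roughly like this (a sketch; the mounted volume path and the interactive shell are assumptions, and most cloud templates handle this step automatically):

docker run --gpus all -it \
  -v /workspace:/workspace \
  winglian/axolotl-cloud:main /bin/bash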

Steps to Start Training

  1. Set the HuggingFace cache directory:
export HF_HOME=/workspace/hf

This ensures that the original model downloads to our volume storage, which is persistent.

  2. Create the configuration file: Save the config.yml file we created earlier to /workspace/config.yml.
  3. Start training:
python -m axolotl.cli.train /workspace/config.yml

And voilà! Your training should start. After Axolotl downloads the model and the training data, you should see output similar to this:

[2024-12-02 11:22:34,798] [DEBUG] [axolotl.train.train:98] [PID:3813] [RANK:0] loading model

[2024-12-02 11:23:17,925] [INFO] [axolotl.train.train:178] [PID:3813] [RANK:0] Starting trainer...

The training should take just a few minutes to complete since this is a small dataset of only 264 rows. The fine-tuned model will be saved to /workspace/dpo-output.

Uploading the Model to HuggingFace

You can upload your model to HuggingFace using the CLI:

  1. Install the HuggingFace Hub CLI:
pip install huggingface_hub[cli]
  2. Upload the model:
huggingface-cli upload yourname/yourrepo /workspace/dpo-output

Replace yourname/yourrepo with your actual HuggingFace username and repository name.
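
If you prefer to stay in Python, the huggingface_hub library can do the same upload (a sketch; it assumes you're already authenticated via huggingface-cli login or an HF_TOKEN environment variable, and yourname/yourrepo is again a placeholder):

from huggingface_hub import HfApi

api = HfApi()

# Create the target repository if it doesn't exist yet, then upload the training output folder
api.create_repo("yourname/yourrepo", exist_ok=True)
api.upload_folder(folder_path="/workspace/dpo-output", repo_id="yourname/yourrepo")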

Evaluating Your Fine-Tuned Model

For evaluation, it's best to host both the original and the fine-tuned model using a tool like Text Generation Inference (TGI). Then, perform inference on both models with a temperature setting of 0 (to ensure deterministic outputs) and manually compare the responses of the two models.

This hands-on approach provides better insights than relying solely on training evaluation loss metrics, which may not capture the nuances of language generation in LLMs.
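
As a rough illustration, the comparison loop could look like the following (a sketch only: the ports, the test prompt, and the assumption that each model is already being served by TGI are placeholders, and for an instruct model you would normally also wrap the prompt in its chat template):

import requests

# Assumed: one TGI instance serves the original model, another the fine-tuned one
endpoints = {
    "original": "http://localhost:8080/generate",
    "fine-tuned": "http://localhost:8081/generate",
}

prompt = "Give me three tips for writing clear bug reports."  # hypothetical test prompt

for name, url in endpoints.items():
    response = requests.post(
        url,
        json={
            "inputs": prompt,
            # Greedy decoding instead of sampling, so both outputs are deterministic
            "parameters": {"do_sample": False, "max_new_tokens": 256},
        },
        timeout=120,
    )
    print(f"--- {name} ---")
    print(response.json()["generated_text"])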

Conclusion

Fine-tuning an LLM using DPO lets you customize models to better suit your application's needs, all while keeping costs manageable. By following the steps outlined in this article, you can harness the power of open-source tools and datasets to create a model that aligns with your specific requirements. Whether you're looking to adjust the style of responses or implement safety measures, DPO provides a practical approach to refining your LLM.

Happy fine-tuning!
