Fine-tune a Mistral-7b model with Direct Preference Optimization | by Maxime Labonne | Jan, 2024

Boost the performance of your supervised fine-tuned models


Maxime Labonne


Towards Data Science
Image by author

Pre-trained Large Language Models (LLMs) can only perform next-token prediction, making them unable to answer questions. This is why these base models are then fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, etc. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.

RLHF provides different answers to the LLM, which are ranked according to a desired behavior (helpfulness, toxicity, etc.). The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural-chat-7b-v3-1.

In this article, we will create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 using an RLHF-like technique: Direct Preference Optimization (DPO). For this purpose, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We'll see that it significantly improves the performance of the base model on the Open LLM Leaderboard.

As per usual, the code is available on GitHub and Google Colab.

Preference datasets are not standardized, but they typically consist of a collection of answers that are ranked by humans. This ranking is essential, as the RLHF process fine-tunes LLMs to output the preferred answer. Here is an example of Anthropic/hh-rlhf, a popular preference dataset:

Image by author

The structure of the dataset is straightforward: for each row, there is one chosen (preferred) answer and one rejected answer. The goal of RLHF is to guide the model to output the preferred answer.
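To make the row structure concrete, here is a minimal sketch of a single preference record. The PreferencePair class and the sample texts are hypothetical and only illustrate the chosen/rejected layout; they are not taken from Anthropic/hh-rlhf, and some datasets fold the prompt into both answers instead of storing it separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One row of a preference dataset (hypothetical container)."""
    prompt: str    # what the model is asked
    chosen: str    # the preferred answer
    rejected: str  # the answer ranked lower


# Made-up example row, for illustration only
row = PreferencePair(
    prompt="How do I boil an egg?",
    chosen="Place the egg in boiling water for about 8 minutes, then cool it.",
    rejected="I don't know.",
)

print(row.chosen)
```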

Preference datasets are notoriously costly and difficult to make, as they require collecting manual feedback from humans. This feedback is also subjective and can easily be biased toward confident (but wrong) answers or contradict itself (different annotators have different values). Over time, several solutions have been proposed to tackle these issues, such as replacing human feedback with AI feedback (RLAIF).

These datasets also tend to be a lot smaller than fine-tuning datasets. To illustrate this, the excellent neural-chat-7b-v3-1 (best 7B LLM on the Open LLM Leaderboard when it was released) uses 518k samples for fine-tuning (Open-Orca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). In this case, the authors generated answers with GPT-4/3.5 to create the preferred answers, and with Llama 2 13b chat to create the rejected responses. It's a smart way to bypass human feedback and only rely on models with different levels of performance.
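The idea of pairing a stronger and a weaker model can be sketched in a few lines. This is only an illustration under assumptions: build_preference_rows is a hypothetical helper, and the answers below are made-up strings standing in for real model generations.

```python
def build_preference_rows(questions, strong_answers, weak_answers):
    """Pair each question with a stronger model's answer (chosen)
    and a weaker model's answer (rejected)."""
    if not (len(questions) == len(strong_answers) == len(weak_answers)):
        raise ValueError("All inputs must have the same length")
    return [
        {"question": q, "chosen": s, "rejected": w}
        for q, s, w in zip(questions, strong_answers, weak_answers)
    ]


# Hypothetical data standing in for real model generations
rows = build_preference_rows(
    ["What is 2 + 2?"],
    ["2 + 2 equals 4."],
    ["It is 5."],
)
print(rows[0]["chosen"])
```

Note that this shortcut assumes the stronger model is always right, which is not guaranteed for every sample.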

While the concept of RLHF has been used in robotics for a long time, it was popularized for LLMs in OpenAI's paper Fine-Tuning Language Models from Human Preferences. In this paper, the authors present a framework where a reward model is trained to approximate human feedback. This reward model is then used to optimize the fine-tuned model's policy using the Proximal Policy Optimization (PPO) algorithm.
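To give an intuition of what "trained to approximate human feedback" can mean, here is a sketch of a common pairwise (Bradley-Terry style) reward loss. This is an assumption chosen for illustration, not necessarily the exact objective used in the paper above: the reward model is pushed to score the preferred answer higher than the rejected one.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)), written to stay stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


loss_good = pairwise_reward_loss(2.0, -1.0)  # reward model agrees with the human ranking
loss_bad = pairwise_reward_loss(-1.0, 2.0)   # reward model disagrees
print(loss_good < loss_bad)
```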

Image by author

The core concept of PPO revolves around making smaller, incremental updates to the policy, as larger updates can lead to instability or suboptimal solutions. From experience, this technique is unfortunately still unstable (loss diverges), difficult to reproduce (numerous hyperparameters, sensitive to random seeds), and computationally expensive.
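The "smaller, incremental updates" idea can be illustrated with the clipped surrogate objective of the widely used PPO-Clip variant. This is a minimal per-sample sketch; the epsilon value of 0.2 is an assumed, commonly used setting, and real implementations work on batches of token-level values.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO-Clip objective for one sample.

    ratio     = new_policy_prob / old_policy_prob for the sampled action
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy change (ratio = 1.5) earns no more than the clipped value
print(clipped_surrogate(1.5, advantage=1.0))
print(clipped_surrogate(1.1, advantage=1.0))
```

Because the objective stops growing once the ratio leaves the [1 - epsilon, 1 + epsilon] range, the optimizer has no incentive to move the policy far from the previous one in a single update.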

This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. It means we're penalizing the LLM for bad answers and rewarding it for good ones.

Image by author

By using the LLM itself as a reward model and employing binary cross-entropy objectives, DPO efficiently aligns the model's outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter adjustments. It results in a more stable, more efficient, and computationally less demanding process.
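The description above can be sketched as a loss on a single preference pair. This is a simplified, dependency-free version under assumptions: the inputs are the summed log-probabilities of each answer under the trained and reference models, and the numbers below are made up. Real implementations (such as the one in TRL) operate on batches of tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: how much more likely each answer is under the
    # trained model than under the frozen reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Binary cross-entropy with target "chosen wins": -log(sigmoid(logits))
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)   # policy == reference
better = dpo_loss(-10.0, -15.0, -12.0, -12.0)  # chosen up, rejected down
print(start, better)
```

When the trained model is identical to the reference model, the loss equals log(2); it decreases as the chosen answer becomes more likely and the rejected one less likely, relative to the reference.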

In this example, we'll fine-tune the excellent OpenHermes-2.5-Mistral-7B, which is a Mistral-7b model that was only supervised fine-tuned. To this end, we'll use the Intel/orca_dpo_pairs dataset to align our model and improve its performance. We call this new model NeuralHermes-2.5-Mistral-7B.

The first step consists of installing the required libraries as follows.

pip install -q datasets trl peft bitsandbytes sentencepiece wandb

Once it's done, we can import the libraries. I'm also using the secrets tab in Google Colab to store my Hugging Face token.

import os
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb

# Defined in the secrets tab in Google Colab
hf_token = userdata.get('huggingface')
wb_token = userdata.get('wandb')

model_name = "teknium/OpenHermes-2.5-Mistral-7B"
new_model = "NeuralHermes-2.5-Mistral-7B"
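If you run this code outside of Google Colab, the google.colab import is not available. Here is a small sketch of a fallback that reads the token from an environment variable instead. The get_secret helper and the environment variable names are hypothetical, chosen for this example.

```python
import os


def get_secret(colab_key: str, env_var: str) -> str:
    """Read a token from Colab secrets, or from an environment variable elsewhere."""
    try:
        from google.colab import userdata  # only available inside Colab
    except ImportError:
        value = os.environ.get(env_var)
        if value is None:
            raise KeyError(f"Set the {env_var} environment variable")
        return value
    return userdata.get(colab_key)
```

For instance, get_secret("huggingface", "HF_TOKEN") would return the Colab secret named huggingface when running in Colab, and the content of the (assumed) HF_TOKEN variable otherwise.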


OpenHermes-2.5-Mistral-7B uses a specific chat template, called ChatML. Here is an example of a conversation formatted with this template:

<|im_start|>system
You are a helpful chatbot assistant.<|im_end|>
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hi, how can I help you?<|im_end|>

As you can see, ChatML defines different roles (system, user, assistant) and appends special tokens (<|im_start|> and <|im_end|>) to separate them. Moreover, DPOTrainer also requires a specific format with three columns: prompt, chosen, and rejected.
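To see how these special tokens fit together, here is a minimal pure-Python sketch that renders a list of messages as a ChatML string. The to_chatml function is a hypothetical stand-in written for this example: in practice, the template is stored with the tokenizer and may differ in small details.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "system", "content": "You are a helpful chatbot assistant."},
    {"role": "user", "content": "Hi"},
]
print(to_chatml(conversation, add_generation_prompt=True))
```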

Our dataset contains four columns: system, question, chatgpt, and llama2-13b-chat. We'll simply concatenate the system and question columns to the prompt column. We'll also map the chatgpt column to "chosen" and llama2-13b-chat to "rejected". To format the dataset in a reliable way, we'll use the tokenizer's apply_chat_template() function, which already uses ChatML.

def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset and drop the original columns
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

Let's print a sample of the formatted dataset to confirm that everything works as expected:


We can see that the prompt combines system and user instructions. Thanks to the add_generation_prompt=True argument, it also appends the beginning of the assistant's answer. If you want to skip this step, you can directly use the preprocessed dataset available as mlabonne/chatml_dpo_pairs.
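Beyond eyeballing one sample, we can also check the expected structure programmatically. The following is a small sketch: check_dpo_sample is a hypothetical helper, and the row below is a made-up example in the format described above.

```python
def check_dpo_sample(sample):
    """Return a list of problems found in one formatted row (empty if none)."""
    problems = []
    for column in ("prompt", "chosen", "rejected"):
        if column not in sample:
            problems.append(f"missing column: {column}")
    if problems:
        return problems
    if not sample["prompt"].endswith("<|im_start|>assistant\n"):
        problems.append("prompt does not open the assistant turn")
    for column in ("chosen", "rejected"):
        if not sample[column].endswith("<|im_end|>\n"):
            problems.append(f"{column} does not end with <|im_end|>")
    return problems


# Hypothetical row in the expected format
sample = {
    "prompt": "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n<|im_start|>assistant\n",
    "chosen": "2 + 2 equals 4.<|im_end|>\n",
    "rejected": "It is 5.<|im_end|>\n",
}
print(check_dpo_sample(sample))
```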

Next, we define the LoRA configuration to train the model. As described in Intel's blog post, we set the rank value to be equal to lora_alpha, which is unusual (the common rule of thumb is lora_alpha = 2 * r). We also target all the linear modules with adapters.

# LoRA configuration
peft_config = LoraConfig(
    # Set r and lora_alpha to the same value, as described above
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
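To see why the ratio between the rank and lora_alpha matters, here is a short sketch of the scaling factor used in standard LoRA (lora_alpha / r) and of the number of trainable parameters an adapter adds to one linear layer. The rank of 16 and the 4096 x 4096 layer shape are assumed values for illustration only.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Factor applied to the low-rank update (standard LoRA: alpha / r)."""
    return lora_alpha / r


def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is r x in, B is out x r."""
    return r * in_features + out_features * r


# r=16 and the 4096 x 4096 layer shape are assumed values for illustration
print(lora_scaling(r=16, lora_alpha=16))  # rank equal to alpha
print(lora_scaling(r=16, lora_alpha=32))  # the 2 * r rule of thumb
print(lora_param_count(4096, 4096, r=16))
```

With the rank equal to lora_alpha, the adapter's update is applied without amplification, whereas the 2 * r rule of thumb doubles it.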

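We're now ready to load the model we want to fine-tune with DPO. In this case, two models are required: the model to fine-tune as well as the reference model. Loading the reference model explicitly is mostly for the sake of readability, as the DPOTrainer object automatically creates a reference model if none is provided.

# Model to fine-tune (4-bit NF4 quantization, FP16 compute)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
)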

The final step consists of providing all the hyperparameters to TrainingArguments and DPOTrainer:

  • Among them, the beta parameter is unique to DPO since it controls the divergence from the initial policy (0.1 is a typical value for it).
  • Compared to the values described in Intel's blog post, we lower the learning rate (from 5e-4 to 5e-5) and the number of steps (from 1,000 to 200). I manually optimized these values after a few runs to stabilize training and achieve the best results.

We can now start training the model. Note that it requires an A100 GPU and takes about 1 hour to complete the training.

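# Training arguments (only the hyperparameters stated in the text are set here)
training_args = TrainingArguments(
    output_dir=new_model, report_to="wandb",
    learning_rate=5e-5, max_steps=200, warmup_steps=100,
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model, ref_model, args=training_args, train_dataset=dataset,
    tokenizer=tokenizer, peft_config=peft_config, beta=0.1,
)

# Fine-tune model with DPO
dpo_trainer.train()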

Our model is now fine-tuned. You can check the project on Weights & Biases at this address. Here are some interesting metrics to analyze:

Image by author

Interestingly, the training loss quickly drops to zero (before 50 steps), despite 100 warmup steps. Meanwhile, the other metrics keep evolving.
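To put the "despite 100 warmup steps" remark into perspective, here is a sketch of the learning rate during warmup. It assumes a linear warmup from zero to the peak value, with the peak learning rate (5e-5) and warmup length (100 steps) mentioned in this article.

```python
def warmup_lr(step: int, peak_lr: float = 5e-5, warmup_steps: int = 100) -> float:
    """Learning rate during a linear warmup from 0 to peak_lr."""
    if step >= warmup_steps:
        return peak_lr  # what happens after warmup depends on the scheduler
    return peak_lr * step / warmup_steps


for step in (10, 50, 100):
    print(step, warmup_lr(step))
```

Under this assumption, the learning rate at step 50 is only half of its peak value, which makes the early drop in training loss even more striking.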

The train/rewards/chosen and train/rewards/rejected plots correspond to the mean difference between the log probabilities output by the trained and reference models. It makes sense that, over time, they diverge as our trained model learns the preferred answers. The train/rewards/margins plot also shows the difference between these two plots. Finally, the train/rewards/accuracies plot shows the frequency of choosing the preferred answer. The trained model quickly reaches a perfect accuracy score, which is a good sign but could also mean that the difference between preferred and rejected answers is too obvious.
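These metrics can be reproduced on a toy batch. The sketch below is a simplified, dependency-free approximation of how such statistics are typically computed, with rewards scaled by beta; the log-probabilities are made up and dpo_reward_metrics is a hypothetical helper.

```python
def dpo_reward_metrics(pairs, beta=0.1):
    """Compute reward statistics over a batch of preference pairs.

    Each pair is a tuple of sequence log-probabilities:
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    chosen = [beta * (pc - rc) for pc, _, rc, _ in pairs]
    rejected = [beta * (pr - rr) for _, pr, _, rr in pairs]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Fraction of pairs where the chosen answer gets the higher reward
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }


# Hypothetical log-probabilities for two pairs
batch = [(-10.0, -15.0, -12.0, -12.0), (-11.0, -10.0, -11.0, -11.0)]
print(dpo_reward_metrics(batch))
```

In this toy batch, the first pair is ranked correctly and the second is not, so the accuracy is 0.5.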

Now that it's trained, we can merge the adapter with the original model. Next, we save the merged model and the tokenizer before pushing them to the Hugging Face Hub.

# Save artifacts
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")

# Flush memory
del dpo_trainer, model, ref_model
gc.collect()
torch.cuda.empty_cache()

# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, "final_checkpoint")
model = model.merge_and_unload()

# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Push them to the HF Hub
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
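Conceptually, merging a LoRA adapter folds the low-rank update into the original weights, so that no extra matrices are needed at inference time. Here is a toy sketch of that operation on plain Python lists, assuming the standard LoRA formulation W + (lora_alpha / r) * B @ A. The matrices are made-up numbers, and merge_lora is a hypothetical helper rather than the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, r, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A, the merged weight matrix."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> out x in
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2 x 2 layer with rank-1 adapter matrices (made-up numbers)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x in
B = [[0.5], [0.0]]  # out x r
print(merge_lora(W, A, B, r=1, lora_alpha=1))
```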


Let's see how our model performs in a real test. We'll format the prompt to ask a basic question: "What is a Large Language Model?"

# Format prompt
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"},
]
tokenizer = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer,
)

# Generate text (generation settings are left at their defaults here)
sequences = pipeline(prompt)
print(sequences[0]["generated_text"])

Here's the answer from the model:

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks (RNNs) or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures.

Everything seems to be working, so we can now evaluate the merged model. As this is a general-purpose model, we can leverage the lm-evaluation-harness to evaluate it. As the process is quite resource-intensive, we can also directly submit it for evaluation on the Open LLM Leaderboard. It took a few days, but here are the results compared to other OpenHermes models:

Image by author

Compared to the original model, NeuralHermes-2.5-Mistral-7B improved the average score by 6.7 points (particularly on GSM8K). This is an unexpectedly large improvement, which showcases the power of Direct Preference Optimization.

In this article, we fine-tuned an already supervised fine-tuned model using DPO and created our own NeuralHermes-2.5 model. By leveraging a high-quality preference dataset, we created a sample-efficient fine-tuning pipeline that produced a significant improvement on the Open LLM Leaderboard. If you want to give it a try, you can find quantized variants of this model or use this Hugging Face Space.

Note that our fine-tuning pipeline can still be improved in different ways. For example, the preference dataset is still quite raw and could be improved with more filtering and by using different models. In addition, numerous hyperparameters can still be tweaked to achieve better results. In particular, the learning rate can still be lowered to train the model on more steps and inject more preference data.
