A Rapid Tutorial on Unsloth
Unsloth is a library for fast and efficient fine-tuning of large language models. It is built on top of the Hugging Face Transformers library and can be used to fine-tune models on a variety of tasks. In this tutorial we'll train a Llama 3.1 model to do text-to-SQL generation.
Unsloth advertises roughly 2x faster training and about 60% less memory use than standard fine-tuning on single-GPU setups. It uses a technique called Quantized Low-Rank Adaptation (QLoRA), which quantizes the frozen base model and trains a small number of additional low-rank adapter parameters instead of the entire model. This lets us train the model on a single NVIDIA H100 GPU.
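To get a feel for why this is cheap, consider what LoRA adds to a single weight matrix: the base matrix stays frozen and two small low-rank factors are trained instead. The dimensions below are purely illustrative, using the same rank of 16 that we configure later.
# LoRA replaces the update to a frozen (d_out x d_in) weight matrix with two
# small factors B (d_out x r) and A (r x d_in), so only r * (d_out + d_in)
# parameters are trained per adapted matrix.
d_out, d_in, r = 4096, 4096, 16  # illustrative attention projection size
frozen = d_out * d_in            # 16,777,216 frozen base weights
trainable = r * (d_out + d_in)   # 131,072 trainable LoRA weights
print(f"trainable fraction: {trainable / frozen:.2%}")  # ~0.78%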
We'll need to use raw pip to install flash-attn and xformers from source, along with Unsloth itself from GitHub.
# Install flash-attn
poetry run pip install flash-attn --no-build-isolation
# Install xformers
poetry run pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# Install unsloth
poetry run pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
The rest can be installed via poetry from PyPI.
[tool.poetry.dependencies]
trl = "^0.9.6"
nvitop = "^1.3.2"
xformers = "0.0.27"
peft = "^0.12.0"
accelerate = "^0.33.0"
bitsandbytes = "^0.43.3"
Now we can initialize our train.py script. Start by importing the following libraries:
import torch
from datasets import load_dataset
from transformers import EarlyStoppingCallback, TextStreamer, TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
We'll download the 8 billion parameter Llama 3.1 model that has been quantized to 4 bits. This means the weights have been compressed to 4 bits instead of 16, which makes the model far smaller in memory and faster to load, though it is still a fairly large model.
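As a rough back-of-the-envelope check of what 4-bit quantization buys us (weights only, ignoring activations, the KV cache, and optimizer state):
# 8B parameters at 16 bits (2 bytes) vs 4 bits (0.5 bytes) per weight.
n_params = 8e9
print(f"fp16 weights : ~{n_params * 2.0 / 1e9:.0f} GB")  # ~16 GB
print(f"4-bit weights: ~{n_params * 0.5 / 1e9:.0f} GB")  # ~4 GB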
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "up_proj",
        "down_proj",
        "o_proj",
        "gate_proj",
    ],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
)
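Before moving on, it's worth confirming how small the trainable adapter actually is. Unsloth's get_peft_model hands back a PEFT-wrapped model, so the standard PEFT helper print_trainable_parameters() should be available; the count it reports should line up with the "Number of trainable parameters" line in the training output further down.
# Only the LoRA adapter weights are trainable; the 4-bit base weights stay frozen.
model.print_trainable_parameters()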
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
)
We'll use Gretel's synthetic text-to-SQL dataset. Each example pairs a natural-language question (sql_prompt) with the SQL query that answers it (sql), and we format each pair as a ChatML conversation:
def convert(question, sql):
    # Format one example as a two-turn conversation: the natural-language
    # question as the user message and the SQL query as the assistant reply.
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": sql},
    ]


def apply_template(examples):
    # With batched=True the mapper receives a dict of columns, so zip the
    # question and SQL columns back into per-example conversations.
    text = [
        tokenizer.apply_chat_template(
            convert(question, sql), tokenize=False, add_generation_prompt=False
        )
        for question, sql in zip(examples["sql_prompt"], examples["sql"])
    ]
    return {"text": text}
dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
dataset = dataset.map(apply_template, batched=True)
# Hold out a small validation split so early stopping has an eval loss to monitor.
dataset = dataset.train_test_split(test_size=0.05, seed=0)
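It's worth printing one formatted example to confirm the ChatML markup looks the way you expect before training on it:
# Inspect one formatted training example.
print(dataset["train"][0]["text"])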
We'll use the 8-bit AdamW optimizer with a cosine-annealing-with-warm-restarts learning rate scheduler. In addition we'll add early stopping, which monitors the loss on the held-out split we created above and stops training once it stops improving, to prevent overfitting.
callbacks = [
    EarlyStoppingCallback(early_stopping_patience=3),
]
training_args = TrainingArguments(
    learning_rate=3e-4,
    lr_scheduler_type="cosine_with_restarts",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=6,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.05,
    warmup_steps=10,
    # Evaluate and checkpoint each epoch so early stopping can track eval_loss.
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    output_dir="output",
    seed=0,
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=training_args,
    callbacks=callbacks,
)
trainer.train()
print("Done.")
After running this you should see output like the following (the exact figures depend on your batch size, dataset split, and other settings). Training runs for six epochs and should take about 2-3 hours to complete on an NVIDIA H100 GPU.
# ==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
# \\ /| Num examples = 2,254 | Num Epochs = 3
# O^O/ \_/ \ Batch size per device = 2 | Gradient Accumulation steps = 4
# \ / Total batch size = 8 | Total steps = 843
# "-____-" Number of trainable parameters = 29,884,416
Now we can switch the model into inference mode and use it to generate SQL for a new question.
model = FastLanguageModel.for_inference(model)
messages = [
    {
        "role": "user",
        "content": "What is the average salary of employees in the IT department?",
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True
)
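If you'd rather capture the generated SQL as a string instead of streaming it to stdout, you can decode the new tokens directly; a minimal variation on the call above:
outputs = model.generate(input_ids=inputs, max_new_tokens=128, use_cache=True)
# Strip the prompt tokens and decode only the newly generated ones.
generated_sql = tokenizer.decode(
    outputs[0][inputs.shape[-1]:], skip_special_tokens=True
)
print(generated_sql)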
The model can also be exported to a variety of formats.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
print("Merging model")
model.save_pretrained_merged("fused_model", tokenizer, save_method="merged_16bit")
print("Saving gguf")
model.save_pretrained_gguf("ggufs", tokenizer, quantization_method="f16")
model.save_pretrained_gguf("ggufs", tokenizer, quantization_method="q4_k_m")
You should see output like this:
==((====))== Unsloth: Conversion from QLoRA to GGUF information
\\ /| [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \ [1] Converting HF to GGUF 16bits will take 3 minutes.
\ / [2] Converting GGUF 16bits to ['f16'] will take 10 minutes each.
"-____-" In total, you will have to wait at least 16 minutes.
Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at lora-fused into f16 GGUF format.
The output location will be ./lora-fused/unsloth.F16.gguf
...
Writing: 100%|██████████| 7.64G/7.64G [01:18<00:00, 97.1Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to lora-fused/unsloth.F16.gguf
Unsloth: Conversion completed! Output location: ./lora-fused/unsloth.F16.gguf
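The saved LoRA directory can be loaded back later the same way the base model was loaded, so you can resume inference without re-training. A minimal sketch, assuming the lora_model directory saved above:
# Reload the fine-tuned adapters on top of the 4-bit base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)
model = FastLanguageModel.for_inference(model)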
And that's it! You now have a fine-tuned text-to-SQL generator. This can easily be adapted to other tasks with minimal changes to the data processing script.
From here you can also train the model with a variety of preference optimization techniques to further improve its performance.
- Kahneman-Tversky Optimization (KTO) can be used to align language models with binary feedback data.
- Direct Preference Optimization (DPO) can be used to align language models with pairwise preference data, as in the sketch below.
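As a rough illustration of what that next step could look like, here is a minimal DPO sketch using trl's DPOTrainer. The dataset name is a placeholder: DPO expects a preference dataset with prompt, chosen, and rejected columns that you would need to build or choose yourself, and the hyperparameters are illustrative rather than tuned.
from trl import DPOConfig, DPOTrainer

# Placeholder: any preference dataset with "prompt", "chosen" and "rejected" columns.
pref_dataset = load_dataset("your-org/your-preference-dataset", split="train")

dpo_args = DPOConfig(
    beta=0.1,  # strength of the KL penalty against the reference policy
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    output_dir="dpo_output",
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT model the frozen base weights act as the reference
    args=dpo_args,
    train_dataset=pref_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()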