Large Language Models (LLMs) are pre-trained on massive text corpora, often in the trillions of tokens. For example, Llama 2 was trained on approximately 2 trillion tokens of data. However, the base models aren't inherently suited for specific tasks or following instructions. This is where fine-tuning comes in.
Fine-tuning allows us to adapt models for specific downstream tasks, improve performance on domain-specific problems, align model behavior with human preferences, and create specialized variants of base models.
Pre-training represents the foundational phase in developing a large language model, where the model learns general language understanding from massive text corpora. During pre-training, the model processes trillions of tokens - for example, Llama 2 was trained on approximately 2 trillion tokens of data. This process involves initializing random tensors (multi-dimensional matrices) and training them to recognize statistical patterns and relationships between words and concepts.
The pre-training phase requires immense computational resources, often costing millions of dollars and requiring specialized hardware setups. Organizations typically use hundreds or thousands of GPUs running in parallel for weeks or months to complete pre-training. During this phase, the model learns through self-supervised techniques like causal language modeling, where it predicts the next token in a sequence given the preceding context.
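To make the objective concrete, here is a minimal sketch of causal language modeling using a small public model (gpt2 is used purely for illustration): the input tokens are also passed as labels, the model shifts them internally, and the returned loss is the average next-token cross-entropy.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Small public model used only to illustrate the objective
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
# For causal LM, the labels are the input ids; the model shifts them internally
# so each position is trained to predict the token that follows it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # average next-token cross-entropy over the sequence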
Fine-tuning builds upon a pre-trained model by further training it on a smaller, task-specific dataset. This process adapts the model's existing knowledge to specific downstream tasks while preserving the foundational understanding gained during pre-training. Traditional fine-tuning updates all model parameters, which requires substantial computational resources - typically 160-192GB of GPU memory for a 7B parameter model.
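A rough back-of-the-envelope calculation (a sketch assuming mixed-precision training with AdamW, before activations and framework overhead) shows where numbers of this magnitude come from:

params = 7e9  # 7B parameter model

# Approximate per-parameter cost of full fine-tuning with mixed precision:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4)
# + Adam first moment (4) + Adam second moment (4) = 16 bytes
bytes_per_param = 2 + 2 + 4 + 4 + 4

print(params * bytes_per_param / 1e9)  # ~112 GB before activations and overhead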
The fine-tuning process is particularly effective because it leverages the knowledge embedded in the pre-trained weights. For example, the LIMA paper demonstrated that fine-tuning a 65B parameter LLaMA model on just 1,000 carefully curated instruction-response pairs could produce responses that human evaluators often judged comparable to those of RLHF-trained models such as DaVinci003. However, this approach requires careful dataset curation and sufficient GPU resources to handle the full model parameters.
There are two primary approaches:
Supervised Fine-Tuning (SFT): the model is trained directly on labeled prompt-response pairs using the same next-token prediction objective as pre-training, teaching it to follow instructions and produce outputs in the desired format.
Reinforcement Learning from Human Feedback (RLHF): a reward model is first trained on human preference rankings of model outputs, and the language model is then optimized against that reward signal (typically with an algorithm such as PPO) to align its behavior with human preferences.
Parameter-efficient fine-tuning techniques have revolutionized how we adapt large language models by dramatically reducing computational requirements while maintaining performance. Rather than updating all model parameters during fine-tuning, PEFT methods strategically modify only a small subset of parameters or introduce a limited number of new trainable parameters.
The most prominent PEFT technique is Low-Rank Adaptation (LoRA), which has become the de facto standard for efficient model adaptation. LoRA works by adding small trainable rank decomposition matrices to specific layers of the model while keeping the pre-trained weights frozen. This approach can reduce the number of trainable parameters by up to 10,000 times and GPU memory requirements by over 3 times compared to full fine-tuning.
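Conceptually, LoRA replaces a dense update to a frozen weight matrix W with the product of two small matrices. A minimal sketch of the idea (dimensions chosen only for illustration):

import torch

d, r = 4096, 8                   # hidden size and LoRA rank (illustrative values)
W = torch.randn(d, d)            # frozen pre-trained weight, never updated
A = torch.randn(r, d) * 0.01     # trainable, initialized with small random values
B = torch.zeros(d, r)            # trainable, initialized to zero so the update starts at 0
alpha = 16

delta_W = (alpha / r) * (B @ A)  # low-rank update with the same shape as W
W_effective = W + delta_W        # what the adapted layer effectively computes

# Trainable parameters: 2 * d * r = 65,536 versus d * d = 16,777,216 for the full matrix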
When combined with 4-bit quantization techniques (QLoRA), the memory savings become even more dramatic. QLoRA makes it possible to fine-tune very large models on a single GPU - for example, a 33B parameter model on a 24GB consumer card like an NVIDIA RTX 3090, or a 65B parameter model on a single 48GB GPU - a task that would traditionally require 16 or more A100-80GB GPUs. QLoRA achieves this through several key innovations: a 4-bit NormalFloat (NF4) data type for storing the frozen base weights, double quantization that also compresses the quantization constants themselves, and paged optimizers that spill optimizer states to CPU memory to absorb memory spikes.
The effectiveness of QLoRA stems from its ability to maintain full 16-bit fine-tuning task performance while drastically reducing memory requirements. This makes it possible to fine-tune large models on a single GPU that would normally require a cluster of high-end GPUs.
LoRA Rank (r)
The LoRA rank determines the dimensionality of the low-rank decomposition matrices - in other words, the rank of the update added to each targeted weight matrix. A higher rank allows the model to learn more complex adaptations but requires more memory and computation. A rank of 8 is a widely used starting point (the original LoRA paper found that surprisingly small ranks are often sufficient), though the optimal value depends on your specific task and dataset complexity. For more complex tasks or datasets, you may want to increase the rank, keeping in mind that higher ranks lead to higher computational requirements.
LoRA Alpha
The alpha parameter acts as a scaling factor that determines how much influence the LoRA adaptations have compared to the frozen pre-trained weights: the low-rank update is multiplied by alpha / r before being added to the layer, so the ratio between alpha and rank controls the magnitude of updates during training. Typically alpha is set to 2x the rank value as a starting point. This scaling helps ensure stable training while allowing meaningful updates to occur.
Target Modules
The selection of target modules in LoRA represents a critical architectural decision that directly impacts both the model's adaptability and computational efficiency. In transformer-based architectures, these modules consist of various projection matrices and components that handle different aspects of the model's processing pipeline.
The embedding layer (embed_tokens) serves as the model's initial interface with input tokens, transforming discrete token IDs into continuous vector representations. While it's possible to include embeddings as a LoRA target, this is generally discouraged as it can significantly increase memory usage without proportional gains in model performance. The embedding layer typically contains a large number of parameters due to vocabulary size, making it less efficient for LoRA adaptation.
The normalization layers (norm) help stabilize the network's internal representations by standardizing activation values. These layers contain relatively few parameters and are crucial for maintaining stable training dynamics. However, they are rarely targeted for LoRA adaptation because their role is primarily statistical normalization rather than learning complex patterns. Including norm layers in LoRA targets typically offers minimal benefit while potentially destabilizing training.
The language modeling head (lm_head) is responsible for converting the model's internal representations back into vocabulary-sized logits for token prediction. While this layer is crucial for the final output, including it as a LoRA target is generally unnecessary. The lm_head often shares weights with the embedding layer through weight tying, and adapting it separately can break this symmetry without providing significant benefits.
The core attention mechanism components remain the most effective targets for LoRA adaptation. The query projection matrix (q_proj) transforms input embeddings into query vectors, determining how the model searches for relevant information within its context. The key projection matrix (k_proj) creates key vectors that help establish relationships between different parts of the input, while the value projection matrix (v_proj) transforms the input into value vectors that contain the actual information to be extracted. These three projections form the cornerstone of the self-attention mechanism, with q_proj and v_proj often being the most crucial targets for adaptation.
The output projection matrix (o_proj) processes the combined attention outputs before they move to subsequent layers, ensuring the attention mechanism's output maintains compatibility with the model's broader architecture. Many architectures, including Llama, also expose the feed-forward projections: the upward projection (up_proj) expands the hidden dimension inside each MLP block and the downward projection (down_proj) contracts it back, and these can also be added as LoRA targets.
When implementing LoRA, it's recommended to start with a focused approach targeting just the attention components:
lora_config = LoraConfig(
    target_modules=["q_proj", "v_proj"],
    r=8,
    lora_alpha=16,
    lora_dropout=0.1
)
For more demanding tasks or when initial results aren't satisfactory, you can expand to include additional projection matrices:
lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    r=16,
    lora_alpha=32,
    lora_dropout=0.1
)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import torch
# QLoRA parameters
lora_config = LoraConfig(
    r=64,              # LoRA attention dimension
    lora_alpha=16,     # LoRA scaling factor
    lora_dropout=0.1,  # Dropout probability
    bias="none",
    task_type="CAUSAL_LM"
)

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
Understanding batch size and epochs is fundamental to training large language models effectively. A sample represents a single input-output pair in your training dataset, such as an instruction and its corresponding response. These samples are grouped into batches for more efficient processing during training. The batch size determines how many samples the model processes simultaneously before updating its weights, typically ranging from 1 to 512 depending on available GPU memory and training requirements.
A training epoch represents one complete pass through your entire training dataset. For example, if you have 10,000 training samples and a batch size of 32, one epoch would consist of approximately 313 batch updates (10,000 ÷ 32, rounded up). The number of epochs determines how many times the model will see each training sample during the entire training process. For fine-tuning tasks, 1-3 epochs is often sufficient as the model already has strong foundational knowledge from pre-training.
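The same arithmetic in code (a trivial sketch using the numbers from the example above):

import math

num_samples = 10_000
batch_size = 32
num_epochs = 3

steps_per_epoch = math.ceil(num_samples / batch_size)  # 313 weight updates per epoch
total_steps = steps_per_epoch * num_epochs             # 939 updates over the full run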
The relationship between batch size and epochs significantly impacts training dynamics. Larger batch sizes enable more efficient parallel processing but may require adjusting other hyperparameters like learning rate to maintain training stability. They also provide more stable gradient estimates but might require more epochs to achieve the same level of model performance. Conversely, smaller batch sizes introduce more noise into the training process, which can sometimes help the model generalize better, but they require more update steps to complete an epoch.
When fine-tuning language models, a common starting point is to use a batch size of 8 or 16 with gradient accumulation to simulate larger batches, combined with 1-3 training epochs. This configuration often provides a good balance between training stability, memory efficiency, and final model performance. The exact values should be adjusted based on your specific hardware constraints and training objectives.
Gradient accumulation is a powerful technique that helps overcome GPU memory limitations when training large models. Rather than updating model parameters after each batch, gradient accumulation allows us to process multiple smaller batches sequentially while accumulating their gradients, effectively simulating a larger batch size.
When using gradient accumulation, the training process is modified to accumulate gradients over multiple forward and backward passes before applying a single weight update. For example, if you want an effective batch size of 32 but can only fit 8 samples in memory, you would set the gradient accumulation steps to 4. This means the model will run a forward and backward pass on each of the four mini-batches of 8 samples, sum their gradients in place, and only then apply a single optimizer step and reset the gradients.
The key insight is that this produces mathematically equivalent results to training with the larger batch size of 32, while requiring only enough memory to process 8 samples at a time.
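For intuition, here is what gradient accumulation looks like as a hand-written PyTorch loop (a sketch that assumes model, optimizer, and dataloader are already defined and that the model returns a mean-reduced loss):

accumulation_steps = 4

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    # Scale the loss so the accumulated gradient matches the large-batch gradient
    loss = outputs.loss / accumulation_steps
    loss.backward()                      # gradients are summed into the .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                 # one weight update per 4 micro-batches
        optimizer.zero_grad()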
With the Hugging Face Trainer, you don't need to write this loop yourself; gradient accumulation is enabled through the training arguments:
training_args = TrainingArguments(
    per_device_train_batch_size=8,    # Physical batch size per GPU
    gradient_accumulation_steps=4,    # Number of forward passes before each update
    gradient_checkpointing=True,      # Additional memory optimization
)
The effective batch size can be calculated as:
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
Gradient accumulation provides significant memory savings because only the activations for the small physical batch need to be held in memory at any one time, while the gradients are summed into a single set of buffers that is never larger than the model's parameters themselves.
This makes it possible to train models that would otherwise be too large for available GPU memory. For example, fine-tuning a 7B parameter model might require 32GB of GPU memory with standard training, but could work on a 16GB GPU using gradient accumulation.
When selecting the number of gradient accumulation steps, consider the effective batch size you are targeting, the largest physical batch that fits in GPU memory, and the fact that more accumulation steps mean fewer (but larger) weight updates per epoch.
A good starting point is to choose accumulation steps that result in an effective batch size between 32 and 512, while staying within memory limits. For example, with a physical batch size of 4, you might use 8-32 accumulation steps depending on your specific needs.
While gradient accumulation helps overcome memory constraints, it can affect training dynamics in several ways: weight updates happen less frequently, the larger effective batch produces smoother and less noisy gradients, and hyperparameters tuned for the smaller physical batch size - especially the learning rate - may no longer be appropriate.
To maintain training stability, consider adjusting the learning rate using the square root scaling rule:
adjusted_lr = base_lr * sqrt(effective_batch_size / base_batch_size)
The learning rate schedule plays a crucial role in model training, determining how the learning rate changes throughout the training process. Different schedules offer various trade-offs between training stability, convergence speed, and final model performance.
The linear learning rate schedule gradually decreases the learning rate from an initial value to a final value in a straight line. This simple approach works well for many fine-tuning tasks, especially when the number of training steps is relatively small. The linear schedule provides a good balance between early exploration with higher learning rates and final convergence with lower rates.
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    num_train_epochs=3
)
The warmup_ratio parameter determines what fraction of the training steps will use a gradually increasing learning rate before the linear decay begins. This helps prevent unstable updates early in training when gradients might be large or noisy.
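Concretely, the number of warmup steps is just the ratio multiplied by the total number of optimizer steps (a small sketch reusing the dataset size from earlier):

import math

num_samples, batch_size, num_epochs = 10_000, 32, 3
warmup_ratio = 0.03

total_steps = math.ceil(num_samples / batch_size) * num_epochs  # 939
warmup_steps = int(warmup_ratio * total_steps)                  # ~28 steps of increasing LR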
The cosine learning rate schedule follows a cosine curve, starting from the initial learning rate and smoothly decreasing to near zero. This schedule provides a more gradual reduction in learning rate compared to linear decay, which can help models converge to better solutions. The cosine schedule is particularly effective for longer training runs where you want to explore the loss landscape thoroughly before settling into a minimum.
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=3
)
The smooth nature of the cosine schedule means that the model spends more time at moderate learning rates compared to linear decay, which can improve generalization performance.
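The shape of the decay after warmup follows a half-cosine curve; a minimal sketch of the formula most cosine schedulers implement:

import math

def cosine_lr(step, total_steps, base_lr=2e-4):
    """Learning rate after warmup under a standard cosine decay."""
    progress = step / total_steps
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 939))    # 2e-4 at the start
print(cosine_lr(470, 939))  # roughly half the base rate at the midpoint
print(cosine_lr(939, 939))  # ~0 at the end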
Cosine with restarts (also known as warm restarts) adds periodic "jumps" to the cosine schedule, temporarily increasing the learning rate before allowing it to decay again. This approach can help the model escape poor local minima and explore different regions of the loss landscape. Each restart provides an opportunity for the model to discover better solutions while maintaining the benefits of the cosine schedule.
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.03,
    # TrainingArguments has no top-level num_cycles field; recent transformers
    # versions forward scheduler-specific options via lr_scheduler_kwargs
    lr_scheduler_kwargs={"num_cycles": 2},  # Number of restart cycles
    num_train_epochs=3
)
The num_cycles parameter controls how many restarts occur during training. Each cycle completes a full cosine decay before resetting the learning rate. This schedule is particularly useful for complex tasks where the loss landscape might have many local minima.
When selecting a learning rate schedule, consider these factors: the length of the training run, the size and difficulty of your dataset, and how much stability you need. Linear decay is a solid default for short fine-tunes, cosine tends to generalize slightly better on longer runs, and restarts are worth trying when training appears stuck in a poor region of the loss landscape.
A good starting point for most fine-tuning tasks is:
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=3,
    warmup_steps=100,   # when set above 0, this takes precedence over warmup_ratio
    max_steps=-1        # -1 means use num_train_epochs instead
)
The learning rate is perhaps the most critical hyperparameter in training language models, acting as a scaling factor that determines how much the model weights should be adjusted in response to the calculated gradients during backpropagation. When using Stochastic Gradient Descent (SGD) or its variants like AdamW, the learning rate directly influences how large of a step the optimizer takes in the direction that reduces the loss function.
A learning rate that's too high can cause the training process to overshoot optimal weight values, leading to unstable training or convergence to suboptimal solutions. This manifests as erratic loss curves or even numerical instability in extreme cases. Conversely, a learning rate that's too low results in very slow training progress, where the model takes tiny steps towards better solutions and might get stuck in local minima or fail to converge within the allocated training time.
For language model fine-tuning, learning rates typically fall between 1e-5 and 5e-4, with 2e-4 being a common starting point when using LoRA. The optimal learning rate often depends on several factors: the model size, the effective batch size, whether you are updating all weights or only adapter parameters, and how similar the fine-tuning data is to the pre-training distribution.
When using LoRA, you can often use higher learning rates than with full fine-tuning because you're only updating a small subset of parameters. A typical configuration might look like:
training_args = TrainingArguments(
    learning_rate=2e-4,         # Higher than full fine-tuning
    weight_decay=0.01,          # L2 regularization
    max_grad_norm=0.3,          # Gradient clipping threshold
    optim="paged_adamw_32bit"   # Memory-efficient optimizer
)
The relationship between learning rate and batch size follows what's known as the "linear scaling rule": when you increase the batch size by a factor of k, you should generally increase the learning rate by the same factor to maintain similar training dynamics. However, this rule begins to break down at very large batch sizes, where the square root scaling rule often works better:
import math

base_lr = 2e-4
base_batch_size = 32
new_batch_size = 128

# Square root scaling
new_lr = base_lr * math.sqrt(new_batch_size / base_batch_size)  # 4e-4
The AdamW optimizer has become the de facto standard for training and fine-tuning large language models, combining the benefits of Adam (Adaptive Moment Estimation) with proper weight decay regularization. AdamW extends the traditional Adam optimizer by decoupling weight decay from gradient updates, which leads to better generalization performance, especially in large neural networks.
At its core, AdamW maintains two moments (moving averages) for each parameter: the first moment represents the mean of past gradients, while the second moment tracks the uncentered variance of past gradients. These moments help adapt the learning rate for each parameter individually, making the optimizer particularly effective for training deep neural networks with parameters that require different scales of updates.
The key innovation of AdamW over standard Adam is its handling of weight decay. In traditional Adam, weight decay is implemented as part of the gradient update, which can lead to suboptimal regularization. AdamW applies weight decay directly to the weights before the gradient update, ensuring proper L2 regularization. This seemingly small change has significant implications for model performance, particularly in large language models where proper regularization is crucial for preventing overfitting.
The update rule for AdamW can be broken down into several steps:
# Simplified AdamW update (conceptual implementation)
def adamw_update(params, grads, exp_avg, exp_avg_sq,
                 lr, beta1, beta2, weight_decay, eps, step):
    for param, grad, m, v in zip(params, grads, exp_avg, exp_avg_sq):
        # Decoupled weight decay: shrink the weights directly, not via the gradient
        param.data = param.data * (1 - lr * weight_decay)
        # Update first and second moment estimates
        m.data = beta1 * m.data + (1 - beta1) * grad
        v.data = beta2 * v.data + (1 - beta2) * grad * grad
        # Bias correction compensates for the zero-initialized moments
        m_hat = m.data / (1 - beta1 ** step)
        v_hat = v.data / (1 - beta2 ** step)
        # Parameter update with per-parameter adaptive step size
        param.data -= lr * m_hat / (torch.sqrt(v_hat) + eps)
When configuring AdamW for language model fine-tuning, typical hyperparameters include a weight decay value between 0.01 and 0.1, and beta values of 0.9 and 0.999 for the first and second moments respectively. The optimizer is commonly implemented with the following configuration:
training_args = TrainingArguments(
    optim="paged_adamw_32bit",  # Memory-efficient AdamW implementation
    weight_decay=0.01,          # L2 regularization factor
    adam_beta1=0.9,             # Exponential decay rate for first moment
    adam_beta2=0.999,           # Exponential decay rate for second moment
    adam_epsilon=1e-8,          # Small constant for numerical stability
)
The "paged" variant of AdamW (paged_adamw_32bit) is particularly useful for fine-tuning large language models: it keeps the usual 32-bit optimizer states but allocates them in paged memory that can spill from the GPU to CPU RAM when memory spikes occur, avoiding out-of-memory errors while maintaining the benefits of AdamW's adaptive learning rates and proper weight decay.
For most fine-tuning runs, the length of training is expressed in epochs rather than steps, with max_steps left at its default of -1 so that num_train_epochs is honored:

training_args = TrainingArguments(
    num_train_epochs=1,
    max_steps=-1,   # -1 means train for the full number of epochs
)
Understanding how to interpret learning curves is crucial for successful model fine-tuning. The learning curve shows how the model's loss changes over time during training, typically plotting both training and validation loss against the number of training steps or epochs. These curves provide valuable insights into how well the model is learning and whether it's experiencing common training issues.
A well-behaved learning curve typically shows both training and validation loss decreasing smoothly over time, eventually plateauing at similar values. The initial steep decline represents the model quickly learning the most obvious patterns in the data. As training progresses, the improvements become more gradual as the model fine-tunes its understanding of more subtle patterns. A small gap between training and validation loss suggests the model is generalizing well to unseen data.
Underfitting occurs when the model fails to learn the underlying patterns in the training data effectively. This manifests in learning curves as persistently high loss values for both training and validation sets, with minimal improvement over time. The curves might appear flat or show very slow improvement, indicating the model lacks the capacity to capture the complexity of the task, or the learning rate might be too low for effective training. In such cases, consider increasing the model's capacity, adjusting the learning rate, or training for more epochs.
Overfitting, conversely, occurs when the model memorizes the training data rather than learning generalizable patterns. The learning curves reveal this through a characteristic divergence: while the training loss continues to decrease, the validation loss begins to increase or plateau at a higher value. This growing gap between training and validation performance indicates the model is becoming too specialized to the training data at the expense of generalization. Common remedies include introducing regularization, reducing model capacity, or implementing early stopping when the validation loss begins to increase.
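One practical way to implement early stopping with the Hugging Face Trainer is its built-in EarlyStoppingCallback. The sketch below assumes periodic evaluation is configured (the argument is named evaluation_strategy in the transformers versions this article targets; newer releases rename it to eval_strategy):

from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",       # evaluate periodically so the callback has a signal
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Stop training if eval_loss fails to improve for 3 consecutive evaluations;
# the callback is passed to the Trainer/SFTTrainer via callbacks=[early_stopping]
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)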
The rate of convergence in the learning curves can also provide insights into the appropriateness of your learning rate. If the loss decreases too slowly, your learning rate might be too low, resulting in inefficient training. Conversely, if the loss shows high volatility or sudden spikes, the learning rate might be too high, causing unstable training. The ideal learning curve shows steady, consistent improvement without excessive fluctuations.
Understanding and interpreting evaluation metrics is essential for assessing model performance during and after fine-tuning. The most fundamental metric is the loss function, which measures how far the model's predictions deviate from the ground truth. In language models, this is typically the cross-entropy loss between the predicted token probabilities and the actual next tokens. The training loss represents this measurement on the training data, while the evaluation loss (eval_loss) measures the same on a held-out validation set. A decreasing trend in both metrics indicates the model is learning, while a divergence between them may signal overfitting.
Perplexity, derived directly from the cross-entropy loss through an exponential transformation (e^loss), provides a more interpretable metric for language modeling tasks. It can be understood as how "confused" the model is when predicting the next token - lower values indicate better performance. For example, a perplexity of 10 means the model is as uncertain as if it were choosing between 10 equally likely options at each step. While base models might start with perplexities in the 15-30 range, well-fine-tuned models often achieve perplexities below 10 on their target domain.
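Because perplexity is just the exponential of the cross-entropy loss, it is trivial to compute from the Trainer's reported eval_loss (a one-line sketch):

import math

eval_loss = 2.0                   # cross-entropy (in nats) reported by the Trainer
perplexity = math.exp(eval_loss)  # ~7.39: like choosing among ~7 equally likely tokens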
For instruction-following and chat models, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores help evaluate the similarity between generated responses and reference answers. ROUGE-1 and ROUGE-2 measure overlap in unigrams and bigrams respectively, while ROUGE-L considers the longest common subsequence. These metrics are particularly useful for assessing how well the model captures key information and maintains coherent phrasing, though they shouldn't be relied upon exclusively as they don't always correlate with human judgments of quality.
BLEU (Bilingual Evaluation Understudy) scores, while originally designed for machine translation, can provide additional insight into the precision of generated text. BLEU scores range from 0 to 1, measuring n-gram overlap between generated and reference texts with a penalty for length mismatches. However, BLEU scores should be interpreted cautiously for instruction-following tasks, as multiple valid responses might use different but equally appropriate phrasing.
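Both metrics can be computed with the Hugging Face evaluate library (a sketch that assumes the evaluate, rouge_score, and nltk packages are installed; the example strings are purely illustrative):

import evaluate

predictions = ["The model was fine-tuned on one thousand curated examples."]
references = ["The model was fine-tuned on 1,000 curated examples."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))
# -> {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
# -> {'bleu': ..., ...}  BLEU ranges from 0 to 1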
The Hugging Face Trainer automatically logs these metrics during training, making them accessible through the training_logs directory or via integration with tracking platforms like Weights & Biases. When analyzing these metrics, it's important to consider their trends rather than absolute values, as the appropriate ranges can vary significantly depending on the task, dataset, and model architecture. A sudden spike in eval_loss or perplexity often indicates a training issue that needs attention, while gradual improvement across all metrics suggests healthy learning progression.
"""
llm_finetuner.py - A module for fine-tuning Large Language Models using QLoRA
https://stephendiehl.com/posts/training-llms
"""
import os
import logging
from dataclasses import dataclass
from typing import Optional, Dict, Any
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
logging as transformers_logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
@dataclass
class QLoRAConfig:
    """Configuration for QLoRA fine-tuning"""
    r: int = 64                     # LoRA attention dimension
    alpha: int = 16                 # Alpha parameter for LoRA scaling
    dropout: float = 0.1            # Dropout probability for LoRA layers

    # bitsandbytes parameters
    use_4bit: bool = True
    compute_dtype: str = "float16"  # "float16" or "bfloat16"
    quant_type: str = "nf4"         # "fp4" or "nf4"
    use_nested_quant: bool = False  # Use nested quantization

    def get_compute_dtype(self) -> torch.dtype:
        """Convert string dtype to torch.dtype"""
        return getattr(torch, self.compute_dtype)
@dataclass
class TrainingConfig:
    """Configuration for training parameters"""
    output_dir: str = "./results"
    num_train_epochs: int = 1
    fp16: bool = False
    bf16: bool = False
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 4
    gradient_accumulation_steps: int = 1
    gradient_checkpointing: bool = True
    max_grad_norm: float = 0.3
    learning_rate: float = 2e-4
    weight_decay: float = 0.001
    optim: str = "paged_adamw_32bit"
    lr_scheduler_type: str = "constant"
    max_steps: int = -1
    warmup_ratio: float = 0.03
    group_by_length: bool = True
    save_steps: int = 25
    logging_steps: int = 25
    max_seq_length: Optional[int] = None
    packing: bool = False
    device_map: Optional[Dict[str, Any]] = None

    def __post_init__(self):
        if self.device_map is None:
            self.device_map = {"": 0}  # Default to first GPU
class LLMFinetuner:
    """Main class for fine-tuning Large Language Models"""

    def __init__(
        self,
        model_name: str,
        dataset_name: str,
        new_model_name: str,
        qlora_config: Optional[QLoRAConfig] = None,
        training_config: Optional[TrainingConfig] = None,
    ):
        self.model_name = model_name
        self.dataset_name = dataset_name
        self.new_model_name = new_model_name
        self.qlora_config = qlora_config or QLoRAConfig()
        self.training_config = training_config or TrainingConfig()
        self.model = None
        self.tokenizer = None
        self.trainer = None

        # Suppress unnecessary warnings
        transformers_logging.set_verbosity_error()
        logging.getLogger("torch.distributed.distributed_c10d").setLevel(logging.ERROR)
    def _load_dataset(self):
        """Load and preprocess the dataset"""
        return load_dataset(self.dataset_name, split="train")
    def _setup_model_and_tokenizer(self):
        """Setup the model and tokenizer with proper configurations"""
        # Configure quantization
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=self.qlora_config.use_4bit,
            bnb_4bit_quant_type=self.qlora_config.quant_type,
            bnb_4bit_compute_dtype=self.qlora_config.get_compute_dtype(),
            bnb_4bit_use_double_quant=self.qlora_config.use_nested_quant,
        )

        # Load base model
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=bnb_config,
            device_map=self.training_config.device_map,
            trust_remote_code=True
        )
        self.model.config.use_cache = False
        self.model.config.pretraining_tp = 1

        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            trust_remote_code=True
        )
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"
    def _setup_lora_config(self):
        """Setup LoRA configuration"""
        return LoraConfig(
            lora_alpha=self.qlora_config.alpha,
            lora_dropout=self.qlora_config.dropout,
            r=self.qlora_config.r,
            bias="none",
            task_type="CAUSAL_LM",
        )
    def _setup_training_arguments(self):
        """Setup training arguments"""
        return TrainingArguments(
            output_dir=self.training_config.output_dir,
            num_train_epochs=self.training_config.num_train_epochs,
            per_device_train_batch_size=self.training_config.per_device_train_batch_size,
            gradient_accumulation_steps=self.training_config.gradient_accumulation_steps,
            optim=self.training_config.optim,
            save_steps=self.training_config.save_steps,
            logging_steps=self.training_config.logging_steps,
            learning_rate=self.training_config.learning_rate,
            weight_decay=self.training_config.weight_decay,
            fp16=self.training_config.fp16,
            bf16=self.training_config.bf16,
            max_grad_norm=self.training_config.max_grad_norm,
            max_steps=self.training_config.max_steps,
            warmup_ratio=self.training_config.warmup_ratio,
            group_by_length=self.training_config.group_by_length,
            lr_scheduler_type=self.training_config.lr_scheduler_type,
            report_to="tensorboard"
        )
    def train(self):
        """Execute the full training pipeline"""
        dataset = self._load_dataset()
        self._setup_model_and_tokenizer()

        # Setup trainer
        self.trainer = SFTTrainer(
            model=self.model,
            train_dataset=dataset,
            peft_config=self._setup_lora_config(),
            dataset_text_field="text",
            max_seq_length=self.training_config.max_seq_length,
            tokenizer=self.tokenizer,
            args=self._setup_training_arguments(),
            packing=self.training_config.packing,
        )

        # Train the model
        self.trainer.train()

        # Save the trained model
        self.trainer.model.save_pretrained(self.new_model_name)
    def merge_and_save(self):
        """Merge LoRA weights and save the final model"""
        # Load base model in FP16
        base_model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            low_cpu_mem_usage=True,
            return_dict=True,
            torch_dtype=torch.float16,
            device_map=self.training_config.device_map,
        )

        # Merge weights
        model = PeftModel.from_pretrained(base_model, self.new_model_name)
        model = model.merge_and_unload()

        # Save merged model
        model.push_to_hub(self.new_model_name, use_temp_dir=False)
        self.tokenizer.push_to_hub(self.new_model_name, use_temp_dir=False)
To invoke the module:
# Example usage
if __name__ == "__main__":
    # Initialize configs
    qlora_config = QLoRAConfig(
        r=64,
        alpha=16,
        dropout=0.1,
        use_4bit=True
    )

    training_config = TrainingConfig(
        output_dir="./results",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=2e-4
    )

    # Initialize fine-tuner
    finetuner = LLMFinetuner(
        model_name="NousResearch/llama-2-7b-chat-hf",
        dataset_name="mlabonne/guanaco-llama2-1k",
        new_model_name="llama-2-7b-miniguanaco",
        qlora_config=qlora_config,
        training_config=training_config
    )

    # Train the model
    finetuner.train()

    # Merge and save the final model
    finetuner.merge_and_save()