Setting up an Nvidia GH200 for Development
Setting up an Nvidia GH200 from scratch is a fickle process. I eventually got it working with CUDA 12.4 and FlashAttention-2, but it took some trial and error. Here's what I did:
The GH200 performs optimally with a 64K kernel page size. Nvidia provides specialized kernel packages for Ubuntu systems. First, remove the existing kernel packages and install the Nvidia 64K kernel:
sudo DEBIAN_FRONTEND=noninteractive apt purge linux-image-$(uname -r) \
linux-headers-$(uname -r) linux-modules-$(uname -r) -y
sudo apt update
sudo apt install linux-nvidia-64k-hwe-22.04 -y
sudo reboot now
After the system reboots, verify you're using the correct kernel:
uname -r
The output should be a kernel release containing nvidia-64k (the exact version depends on the image).
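As an additional sanity check, the 64K kernel should report a 64 KiB page size:
getconf PAGESIZE
# Expected output: 65536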
Now let's update the base system (the remaining commands assume a root shell; prefix them with sudo otherwise):
apt-get update
The GH200 requires specific NVIDIA drivers and the Mellanox MLNX_OFED drivers. Install these packages in the order shown; the order is extremely important here!
apt-get install -y mlnx-fw-updater mlnx-ofed-all
apt-get install -y cuda-drivers-555 nvidia-kernel-open-555 linux-tools-$(uname -r)
apt-get install -y cuda-toolkit nvidia-container-toolkit
The NVIDIA persistence daemon needs to be configured to run in persistence mode.
sudo mkdir -p /lib/systemd/system/nvidia-persistenced.service.d
sudo dd status=none of=/lib/systemd/system/nvidia-persistenced.service.d/override.conf << EOF
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --persistence-mode --verbose
[Install]
WantedBy=multi-user.target
EOF
systemctl enable nvidia-persistenced --now
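To confirm the daemon came up, check the service state; once the driver stack is loaded, nvidia-smi should also show Persistence-M as On:
systemctl is-active nvidia-persistenced   # should print: active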
Configure CUDA paths by creating a new profile script:
echo "# Library Path for Nvidia CUDA" >> /etc/profile.d/cuda.sh
echo export LD_LIBRARY_PATH=/usr/local/cuda/lib64:'$LD_LIBRARY_PATH' >> /etc/profile.d/cuda.sh
echo export PATH=$HOME/.local/bin:/usr/local/cuda/bin:'$PATH' >> /etc/profile.d/cuda.sh
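To pick up these paths without logging out, source the script in the current shell and check that the toolkit resolves correctly:
source /etc/profile.d/cuda.sh
which nvcc       # should resolve to /usr/local/cuda/bin/nvcc
nvcc --version   # should report the toolkit version installed above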
For optimal performance, disable the IRQ balance service and configure NUMA settings:
systemctl disable irqbalance
echo "kernel.numa_balancing = 0" >> /etc/sysctl.conf
sysctl -p
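Both settings can be verified immediately:
systemctl is-enabled irqbalance       # should print: disabled
cat /proc/sys/kernel/numa_balancing   # should print: 0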
Now let's enable peer memory access by adding the NVIDIA peer memory module to the system:
echo "nvidia-peermem" >> /etc/modules-load.d/nvidia-peermem.conf
Load the module now (it will be loaded automatically on subsequent boots) and verify it's present:
sudo modprobe nvidia-peermem
lsmod | grep nvidia_peermem
After completing all configurations, disable the first-boot service and reboot your system:
systemctl disable first-boot.service
reboot
After the system reboots, verify your installation by running:
nvidia-smi
This command should display information about your GH200, including the GPU model, driver version, and current utilization.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 480GB On | 00000009:01:00.0 Off | 0 |
| N/A 33C P0 76W / 900W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
You can inspect the system topology to understand how data travels between GPUs and networking components:
nvidia-smi topo -m
This command shows the connectivity matrix between GPUs and other PCIe devices. For a GH200, you should see something like:
GPU0 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE 0-71 1
NIC0 NODE X PIX
NIC1 NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
Setting up vLLM with Flash Attention
First, update Python package management tools:
python -m pip install -U pip
python -m pip install -U setuptools wheel
Install PyTorch with CUDA 12.4 support:
pip install torch --index-url https://download.pytorch.org/whl/cu124
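Before building anything on top of it, it's worth confirming that this build actually sees the GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expect a +cu124 build string, True, and a GH200 device name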
Install Flash Attention:
pip install flash-attn --no-build-isolation
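The flash-attn install compiles CUDA extensions and can take a while; a quick import check confirms it built against the PyTorch installed above:
python -c "import flash_attn; print(flash_attn.__version__)"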
Clone and build vLLM with Flash Attention support:
git clone https://github.com/vllm-project/vllm.git
cd vllm
sudo docker build --target build -t vllm_build .
container_id=$(sudo docker create --name vllm_temp vllm_build:latest)
sudo docker cp ${container_id}:/workspace/dist .
# Install vLLM with Flash Attention
pip install vllm-flash-attn
pip install dist/vllm-*.whl  # exact wheel name depends on the build (aarch64 on a GH200)
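A quick import check confirms the wheel installed cleanly:
python -c "import vllm; print(vllm.__version__)"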
Install xformers from source:
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
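xformers ships a diagnostic entry point that reports which kernels were built and for which GPU architecture, which is a convenient way to verify the source build:
python -m xformers.info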
Set up your environment variables by adding the following to your ~/.bashrc:
export HF_TOKEN=<HF-TOKEN>
export HF_HOME="/home/ubuntu/.cache/huggingface"
export MODEL_REPO=meta-llama/Meta-Llama-3.1-70B-Instruct
Apply the changes:
source ~/.bashrc
Launch the vLLM API server with the following command:
vllm serve $MODEL_REPO --dtype auto
You can now send requests to the API endpoint at http://localhost:8000/v1/completions.
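For example, a minimal completion request in the OpenAI format (the model field must match the model being served, here $MODEL_REPO):
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
        "prompt": "The GH200 pairs a Grace CPU with a Hopper GPU,",
        "max_tokens": 64,
        "temperature": 0
      }'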
Setting up SGLang with FlashInfer
Alternatively, we can serve the model using SGLang with FlashInfer.
# Use the last release branch
git clone -b v0.3.5.post2 https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
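A quick import check verifies that the FlashInfer wheel matches the installed torch/CUDA build:
python -c "import flashinfer; print('flashinfer imported OK')"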
Launch the SGLang server with the following command:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --port 8000
Likewise, you can send requests to the API endpoint at http://localhost:8000/v1/completions using the OpenAI API format.
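For a quick smoke test that the server is up, the OpenAI-compatible model listing endpoint works here as well:
curl http://localhost:8000/v1/models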