Step-by-Step — Compiling vLLM from Source on NVIDIA Blackwell (GB10)

About Me I am a Computer Science Engineer and Data Science Master’s graduate, currently building the future of Arabic text digitization at Kalima OCR. My work focuses on bridging the gap between cutting-edge AI research and production-ready SaaS applications. I thrive at the intersection of AI and DevOps. Whether it's training and fine-tuning models in PyTorch, orchestrating high-throughput inference on NVIDIA Blackwell (GB10) hardware, or building responsive UIs with Next.js, I enjoy the full-stack challenge of turning complex research ideas into usable, sovereign data applications. What I’m working on now: Deploying large-scale Mixture-of-Experts (MoE) models on DGX infrastructure. Optimizing LLM inference pipelines (vLLM, LiteLLM) for low-latency production environments. Building developer-centric AI portals to make high-end models accessible to product teams. I share my journey here to help other engineers navigate the messy, high-pressure world of "DevOps-for-AI." Expect deep dives into hardware, model optimization, and the occasional battle with a broken CUDA kernel.

If you've been trying to run standard vLLM Docker images on the new NVIDIA DGX Spark, you’ve probably hit a wall. Between the ARM64 (Grace) architecture, the Blackwell (GB10) GPU, and the requirements of CUDA 13, the "easy way" doesn't work yet.

To get the performance your hardware deserves, we have to compile vLLM natively. Here is the exact build guide to get it running.


1. System Dependencies

Standard Ubuntu installs often lack the headers required to compile Python-to-C++ bindings. Make sure you have the basics:

sudo apt-get update
sudo apt-get install -y gcc-12 g++-12 build-essential python3-dev
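
Before moving on, it can help to confirm that the tools actually landed on your PATH. The snippet below is a minimal sketch: `need_tool` is a hypothetical helper name I made up for this check, and the `python3-config` call assumes it was installed alongside python3-dev.

```shell
# need_tool is a hypothetical helper: report whether a command is on PATH.
need_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok:      $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the compilers and interpreter installed above.
for tool in gcc-12 g++-12 python3; do
    need_tool "$tool" || true
done

# If python3-config is available, print the include flags for the Python headers.
if command -v python3-config >/dev/null 2>&1; then
    python3-config --includes || true
fi
```

Any line starting with "missing:" means the matching apt package did not install correctly.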

2. Environment Configuration

We need to point the compiler directly to the CUDA 13 toolkit and specify the architecture list for Blackwell. Add these to your shell (or ~/.bashrc):

export CUDA_HOME=/usr/local/cuda-13.0
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# Point to GCC-12
export CC=/usr/bin/gcc-12
export CXX=/usr/bin/g++-12
export CUDAHOSTCXX=/usr/bin/g++-12

# Target architectures: 12.0 and 12.1 are Blackwell (GB10 is 12.1); 9.0 is Hopper
export TORCH_CUDA_ARCH_LIST="9.0;12.0;12.1"
export MAX_JOBS=16
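
A quick sanity check of these variables can save a failed build later. This is a sketch under assumptions: `has_nvcc` is a hypothetical helper, and the fallback path simply mirrors the CUDA_HOME value used above.

```shell
# has_nvcc is a hypothetical helper: succeeds if DIR/bin/nvcc is an executable file.
has_nvcc() {
    [ -x "$1/bin/nvcc" ]
}

# Fall back to the path from this guide if CUDA_HOME is not set in this shell.
cuda_dir="${CUDA_HOME:-/usr/local/cuda-13.0}"

if has_nvcc "$cuda_dir"; then
    # Show the toolkit release that the build will pick up.
    "$cuda_dir/bin/nvcc" --version | tail -n 2
else
    echo "nvcc not found under $cuda_dir; check CUDA_HOME" >&2
fi

# Echo the remaining build variables so a typo is easy to spot.
echo "CC=${CC:-unset} CXX=${CXX:-unset} CUDAHOSTCXX=${CUDAHOSTCXX:-unset}"
echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST:-unset} MAX_JOBS=${MAX_JOBS:-unset}"
```

If any value prints as "unset", the exports were probably added to ~/.bashrc without opening a new shell or sourcing the file.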

3. Virtual Environment Setup

Keep your build environment clean. Do not do this globally.

python3 -m venv vllm_env
source vllm_env/bin/activate

# Install build-time dependencies
pip install --upgrade pip setuptools wheel
pip install numpy ninja packaging pybind11 "setuptools>=61" "setuptools-scm>=8" cmake
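
Since everything that follows depends on the venv being active, a small guard can be useful at the top of a build script. This is only a sketch: `in_venv` is a hypothetical helper, and it relies on the activate script exporting VIRTUAL_ENV.

```shell
# in_venv is a hypothetical helper: the activate script exports VIRTUAL_ENV,
# so a non-empty value means a venv is active in this shell.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
    echo "Using venv at $VIRTUAL_ENV"
else
    echo "No venv active; run: source vllm_env/bin/activate" >&2
fi
```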

4. The "No-Build-Isolation" Build

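This is the most critical step. By default, pip builds a package in an isolated temporary environment with freshly downloaded build dependencies, so it ignores the packages we just installed in the venv and the build often fails. With --no-build-isolation, pip builds against the current environment instead. That also means every build dependency, including a PyTorch build that matches your CUDA version, must already be installed there.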

git clone https://github.com/vllm-project/vllm.git
cd vllm

# Force build in the current environment
export VLLM_TARGET_DEVICE=cuda
pip install -e . --no-build-isolation
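
The build takes a while and produces a lot of output, so I find it easier to send everything to a log file and only look at the tail when something breaks. The wrapper below is a sketch, not part of vLLM or pip: `run_logged` and the `build.log` file name are hypothetical.

```shell
# run_logged is a hypothetical helper: run a command, save all of its output
# to a log file, and on failure print the last 20 lines of that log.
run_logged() {
    rl_log="$1"
    shift
    if "$@" >"$rl_log" 2>&1; then
        rl_status=0
    else
        rl_status=$?
    fi
    if [ "$rl_status" -ne 0 ]; then
        echo "Command failed with status $rl_status; last 20 lines of $rl_log:" >&2
        tail -n 20 "$rl_log" >&2
    fi
    return "$rl_status"
}

# Usage (run from the vllm checkout, inside the venv):
#   run_logged build.log pip install -e . --no-build-isolation
```

Once the install finishes, `python -c "import vllm; print(vllm.__version__)"` is a quick way to confirm that the package imports from the venv.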

5. Stability Flags (Blackwell/GB10)

Blackwell is new enough that some optimized kernels are not yet stable on it. For production, I recommend these settings in your systemd service file to avoid CUDA kernel crashes:

[Service]
# Disable experimental V1 engine and Marlin kernels for stability
Environment="VLLM_USE_V1=0"
Environment="VLLM_USE_MARLIN=0"

ExecStart=/path/to/venv/python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --quantization gptq \
    --dtype float16 \
    --enforce-eager \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096
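
After starting the service, the model can take a while to load, so a small polling loop is handy before sending real traffic. This is a sketch under assumptions: `wait_for_api` is a hypothetical helper, and the URL in the usage line assumes the server's default port 8000 and its /v1/models endpoint.

```shell
# wait_for_api is a hypothetical helper: poll a URL until it answers with
# a successful HTTP status or the attempts run out.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_api() {
    wa_url="$1"
    wa_tries="${2:-30}"
    wa_delay="${3:-2}"
    wa_i=0
    while [ "$wa_i" -lt "$wa_tries" ]; do
        if curl --silent --fail --max-time 5 "$wa_url" >/dev/null 2>&1; then
            return 0
        fi
        wa_i=$((wa_i + 1))
        sleep "$wa_delay"
    done
    return 1
}

# Usage, assuming the default port 8000:
#   wait_for_api http://localhost:8000/v1/models && echo "vLLM is up"
```

Large models may need more attempts or a longer delay than the defaults here.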

Troubleshooting Common Errors

  • TypeError: unsupported operand type(s) for |: 'list' and 'set': Update transformers:
    pip install git+https://github.com/huggingface/transformers.git

  • ValueError: Free memory on device...: The share of GPU memory you requested is larger than what is currently free. Lower your --gpu-memory-utilization or stop other processes that are using the GPU.

  • Connection Refused: Bind the server to 0.0.0.0 instead of localhost in your service file (for example with --host 0.0.0.0) so the API port is reachable from other machines.


KaGhima’s Note: Building from source is rarely "clean" on the first try. If you get an error, check the last 20 lines of your build log. It’s usually just a missing header file or a version mismatch. Keep building, keep breaking things, and keep pushing that Blackwell GPU to the limit!