Step-by-Step — Compiling vLLM from Source on NVIDIA Blackwell (GB10)

If you've been trying to run standard vLLM Docker images on the new NVIDIA DGX Spark, you’ve probably hit a wall. Between the ARM64 (Grace) architecture, the Blackwell (GB10) GPU, and the requirements of CUDA 13, the "easy way" doesn't work yet.
To get the performance your hardware deserves, we have to compile vLLM natively. Here is the exact build guide to get it running.
1. System Dependencies
Standard Ubuntu installs often lack the headers required to compile Python-to-C++ bindings. Make sure you have the basics:
sudo apt-get update
sudo apt-get install -y gcc-12 g++-12 build-essential python3-dev
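Before moving on, it can help to confirm the packages actually landed. The following is a minimal sketch, not part of the original guide: `check_tool` is a hypothetical helper name, and the header check assumes `python3-dev` installs `Python.h` in the include directory that `sysconfig` reports.

```shell
# Report whether each build prerequisite from the apt-get step is visible.
check_tool() {
    # $1 is a command name; prints "ok" or "MISSING" next to it.
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok       $1 -> $(command -v "$1")"
    else
        echo "MISSING  $1"
    fi
}

for tool in gcc-12 g++-12 make python3; do
    check_tool "$tool"
done

# python3-dev ships Python.h; ask the interpreter where it expects the header.
if command -v python3 >/dev/null 2>&1; then
    inc_dir="$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["include"])')"
    if [ -f "$inc_dir/Python.h" ]; then
        echo "ok       Python.h in $inc_dir"
    else
        echo "MISSING  Python.h (expected in $inc_dir)"
    fi
fi
```

Any line that starts with MISSING points back at the apt-get command above.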
2. Environment Configuration
We need to point the compiler directly to the CUDA 13 toolkit and specify the architecture list for Blackwell. Add these to your shell (or ~/.bashrc):
export CUDA_HOME=/usr/local/cuda-13.0
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
# Point to GCC-12
export CC=/usr/bin/gcc-12
export CXX=/usr/bin/g++-12
export CUDAHOSTCXX=/usr/bin/g++-12
# Target architectures: 12.1 is the GB10; 9.0 (Hopper) and 12.0 are built as well
export TORCH_CUDA_ARCH_LIST="9.0;12.0;12.1"
export MAX_JOBS=16
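A quick way to check that these exports took effect in the current shell is sketched below. This is an illustration under assumptions: `report_nvcc` is a hypothetical helper, and the fallback path simply mirrors the `CUDA_HOME` value used above.

```shell
# Resolve the toolkit path the same way the exports above do.
cuda_home="${CUDA_HOME:-/usr/local/cuda-13.0}"

report_nvcc() {
    # $1 is a CUDA toolkit root; prints the nvcc release line if present.
    local nvcc="$1/bin/nvcc"
    if [ -x "$nvcc" ]; then
        "$nvcc" --version | grep -i "release"
    else
        echo "nvcc not found at $nvcc"
    fi
}

report_nvcc "$cuda_home"

# Echo the compiler and build settings so a typo is easy to spot.
echo "CC=${CC:-unset} CXX=${CXX:-unset} CUDAHOSTCXX=${CUDAHOSTCXX:-unset}"
echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST:-unset} MAX_JOBS=${MAX_JOBS:-unset}"
```

If any value prints as unset, the exports were likely added to ~/.bashrc without opening a new shell or sourcing the file.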
3. Virtual Environment Setup
Keep your build environment clean. Do not do this globally.
python3 -m venv vllm_env
source vllm_env/bin/activate
# Install build-time dependencies
pip install --upgrade pip setuptools wheel
pip install numpy ninja packaging pybind11 "setuptools>=61" "setuptools-scm>=8" cmake
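Because the next step relies on the tools inside this virtual environment, it is worth checking that the shell resolves them from the venv and not from the system. A rough sketch follows; `in_venv` is a hypothetical helper, and it assumes the pip-installed `cmake` and `ninja` place their launchers in the venv's `bin` directory.

```shell
in_venv() {
    # Succeeds when the given path lives under the active virtual environment.
    [ -n "${VIRTUAL_ENV:-}" ] && case "$1" in "$VIRTUAL_ENV"/*) return 0 ;; esac
    return 1
}

for tool in python pip cmake ninja; do
    path="$(command -v "$tool" 2>/dev/null || true)"
    if in_venv "$path"; then
        echo "venv     $tool -> $path"
    else
        echo "OUTSIDE  $tool -> ${path:-not found}"
    fi
done
```

An OUTSIDE line usually means the venv was not activated in this shell, or the tool was never installed into it.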
4. The "No-Build-Isolation" Build
This is the most critical step. By default, pip builds a package in an isolated temporary environment that ignores the packages installed in your virtual environment and pulls its own build dependencies instead. We bypass that so the build uses the environment we just set up.
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Force build in the current environment
export VLLM_TARGET_DEVICE=cuda
pip install -e . --no-build-isolation
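Once the build finishes, a short import check confirms the editable install is usable. The sketch below is one possible way to do it; `probe_module` is a hypothetical helper, and it assumes the virtual environment from step 3 is still active so that `python3` is the venv interpreter.

```shell
probe_module() {
    # $1 is a module name; prints its __version__ or reports the failure.
    python3 -c "import $1; print('$1', getattr($1, '__version__', 'unknown'))" 2>/dev/null \
        || echo "$1 import failed"
}

probe_module vllm
```

If the import fails, rerunning the pip command and reading the end of its output is usually the fastest way to find the cause.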
5. Stability Flags (Blackwell/GB10)
Blackwell is so new that standard kernels sometimes struggle. For production, I recommend these flags in your systemd service file to prevent kernel crashes:
[Service]
# Disable experimental V1 engine and Marlin kernels for stability
Environment="VLLM_USE_V1=0"
Environment="VLLM_USE_MARLIN=0"
ExecStart=/path/to/venv/python -m vllm.entrypoints.openai.api_server \
--model /path/to/model \
--quantization gptq \
--dtype float16 \
--enforce-eager \
--gpu-memory-utilization 0.85 \
--max-model-len 4096
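After the service starts, a request to the OpenAI-compatible `/v1/models` endpoint is a simple smoke test. This is a sketch under assumptions: `api_up` is a hypothetical helper, and the URL assumes the server's default port 8000 with no `--port` override in the unit file.

```shell
api_up() {
    # $1 is the server base URL; succeeds if /v1/models returns HTTP 2xx.
    curl -sf --max-time 5 "$1/v1/models" >/dev/null 2>&1
}

base_url="http://localhost:8000"   # assumed default port, no --port override
if api_up "$base_url"; then
    # Print the model list the server reports.
    curl -s "$base_url/v1/models"
else
    echo "no answer from $base_url (still loading, or bound elsewhere)"
fi
```

Model loading can take a while, so a failed first attempt right after startup does not necessarily mean the service is broken.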
Troubleshooting Common Errors
TypeError: unsupported operand type(s) for |: 'list' and 'set': Update transformers:
pip install git+https://github.com/huggingface/transformers.git
ValueError: Free memory on device...: You’ve allocated too much memory. Lower your --gpu-memory-utilization.
Connection Refused: Use 0.0.0.0 instead of localhost in your service file to ensure the API port is accessible.
KaGhima’s Note: Building from source is rarely "clean" on the first try. If you get an error, check the last 20 lines of your build log—it’s usually just a missing header file or a version mismatch. Keep building, keep breaking things, and keep pushing that Blackwell GPU to the limit!
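The "last 20 lines" advice is easier to follow if the build output is saved to a file in the first place. Here is a small sketch of that idea; `run_logged` and the `build.log` filename are hypothetical names chosen for illustration.

```shell
run_logged() {
    # $1 = log file; remaining args = the command to run.
    # Saves stdout and stderr to the log, prints the last 20 lines,
    # and returns the command's own exit status.
    local log="$1"; shift
    local status=0
    "$@" >"$log" 2>&1 || status=$?
    tail -n 20 "$log"
    return "$status"
}

# Example (run from the vllm checkout):
#   run_logged build.log pip install -e . --no-build-isolation
```

The full log stays on disk, so if the last 20 lines are not enough you can search the file for the first occurrence of "error".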

