Skip to main content

Command Palette

Search for a command to run...

From Notebook Jail to Production: Scaling LLMs on NVIDIA Blackwell

Updated
3 min read
From Notebook Jail to Production: Scaling LLMs on NVIDIA Blackwell
B
About Me I am a Computer Science Engineer and Data Science Master’s graduate, currently building the future of Arabic text digitization at Kalima OCR. My work focuses on bridging the gap between cutting-edge AI research and production-ready SaaS applications. I thrive at the intersection of AI and DevOps. Whether it's training and fine-tuning models in PyTorch, orchestrating high-throughput inference on NVIDIA Blackwell (GB10) hardware, or building responsive UIs with Next.js, I enjoy the full-stack challenge of turning complex research ideas into usable, sovereign data applications. What I’m working on now: Deploying large-scale Mixture-of-Experts (MoE) models on DGX infrastructure. Optimizing LLM inference pipelines (vLLM, LiteLLM) for low-latency production environments. Building developer-centric AI portals to make high-end models accessible to product teams. I share my journey here to help other engineers navigate the messy, high-pressure world of "DevOps-for-AI." Expect deep dives into hardware, model optimization, and the occasional battle with a broken CUDA kernel.

After finishing my PFE (Final Year Project), I became obsessed with production environments. There is a massive, often painful difference between a model that runs in a Jupyter Notebook and one that actually serves a real-world SaaS application.

Recently, I deployed a production-grade Qwen 3.5 35B MoE model on NVIDIA’s latest Blackwell (GB10) hardware, and I want to share the infrastructure you actually need if you're ready to escape "Notebook Jail."

Why leave the notebook?

Because production requires stability, security, and scalability. When users depend on your platform, you can't rely on a notebook environment that crashes when you look at it the wrong way. You need a robust, enterprise-grade stack.

The Stack Overview

This is the infrastructure I built on the DGX Spark. It’s designed to be modular, fast, and, most importantly, stable.

DGX Spark infrastructure diagram — Qwen 3.5 35B MoE on NVIDIA Blackwell

Figure — Infrastructure overview for running Qwen 3.5 35B MoE on NVIDIA Blackwell (DGX Spark).

  1. The Engine (vLLM): There are three major contenders for inference: vLLM, Ollama, and TensorRT-LLM. I’ve tested them all, and I’ve settled on vLLM for production. (Stay tuned—I’m dedicating Blog #3 to why Ollama, while great for dev, isn't always the right fit for high-concurrency production).

  2. The Gateway (LiteLLM): When you’re running multiple models (like Qwen 14B for speed and 35B for reasoning), you need a router. LiteLLM acts as the "brain," managing traffic, API keys, and routing so your frontend doesn't have to change every time you swap a model.

  3. The UI (Open WebUI): We aren't here to reinvent the wheel. We use Open WebUI to provide a polished, enterprise-ready interface that my team can use immediately.

What’s Coming Next?

This is the start of a 5-part series where I’ll break down the architecture, the specific "Blackwell hacks" I had to implement, and how to expose your model securely to your team.

Here is the roadmap:

  • Blog 1.1: Compiling vLLM from source for Blackwell (The "No Build Isolation" guide).

  • Blog 2: Orchestrating our production stack with Docker.

  • Blog 3: The truth about Ollama in Production (and why it's not enough).

  • Blog 4: Secure Access: Using Cloudflare Tunnels for your AI.

  • Blog 5: Building a custom Branded Developer Portal.


Are you currently running your models in a notebook, or have you already made the jump to production?

The Blackwell Blueprint

Part 1 of 1

A deep-dive series on deploying high-performance LLMs in production. From native vLLM compilation on NVIDIA Blackwell (GB10) to API gateways, secure tunnels, and custom developer portals. Real-world infrastructure for the modern AI engineer.