What Happens if I Hook My AI Engine to a GPU?

Watching AI from the Nosebleeds

As a casual bystander in the world of AI, it’s easy to get lost in the latest and sexiest LLM releases. Most headlines these days focus on who might win the so-called “AI race” between the US and China.

All of that is entertaining (or maybe not), but what catches my attention more and more, as an engineer watching from the nosebleed seats, is the growing discussion about the importance of inference.

The Industry is Stuck on One Word: Inference

If you’ve ever watched one of Nvidia’s recent GTC keynotes, you’ve heard Jensen Huang highlight the shift toward inference as a pivotal theme in modern AI.

He says things like:

"AI inference will be 1000x bigger than training.”
"Pre-training was hard, inference easy; now everything is hard.”

Amazon’s CEO Andy Jassy has echoed the same:

“Most of the spend today in AI has been on training. But when you get to scale of the production applications for AI, it flips. The overwhelming majority of the spend is actually on inference — the predictions of the model.”

Noam Brown, a research lead at OpenAI, has also explained:

“Increasing inference compute is much more cost-effective than training compute, potentially by orders of magnitude.”

Conclusion? Inference is expensive, and pushing down the cost is the only way large models will scale.

I Have Questions!

Why is compute so expensive!?

When we send a prompt to an LLM like ChatGPT, is that entire round trip what we mean by inference?

Why do some responses come back instantly while others take several seconds?

And if latency varies that much, can anything be done to make it faster?

Only One Thing to Do: Build.

Those questions led me right back to ChatGPT. I wanted to understand inference on a deeper level, not just read about it. That is when my builder brain took over.

So I asked ChatGPT what kind of project would help me understand how inference actually works.

“Skyline - SLO‑aware LLM Serving with Prefill/Decode Disaggregation: A drop‑in serving layer (Rust or Python+CUDA) on top of vLLM/SGLang that (a) continuously batches requests, (b) separates prefill and decode across GPU pools, and (c) implements SLO‑aware scheduling with preemption (low TTFT requests get priority without destroying throughput). Why now: DistServe‑style prefill/decode disaggregation and SLO scheduling are increasingly standard for production; few open projects make them turnkey. You learn: batching, load‑balancing, queueing theory, NCCL graph, CUDA streams; reliability under churn. Benchmarks to publish: p50/p99 TTFT and tok/s vs vLLM baseline at mixed workloads (short chat + long RAG), SLO attainment (%) under load.”

Hooked by the Idea

I wasn’t exactly sure what the hell any of the above meant, but the name “Skyline” was at least dope enough to pique my interest.

As of now, my goal for this project is to learn why inference optimization matters and how to make inference more efficient. Luckily, there are already accepted best practices for improving LLM response time, and the first step is to establish a baseline runtime for the model.

At the simplest level, it appears we need three things:

  • A GPU
  • An open-source LLM
  • A way to send text input, receive a response, and measure the time it takes (a minimal sketch follows below).

I began with a single GPU, though eventually the setup will expand to a cluster of GPUs to reflect how large-scale systems serve models in production.
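Here is a rough sketch of what that baseline measurement could look like, using Hugging Face Transformers on a single GPU. The model name, prompt, and generation settings below are placeholders, not necessarily what Skyline ends up using:

```python
# Minimal baseline: load an open-source model on the GPU, send one prompt,
# and time the full round trip.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder: any small open model works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="cuda"
)

prompt = "Explain what LLM inference is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

generated = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
print(f"{generated} tokens in {elapsed:.2f}s ({generated / elapsed:.1f} tok/s)")
```

Even a crude number like tokens per second gives me a baseline to compare everything else against later.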

Where vLLM Enters the Picture

One challenge is that serving requests one at a time leaves the GPU badly underutilized: during generation the model produces a single token per step, which is nowhere near enough work to keep the hardware busy. Frameworks such as vLLM exist to solve that problem. They batch multiple requests together to keep the GPU fully utilized, which minimizes idle time and prevents wasted compute cycles.

Though the batching step sounds simple enough, it has a huge impact on throughput and latency. As I would later learn, it is also the source of several bottlenecks that cause those unpredictable delays when you use LLMs in practice.
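For comparison, here is a minimal sketch of the same kind of measurement through vLLM, which schedules all of the prompts as one batch instead of running them one after another. Again, the model name and sampling settings are placeholders:

```python
# Rough sketch: time a small batch of prompts served through vLLM.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder: any open model
sampling = SamplingParams(max_tokens=128, temperature=0.0)

prompts = [
    "Summarize what inference means for an LLM.",
    "Why do GPUs sit idle when serving one request at a time?",
    "What is continuous batching?",
]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)  # vLLM batches these internally
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens} tokens across {len(prompts)} prompts in {elapsed:.2f}s")
```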

What Comes Next

This project, Skyline, is my way of digging beneath the headlines. It is an attempt to understand, by building from the ground up, why inference is both the most expensive and the most interesting problem in AI today. In the next post, I will share what happened when I connected my first real inference engine to a GPU and watched it fail spectacularly.