My LLM Benchmarks Are Bad (And That’s the Point)
In the last post, I talked about why inference matters, particularly how it drives the cost of deploying AI models, and why responses from tools like ChatGPT can swing from snappy to painfully slow.
This post is about the next question:
If inference is so important, how do you actually measure it?
You can read papers, watch keynotes, and scroll X, but the only way any of this really clicked for me was when I built something and watched it struggle in real time.
So that’s what this phase is about:
Build a test rig. Beat on a real model. See where it breaks.
Why Start With Measurement?
In modern AI systems, inference latency isn’t just “how fast the model is.” It’s shaped by how requests get batched, how GPU memory is managed, and how the scheduler orders work across the whole serving stack.
A bunch of really good work out there pushes this measurement-first mindset.
So the goal for this phase was simple:
Build a test rig that can send realistic traffic to an open-source LLM (Meta’s Llama), collect latency data, and clearly show where things fall apart.
To get started, I went back to the original prompt I gave ChatGPT when I first brainstormed this project. Then I switched to Claude Code to actually help scaffold and ship the code, since it handles long-running refactors and planning really well.
The Skyline Roadmap (Phase 0 & 1)
At a high level, Skyline is currently split into two tracks: a mock baseline for fast iteration, and the real-GPU path this post is about. Each phase has three main pieces: a workload generator, an engine adapter, and the benchmark harness that ties them together.
After some back-and-forth with Claude on how to set this up for future phases, we landed on a pattern: a client that generates realistic traffic, and a server that just runs the model.
The client sends a mix of prompt lengths, output lengths, and priorities, timed to follow a configurable arrival pattern.
That client side is what I call the Workload Generator. It’s the part that pretends to be “real users” hitting the system.
Designing the Workload Generator
You can think of the workload generator in two parts:
a WorkloadSpec that describes the traffic pattern, and
a WorkloadGenerator that turns that recipe into actual “work orders” for the backend

RequestSpec: A Single Work Order
Here’s how a single request is represented:
from dataclasses import dataclass
from typing import Any, Dict

# A single inference request
@dataclass
class RequestSpec:
    request_id: str                 # "req_000042"
    prompt: str                     # "Hello! How are you doing today?"
    input_tokens: int               # 256 (estimated)
    expected_output_tokens: int     # 128
    priority: SLOPriority           # HIGH, MEDIUM, or LOW (Skyline's priority enum)
    slo_target_ms: int              # 200ms for high-priority chat
    arrival_time: float             # When to send this request
    metadata: Dict[str, Any]        # Extra context
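Filled in with the example values from the comments, one work order would look like this (purely illustrative, including the metadata key):

request = RequestSpec(
    request_id="req_000042",
    prompt="Hello! How are you doing today?",
    input_tokens=256,
    expected_output_tokens=128,
    priority=SLOPriority.HIGH,
    slo_target_ms=200,          # high-priority chat gets the tightest TTFT budget
    arrival_time=1.23,          # the runner compares this against the clock before sending
    metadata={"workload": "short_chat"},  # hypothetical extra context
)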
WorkloadGenerator: Turning Recipes into Traffic
The generator turns a high-level spec into hundreds of timed requests:
from typing import AsyncGenerator, List

class BaseWorkloadGenerator:
    async def generate(self) -> List[RequestSpec]:
        """Generate all requests upfront (for replay benchmarks)."""

    async def generate_stream(self) -> AsyncGenerator[RequestSpec, None]:
        """Generate requests in real-time (for live benchmarks)."""
This separation is important:
I can swap recipes (chat vs. RAG vs. bursty) without touching the execution logic.
In practice, you might write a recipe like:
“Average 256-token prompts, 10 requests per second, mostly HIGH priority.”
The generator then turns that into timed orders like:
“At t = 1.23 s, send prompt P, target 128 tokens, priority HIGH, SLO 200 ms.”
Later, the executor reads each order and sends it to the model, using the SLO and priority metadata to evaluate whether the system is behaving or not.
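To make that recipe-to-orders step concrete, here is a minimal sketch with illustrative helper names, not Skyline’s actual internals:

import random

def sample_arrival_times(rate_per_s: float, duration_s: float) -> list:
    """Poisson arrivals: exponentially distributed gaps between consecutive requests."""
    times, t = [], 0.0
    while True:
        t += random.expovariate(rate_per_s)  # mean gap = 1 / rate
        if t >= duration_s:
            return times
        times.append(t)

# "Average 256-token prompts, 10 req/s, HIGH priority" -> timed work orders
requests = [
    RequestSpec(
        request_id=f"req_{i:06d}",
        prompt="Hello! How are you doing today?",           # drawn from templates in practice
        input_tokens=max(1, int(random.gauss(256, 128))),   # normal(256, 128), clamped
        expected_output_tokens=max(1, int(random.gauss(128, 64))),
        priority=SLOPriority.HIGH,                          # sampled from priority_dist in practice
        slo_target_ms=200,
        arrival_time=t,
        metadata={},
    )
    for i, t in enumerate(sample_arrival_times(rate_per_s=10.0, duration_s=60.0))
]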
Example: The Short Chat Workload
Here’s a concrete workload that simulates a chat app:
# short_chat (chat.py) - Interactive conversations
SHORT_CHAT_WORKLOAD = WorkloadSpec(
    name="short_chat",
    # Input: 256±128 tokens (typical chat message)
    input_length_dist=Distribution(
        type="normal",
        params={"mean": 256, "std": 128},
    ),
    # Output: 128±64 tokens (brief response)
    output_length_dist=Distribution(
        type="normal",
        params={"mean": 128, "std": 64},
    ),
    # Traffic: Poisson arrivals at 10 req/s
    arrival_pattern=ArrivalPattern(
        type="poisson",
        params={"rate": 10.0},
    ),
    # Priority: 70% high-priority (interactive users)
    priority_dist={
        SLOPriority.HIGH: 0.7,    # <200ms TTFT target
        SLOPriority.MEDIUM: 0.3,  # <500ms TTFT target
        SLOPriority.LOW: 0.0,
    },
    # Prompt templates for realistic conversation
    prompt_templates=[
        "Hello! How are you doing today?",
        "Can you help me understand how machine learning works?",
        # ... ~20 diverse templates
    ],
)
Each request that comes out of this workload carries a prompt, estimated input and output token counts, a priority, an SLO target, and an arrival time.
That’s the “front half” of the system.
Next up is making it talk to a real engine.
Talking to vLLM: The VLLMEngine Adapter
On the server side, I run vLLM exposing an OpenAI-compatible HTTP API.
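Starting it is a one-liner along the lines of:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000
(That’s the stock vLLM OpenAI-compatible entrypoint, nothing Skyline-specific; exact flags depend on your vLLM version.)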
On the client side, I built a VLLMEngine that hides all the HTTP details and gives the benchmark runner a simple interface:
from typing import Any, Dict

import aiohttp

class VLLMEngine:
    """Benchmark-compatible vLLM engine adapter."""

    async def __aenter__(self):
        """Initialize HTTP session."""
        self._session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Clean up HTTP session."""
        if self._session:
            await self._session.close()

    async def generate(self, prompt: str, max_tokens: int) -> Dict[str, Any]:
        """
        Send request to vLLM server via OpenAI-compatible API.

        Returns a dict like:
        {
            'response': 'Hello! I am doing well...',
            'input_tokens': 256,
            'output_tokens': 128,
            'ttft_ms': 2687.1,          # Time to first token
            'e2e_latency_ms': 8957.2,   # Total latency
            'tokens_per_second': 14.3,
            'success': True,
            'error': None,
        }
        """
The benchmarks don’t care that this is vLLM specifically.
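All the vLLM-specific plumbing lives inside generate(): stream from the OpenAI-compatible /v1/completions endpoint, stamp the first chunk for TTFT, and time the whole call for end-to-end latency. Here is a rough sketch of that body, trimmed to the latency fields; self.base_url and self.model are assumed attribute names, and this is not the exact Skyline implementation:

import json
import time

# Inside VLLMEngine (sketch):
async def generate(self, prompt: str, max_tokens: int) -> Dict[str, Any]:
    payload = {
        "model": self.model,            # e.g. "meta-llama/Llama-3.1-8B-Instruct"
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,                 # streaming is what makes TTFT observable
    }
    start = time.perf_counter()
    ttft_ms, chunks = None, []
    async with self._session.post(f"{self.base_url}/v1/completions", json=payload) as resp:
        async for raw_line in resp.content:  # server-sent events: "data: {...}" per line
            line = raw_line.decode().strip()
            if not line.startswith("data:") or line == "data: [DONE]":
                continue
            if ttft_ms is None:
                ttft_ms = (time.perf_counter() - start) * 1000
            chunks.append(json.loads(line[len("data:"):])["choices"][0]["text"])
    e2e_ms = (time.perf_counter() - start) * 1000
    return {
        "response": "".join(chunks),
        "ttft_ms": ttft_ms,
        "e2e_latency_ms": e2e_ms,
        "success": True,
        "error": None,
    }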
Later I could swap in SGLang or TensorRT-LLM by implementing the same interface.
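The contract the runner relies on is tiny, and one way to pin it down is a typing.Protocol. This is a sketch of that idea, not necessarily how Skyline organizes it:

from typing import Any, Dict, Protocol

class InferenceEngine(Protocol):
    """What the benchmark runner needs from any backend."""

    async def __aenter__(self) -> "InferenceEngine": ...
    async def __aexit__(self, exc_type, exc_val, exc_tb) -> None: ...
    async def generate(self, prompt: str, max_tokens: int) -> Dict[str, Any]: ...

# VLLMEngine already fits this shape; a hypothetical SGLangEngine or
# TensorRTLLMEngine would only need to implement the same three methods.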
Setting Up the Server (Why EC2?)
For the backend, I used an Amazon EC2 g5.2xlarge instance serving meta-llama/Llama-3.1-8B-Instruct (~21 GB loaded).
I actually started on Vast.ai because it’s cheap and great for experimentation, but many of the machines I tried were spot-style: they would disappear, and I didn’t want my main benchmarking environment to randomly vanish in the middle of a phase.
Since I plan to run this project over weeks and months, I ended up going with the EC2 instance instead: something I could count on to still be there the next morning.
The real lesson here: don’t overthink infra too early.
I’d used EC2 before, so I just picked something I knew I could get working quickly.
The 30,000-Foot View
After Claude finished scaffolding the project and we ran our first end-to-end tests, I asked it for a “30,000-foot view” of what we had so far.
Right now we’re solidly in Phase 0 & 1: workload generation, a real engine adapter, and a benchmark harness that can tell us how bad things are, with none of the clever serving logic yet.
Why start with benchmarking instead of jumping straight into clever scheduling or speculative decoding?
Because you can’t improve what you haven’t measured.
Baseline first. Ego later.
Once you know how the system behaves under realistic load, you can start asking:
“Is batching the problem? Is memory the problem? Is the scheduler the problem?”
Inside the Skyline Benchmark Harness
When I run a command like:
python bench/real_engine_benchmark.py short_chat 60 vllm
the CLI pulls together all the major components: the workload spec, the generator, the vLLM engine adapter, and the metrics collection.
How a Benchmark Flows
Step by step, here’s what happens: the generator produces timed RequestSpecs, the engine session opens, each request waits for its arrival time under a concurrency cap, every call gets timed for TTFT and end-to-end latency, each result is checked against its SLO target, and the stats are aggregated at the end.
Here’s a very condensed view of the concurrency logic:
import asyncio
import time

semaphore = asyncio.Semaphore(max_concurrent)  # e.g., 50

async def process_request(request: RequestSpec):
    async with semaphore:  # Never exceed max_concurrent
        # Respect arrival time for realistic timing
        now = time.time()
        if request.arrival_time > now:
            await asyncio.sleep(request.arrival_time - now)

        # Execute request
        result = await engine.generate(
            request.prompt,
            request.expected_output_tokens,
        )

        # Check if SLO was met
        result["slo_met"] = result["ttft_ms"] <= request.slo_target_ms
        return result

# Inside the async benchmark runner:
async with engine:
    tasks = [process_request(req) for req in requests]
    results = []
    for task in asyncio.as_completed(tasks):
        results.append(await task)
Then we compute stats like:
import numpy as np

ttfts = [r["ttft_ms"] for r in successful_results]

ttft_stats = {
    "mean": np.mean(ttfts),
    "median": np.median(ttfts),
    "p95": np.percentile(ttfts, 95),
    "p99": np.percentile(ttfts, 99),
    "std": np.std(ttfts),
    "min": np.min(ttfts),
    "max": np.max(ttfts),
}

slo_attainment_rate = sum(r["slo_met"] for r in successful_results) / len(successful_results)
The important thing is the mental model: a workload generator feeds timed requests into an engine adapter, and everything downstream just measures what happened.
That modular design means I can swap workloads, swap engines, and rerun the same benchmark as the serving layer evolves, without rewriting the harness.
Phase 1 Results: How Bad Was It?
Enough setup. What happened when I actually pointed Skyline at a real GPU?
Here was the benchmark configuration: the short_chat workload, 60 seconds of Poisson traffic at 10 req/s, running against vLLM on the g5.2xlarge.
Summary Results
SLO Targets vs Reality:
If you built a real product on top of this as-is, your users would think it’s broken.
Which, for our purposes, is perfect.
Because now we have hard data that says:
Vanilla “just run vLLM with continuous batching” is absolutely not enough for realistic, mixed workloads.
So What’s Going Wrong?
These numbers line up with the failure modes you see in the vLLM paper and more broadly in LLM serving: long prefills stalling short requests, queues building up under bursty arrivals, and a scheduler that has no idea which requests are latency-sensitive.
You can see all of that tension hiding inside those p95 / p99 TTFT numbers.
Key Takeaways Before Phase 2
So what did Phase 1 actually teach me? Mainly that the vanilla setup falls apart under realistic, mixed traffic, and that I now have the numbers to prove it.
That’s the whole point of these first phases:
measure first, then optimize.
What’s Next: Phase 2 (Skyline Proper)
In Phase 2, Skyline stops being just a measurement tool and starts acting like a real serving layer.
The plan is to introduce a routing layer in front of vLLM that is aware of priorities and SLOs, starting by separating prefill-heavy from decode-heavy traffic.
We set a conservative goal of ≥ 25% p95 TTFT improvement over the Phase 1 baseline as the minimum bar. That’s not the end state — it’s just the “if we can’t even beat this, something’s fundamentally wrong” bar.
From there, the stretch goal is to see how close we can get back toward that mock baseline behavior, but this time on a real GPU, under real load.
In the next post, I’ll walk through the Skyline router and how we split prefill vs decode so short requests stop sitting behind giant prompts like they’re stuck in line at Costco.
Stay tuned.