My LLM Benchmarks Are Bad (And That’s the Point)

In the last post, I talked about why inference matters, particularly how it drives the cost of deploying AI models, and why responses from tools like ChatGPT can swing from snappy to painfully slow.

This post is about the next question:

If inference is so important, how do you actually measure it?

You can read papers, watch keynotes, and scroll X, but the only way any of this really clicked for me was when I built something and watched it struggle in real time.

So that’s what this phase is about:

Build a test rig. Beat on a real model. See where it breaks.


Why Start With Measurement?

In modern AI systems, inference latency isn’t just “how fast the model is.” It’s shaped by:

  • How efficiently you batch requests
  • How you manage GPU memory (KV cache, fragmentation, etc.)
  • How your system schedules workloads and handles bursts

A bunch of really good work out there pushes this measurement-first mindset:

  • “Efficient Serving of Large Language Models” (vLLM) — explains how continuous batching can dramatically boost throughput, but also make latency worse if you’re not careful.
  • Amazon Builder’s Library: “Time to First Byte” — shows why you should care way more about tail latency than just the average.
  • Google SRE Book: “Service Level Objectives” — nails the idea that you should optimize for explicit SLOs (like “p95 TTFT < 200ms for chat”), not vibes.

So the goal for this phase was simple:

    Build a test rig that can send realistic traffic to an open-source LLM (Meta’s Llama), collect latency data, and clearly show where things fall apart.

    To get started, I went back to the original prompt I gave ChatGPT when I first brainstormed this project. Then I switched to Claude Code to actually help scaffold and ship the code, since it handles long-running refactors and planning really well.


    The Skyline Roadmap (Phase 0 & 1)

    At a high level, Skyline is currently split into two tracks: Phase 0 (mock-engine benchmarks to shake out the architecture) and Phase 1 (real vLLM running on a GPU).

    Each phase has three main pieces:

  • A workload generator (what to send, when to send it)
  • A vLLM client (how to talk to the model)
  • A benchmark runner (how to schedule, measure, and report)

    After some back-and-forth with Claude on how to set this up for future phases, we landed on a pattern:

  • A client-side repo that generates traffic and runs benchmarks
  • A server-side vLLM instance running on a GPU (AWS EC2 in my case)

    The client sends a mix of:

  • Short chat prompts — tests interactive latency (most sensitive to TTFT)
  • Long RAG contexts — tests throughput and GPU memory behavior
  • Bursty mixed workloads — tests how the system behaves under spikes

    That client-side piece is what I call the Workload Generator. It’s the part that pretends to be “real users” hitting the system.


    Designing the Workload Generator

    You can think of the workload generator in two parts:

  • The Recipe → a WorkloadSpec that describes the traffic pattern
  • The Cook → a WorkloadGenerator that turns that recipe into actual “work orders” for the backend

    RequestSpec: A Single Work Order

    Here’s how a single request is represented:

    from dataclasses import dataclass
    from typing import Any, Dict

    # A single inference request
    @dataclass
    class RequestSpec:
        request_id: str              # "req_000042"
        prompt: str                  # "Hello! How are you doing today?"
        input_tokens: int            # 256 (estimated)
        expected_output_tokens: int  # 128
        priority: SLOPriority        # HIGH, MEDIUM, or LOW
        slo_target_ms: int           # 200ms for high-priority chat
        arrival_time: float          # When to send this request
        metadata: Dict[str, Any]     # Extra context
    

    WorkloadGenerator: Turning Recipes into Traffic

    The generator turns a high-level spec into hundreds of timed requests:

    class BaseWorkloadGenerator:
        async def generate(self) -> List[RequestSpec]:
            """Generate all requests upfront (for replay benchmarks)."""
    
        async def generate_stream(self) -> AsyncGenerator[RequestSpec, None]:
            """Generate requests in real-time (for live benchmarks)."""
    

    This separation is important:

    I can swap recipes (chat vs. RAG vs. bursty) without touching the execution logic.

    In practice, you might write a recipe like:

    “Average 256-token prompts, 10 requests per second, mostly HIGH priority.”

    The generator then turns that into timed orders like:

    “At t = 1.23 s, send prompt P, target 128 tokens, priority HIGH, SLO 200 ms.”

    Later, the executor reads each order and sends it to the model, using the SLO and priority metadata to evaluate whether the system is behaving or not.
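
    To make that “recipe to timed orders” step concrete, here’s a minimal sketch of how arrival times can be sampled for the 10 req/s case above. The helper below is illustrative only, not the actual Skyline generator, but the math is the standard one: Poisson arrivals at rate λ are exponentially distributed gaps with mean 1/λ.

    import random

    def poisson_arrival_times(rate_per_s: float, duration_s: float) -> list[float]:
        """Illustrative helper (not the Skyline generator): sample arrival offsets
        for a Poisson process as exponential gaps at rate_per_s."""
        t, arrivals = 0.0, []
        while True:
            t += random.expovariate(rate_per_s)  # mean gap = 1 / rate_per_s seconds
            if t >= duration_s:
                return arrivals
            arrivals.append(t)

    # Roughly 600 arrivals over a 60 s run at 10 req/s
    arrivals = poisson_arrival_times(rate_per_s=10.0, duration_s=60.0)
    print(len(arrivals), [round(a, 2) for a in arrivals[:3]])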


    Example: The Short Chat Workload

    Here’s a concrete workload that simulates a chat app:

  • Most messages are short to medium length
  • Users arrive randomly around 10 req/s
  • Most traffic is high priority (interactive)

    # short_chat (chat.py) - Interactive conversations
    SHORT_CHAT_WORKLOAD = WorkloadSpec(
        name="short_chat",
    
        # Input: 256±128 tokens (typical chat message)
        input_length_dist=Distribution(
            type="normal",
            params={"mean": 256, "std": 128},
        ),
    
        # Output: 128±64 tokens (brief response)
        output_length_dist=Distribution(
            type="normal",
            params={"mean": 128, "std": 64},
        ),
    
        # Traffic: Poisson arrivals at 10 req/s
        arrival_pattern=ArrivalPattern(
            type="poisson",
            params={"rate": 10.0},
        ),
    
        # Priority: 70% high-priority (interactive users)
        priority_dist={
            SLOPriority.HIGH: 0.7,    # <200ms TTFT target
            SLOPriority.MEDIUM: 0.3,  # <500ms TTFT target
            SLOPriority.LOW: 0.0,
        },
    
        # Prompt templates for realistic conversation
        prompt_templates=[
            "Hello! How are you doing today?",
            "Can you help me understand how machine learning works?",
            # ... ~20 diverse templates
        ],
    )
    
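    For intuition, sampling a single request from a spec like that boils down to something like the following. This is a hypothetical, hard-coded version of what the generator does with the distributions above, not the real WorkloadGenerator code:

    import random

    def sample_short_chat_request() -> dict:
        """Hypothetical helper mirroring the short_chat spec above:
        normal length distributions plus weighted priorities."""
        input_tokens = max(1, int(random.gauss(256, 128)))   # 256 ± 128, clamped to >= 1
        output_tokens = max(1, int(random.gauss(128, 64)))   # 128 ± 64, clamped to >= 1
        priority = random.choices(["HIGH", "MEDIUM"], weights=[0.7, 0.3])[0]
        return {
            "input_tokens": input_tokens,
            "expected_output_tokens": output_tokens,
            "priority": priority,
            "slo_target_ms": 200 if priority == "HIGH" else 500,  # TTFT targets from the spec
        }

    print(sample_short_chat_request())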

    Each request that comes out of this workload carries:

  • A realistic prompt
  • A target output length
  • A priority and SLO (for example, 200–500ms TTFT)

    That’s the “front half” of the system.

    Next up is making it talk to a real engine.


    Talking to vLLM: The VLLMEngine Adapter

    On the server side, I run vLLM exposing an OpenAI-compatible HTTP API.
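
    For reference, that server is just vLLM’s built-in OpenAI-compatible server. Launching it looks roughly like this (the model matches what I benchmark below; the port and any extra flags will depend on your setup):

    # port and extra flags may differ from my exact setup
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --port 8000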

    On the client side, I built a VLLMEngine that hides all the HTTP details and gives the benchmark runner a simple interface:

    import aiohttp
    from typing import Any, Dict

    class VLLMEngine:
        """Benchmark-compatible vLLM engine adapter."""
    
        async def __aenter__(self):
            """Initialize HTTP session."""
            self._session = aiohttp.ClientSession()
            return self
    
        async def __aexit__(self, exc_type, exc_val, exc_tb):
            """Clean up HTTP session."""
            if self._session:
                await self._session.close()
    
        async def generate(self, prompt: str, max_tokens: int) -> Dict[str, Any]:
            """
            Send request to vLLM server via OpenAI-compatible API.
    
            Returns a dict like:
            {
                'response': 'Hello! I am doing well...',
                'input_tokens': 256,
                'output_tokens': 128,
                'ttft_ms': 2687.1,         # Time to first token
                'e2e_latency_ms': 8957.2,  # Total latency
                'tokens_per_second': 14.3,
                'success': True,
                'error': None,
            }
            """
    
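    The interesting part is how ttft_ms gets measured: stream the response and timestamp the first chunk. Here’s a minimal sketch of that idea against vLLM’s OpenAI-compatible /v1/completions endpoint. The real adapter has retries, token accounting, and error handling; this is just the core loop, and the helper name is made up:

    import json
    import time
    import aiohttp

    async def timed_completion(session: aiohttp.ClientSession, base_url: str,
                               model: str, prompt: str, max_tokens: int) -> dict:
        """Illustrative sketch (not the full adapter): stream a completion and
        record time-to-first-token plus end-to-end latency."""
        payload = {"model": model, "prompt": prompt,
                   "max_tokens": max_tokens, "stream": True}
        start = time.perf_counter()
        ttft_ms, chunks = None, []

        async with session.post(f"{base_url}/v1/completions", json=payload) as resp:
            async for raw in resp.content:                # server-sent events, line by line
                line = raw.decode().strip()
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                if ttft_ms is None:                       # first streamed chunk ≈ first token
                    ttft_ms = (time.perf_counter() - start) * 1000
                chunks.append(json.loads(line[len("data: "):])["choices"][0]["text"])

        e2e_ms = (time.perf_counter() - start) * 1000
        return {"response": "".join(chunks), "ttft_ms": ttft_ms, "e2e_latency_ms": e2e_ms}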

    The benchmarks don’t care that this is vLLM specifically.

    Later I could swap in SGLang or TensorRT-LLM by implementing the same interface.
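
    That “same interface” is small enough to spell out. Roughly (a sketch, not the exact class in the repo):

    from typing import Any, Dict, Protocol

    class InferenceEngine(Protocol):
        """Sketch of the contract: everything the benchmark runner needs from a backend."""
        async def __aenter__(self) -> "InferenceEngine": ...
        async def __aexit__(self, exc_type, exc_val, exc_tb) -> None: ...
        async def generate(self, prompt: str, max_tokens: int) -> Dict[str, Any]: ...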


    Setting Up the Server (Why EC2?)

    For the backend, I used an Amazon EC2 g5.2xlarge instance:

  • GPU: NVIDIA A10G (24 GB VRAM)
  • Model: meta-llama/Llama-3.1-8B-Instruct (~21 GB loaded)

    I actually started on Vast.ai because it’s cheap and great for experimentation, but many of the machines I tried were spot-style: they would disappear, and I didn’t want my main benchmarking environment to randomly vanish in the middle of a phase.

    Since I plan to run this project over weeks and months, I ended up going with:

  • EC2 for a stable, repeatable “home base”
  • Keeping Vast.ai in my back pocket for future one-off experiments

    The real lesson here: don’t overthink infra too early.

    I’d used EC2 before, so I just picked something I knew I could get working quickly.


    The 30,000-Foot View

    After Claude finished scaffolding the project and we ran our first end-to-end tests, I asked it for a “30,000-foot view” of what we had so far.

    Right now we’re solidly in Phase 0 & 1:

  • Phase 0: Mock engine benchmarks to shake out the architecture
  • Phase 1: Real vLLM integration on GPU to see how bad things really are

    Why start with benchmarking instead of jumping straight into clever scheduling or speculative decoding?

    Because you can’t improve what you haven’t measured.

    Baseline first. Ego later.

    Once you know how the system behaves under realistic load, you can start asking:

    “Is batching the problem? Is memory the problem? Is the scheduler the problem?”


    Inside the Skyline Benchmark Harness

    When I run a command like:

    python bench/real_engine_benchmark.py short_chat 60 vllm
    

    the CLI pulls together all the major components:

  • The workload generator — defines what to send and when
  • The executor — enforces timing and concurrency
  • The metrics system — records what happened
  • The analyzer + reporter — turn raw numbers into summaries

    How a Benchmark Flows

    Step by step, here’s what happens:

  • Workload generation
  • Engine initialization
  • Concurrent execution
  • Metrics and analysis
  • Reporting

    Here’s a very condensed view of the concurrency logic:

    import asyncio
    import time

    semaphore = asyncio.Semaphore(max_concurrent)  # e.g., 50
    
    async def process_request(request: RequestSpec):
        async with semaphore:  # Never exceed max_concurrent
            # Respect arrival time for realistic timing
            now = time.time()
            if request.arrival_time > now:
                await asyncio.sleep(request.arrival_time - now)
    
            # Execute request
            result = await engine.generate(
                request.prompt,
                request.expected_output_tokens,
            )
    
            # Check if SLO was met
            result["slo_met"] = result["ttft_ms"] <= request.slo_target_ms
            return result
    
    async with engine:
        tasks = [process_request(req) for req in requests]
        results = []
        for task in asyncio.as_completed(tasks):
            results.append(await task)
    

    Then we compute stats like:

    import numpy as np

    ttfts = [r["ttft_ms"] for r in successful_results]
    
    ttft_stats = {
        "mean":   np.mean(ttfts),
        "median": np.median(ttfts),
        "p95":    np.percentile(ttfts, 95),
        "p99":    np.percentile(ttfts, 99),
        "std":    np.std(ttfts),
        "min":    np.min(ttfts),
        "max":    np.max(ttfts),
    }
    
    slo_attainment_rate = sum(r["slo_met"] for r in successful_results) / len(successful_results)
    

    The important thing is the mental model:

  • Workloads define what and when
  • The executor controls how many at once
  • Metrics record what happened
  • The analyzer tells you how bad or good it is

    That modular design means I can:

  • Swap engines (mock vs vLLM vs SGLang)
  • Add new workloads (for example, “NBA live betting traffic”)
  • Improve metrics without touching the runner


    Phase 1 Results: How Bad Was It?

    Enough setup. What happened when I actually pointed Skyline at a real GPU?

    Here was the benchmark configuration:

  • Server: AWS g5.2xlarge (NVIDIA A10G, 24 GB VRAM)
  • Model: Llama-3.1-8B-Instruct (~21 GB loaded)
  • Duration: 60 seconds per workload
  • Max Concurrency: 50 in-flight requests

    Summary Results

    SLO Targets vs Reality:

  • short_chat
  • long_rag
  • bursty_mix

    If you built a real product on top of this as-is, your users would think it’s broken.

    Which, for our purposes, is perfect.

    Because now we have hard data that says:

    Vanilla “just run vLLM with continuous batching” is absolutely not enough for realistic, mixed workloads.

    So What’s Going Wrong?

    These numbers line up with the failure modes you see in the vLLM paper and more broadly in LLM serving:

  • Head-of-line blocking
  • Memory contention and KV cache pressure
  • No notion of priority or SLOs
  • Single shared pool for prefill and decode

    You can see all of that tension hiding inside those p95 / p99 TTFT numbers.


    Key Takeaways Before Phase 2

    So what did Phase 1 actually teach me?

  • When you use realistic workloads (short chat + long RAG + bursts) on a real GPU, vanilla vLLM misses SLOs for 90–99% of requests.
  • Our key metric, p95 TTFT, landed in the 5–9 second range across workloads, which is unusable for interactive applications.
  • But now we have quantifiable baselines and a repeatable harness. We know exactly how bad things are, and we have data we can use to validate future improvements.

    That’s the whole point of these first phases:

    measure first, then optimize.


    What’s Next: Phase 2 (Skyline Proper)

    In Phase 2, Skyline stops being just a measurement tool and starts acting like a real serving layer.

    The plan is to introduce:

  • A router — classify each request by SLO priority and characteristics (short vs long, chat vs RAG); there’s a rough sketch of this after the list.
  • A prefill pool — a dedicated path for context processing, optimized for large batches and long inputs.
  • A decode pool — a dedicated path for token generation, tuned for steady cadence and low TTFT for short requests.
  • KV cache management — treat KV cache like a first-class resource, so we don’t thrash GPU memory or blow up on long contexts.
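
    Phase 2 isn’t built yet, so take this with a grain of salt, but here’s the kind of routing decision I have in mind for that first bullet. All of the names and the token threshold here are hypothetical placeholders, not Skyline code:

    from dataclasses import dataclass

    LONG_INPUT_TOKENS = 1024  # hypothetical cutoff; would be tuned against Phase 1 data

    @dataclass
    class RoutingDecision:
        pool: str       # "prefill" (long contexts) or "decode" (short interactive)
        priority: str   # carried through so the scheduler can honor SLOs

    def route(input_tokens: int, priority: str) -> RoutingDecision:
        """Hypothetical router sketch: send long contexts to the prefill-optimized
        pool and short, latency-sensitive requests to the decode pool."""
        pool = "prefill" if input_tokens >= LONG_INPUT_TOKENS else "decode"
        return RoutingDecision(pool=pool, priority=priority)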

    We set a conservative goal of ≥ 25% p95 TTFT improvement over the Phase 1 baseline as the minimum bar. That’s not the end state — it’s just the “if we can’t even beat this, something’s fundamentally wrong” bar.

    From there, the stretch goal is to see how close we can get back toward that mock baseline behavior, but this time on a real GPU, under real load.

    In the next post, I’ll walk through the Skyline router and how we split prefill vs decode so short requests stop sitting behind giant prompts like they’re stuck in line at Costco.

    Stay tuned.