In this post we go deep on two challenges we've hit scaling evroc Think Models - our managed inference service. The first is a debugging story: a parser bug that made Kimi K2.5 functionally broken for agentic workloads, and the investigation that led to the fix. The second is a continuous release problem: how do you maintain a competitive OSS model catalogue across a live GPU fleet without downtime or SLA impact?
How do these two challenges relate to each other? Read on to find the answer.

evroc is the European Cloud, with three regions (Stockholm, Frankfurt, Paris), each of which will have three availability zones. We will scale our AI services accordingly, across multiple regions, over the next couple of years.
For the first iteration of our inference platform we are running in the Stockholm region. Our evroc Think Models inference stack has four main components:
+-------------------------------------+
|            Console / UI             |
+------------------+------------------+
|  API Mgmt Layer  | Inference Router |
+------------------+------------------+
|      GPU Orchestration Engine       |
+-------------------------------------+
|     GPU Worker Node (DGX B200)      |
|  [0][1][2][3]  [4][5][6][7]  GPUs   |
|          NVLink / NVSwitch          |
+-------------------------------------+
Starting with the bottom layer of the stack: the hardware. We're running NVIDIA DGX B200 systems, each with 8 Blackwell GPUs interconnected via fifth-generation NVLink. The B200 delivers roughly 57% faster training performance than the H100, and for inference workloads a single B200 matches the performance of 3-4 H100 GPUs.
These machines are impressive - and loud. At 120 decibels during operation, they're louder than a rock concert. The thermal and acoustic challenges of running this hardware at scale in a data center are non-trivial, but that's a topic for another post.
GPUs are expensive, and leaving them idle burns money. Our GPU Orchestration Engine schedules models across the fleet based on their parameter counts and configuration requirements. We also run fractional GPU workloads to use our hardware more efficiently.
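To make the scheduling idea concrete, here is a deliberately simplified sketch of VRAM-based placement using first-fit decreasing bin packing. The GPU sizes, model names, and footprints are made-up illustrations, and the real orchestration engine considers far more than VRAM.

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    """One physical GPU with a fixed VRAM budget (GB)."""
    name: str
    vram_total: float
    vram_used: float = 0.0

    @property
    def vram_free(self) -> float:
        return self.vram_total - self.vram_used

def place_models(models: list[tuple[str, float]], gpus: list[Gpu]) -> dict[str, str]:
    """First-fit decreasing: place the biggest models first,
    each onto the first GPU with enough free VRAM."""
    placement: dict[str, str] = {}
    for model, vram_needed in sorted(models, key=lambda m: -m[1]):
        target = next((g for g in gpus if g.vram_free >= vram_needed), None)
        if target is None:
            raise RuntimeError(f"no GPU can fit {model} ({vram_needed} GB)")
        target.vram_used += vram_needed
        placement[model] = target.name
    return placement

# Two B200-class GPUs and three hypothetical models; all three
# fit on the first GPU, leaving the second free for other work.
gpus = [Gpu("b200-0", 180.0), Gpu("b200-1", 180.0)]
models = [("large-llm-fp8", 120.0), ("small-llm-fp16", 16.0), ("embedder", 4.0)]
print(place_models(models, gpus))
```

First-fit decreasing is a classic heuristic, not what production schedulers use verbatim, but it captures why parameter counts drive placement.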
The Inference Router routes OpenAI-compatible API requests to the appropriate model deployment, based on the model you choose from our catalogue. A typical request looks like this:
curl -X POST https://models.think.evroc.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer THINK_API_KEY" \
-d '{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "system",
"content": "You are travelling the galaxy."
},
{
"role": "user",
"content": "In which galaxy am I now?"
}
],
"temperature": 0.7,
"max_tokens": 200
}'
The Inference Router and GPU Orchestration Engine sit in the data plane, handling chat completion streaming requests, embeddings, and audio transcriptions. The management plane, on the other hand, is where you manage access: creating, updating, and deleting the API keys used to reach our inference service.
Where it gets interesting is when you hit bugs. Let's walk through one we found, how we rolled out the fix, and what we take into consideration when doing so.
Background
We evaluate and run the most advanced open-source models for coding tasks and agentic development. While evaluating Kimi K2.5, we discovered the following bug.
Kimi K2.5 is designed to interleave chain-of-thought reasoning with structured tool calls - it thinks, then acts, potentially for hundreds of steps in a single session. This makes correct parser behavior load-bearing: a parser failure mid-session doesn't just break one tool call, it corrupts the conversation history and can cascade through subsequent turns.
The Bug
Tool calls intermittently failed during longer agentic sessions. Instead of vLLM delivering a structured tool call, the model appeared to "hand the turn back" mid-thought: the client received no tool call to execute, and the session stalled.
In the vLLM logs and the client UI (OpenCode, IDEs, etc.) raw internal model control tokens were visible inside the reasoning/thinking block:
Thinking: <|tool_calls_section_begin|> <|tool_call_begin|> functions.read:2
<|tool_call_argument_begin|> {"filePath": "...", "offset": 1691, "limit": 100}
<|tool_call_end|> <|tool_calls_section_end|>
These tokens are meant to be intercepted by vLLM's tool call parser and converted into a structured API response. Instead they were appearing as plain text in the thinking display, with tool_calls: null in the response.
Simultaneously, vLLM logs showed a secondary crash:
File "kimi_k2_tool_parser.py", line 522, in extract_tool_calls_streaming
IndexError: list index out of range
The Dead Ends
Buffer overflow: our first hypothesis was that oversized tool call arguments overflowed an internal buffer, leaking raw tokens into the response. We concluded that this caused silent argument truncation, not token leakage.
Prompt corruption: when an assistant message contains an empty content field, vLLM internally promotes it to a structured content list. Kimi's chat template wasn't designed for this format and inserts a malformed value into the prompt on subsequent turns. This corrupts the model's context and was a strong candidate for why failures became more frequent over longer sessions - potentially a contributing factor, but we couldn't confirm it as the root cause.
Reasoning token mismatch: Kimi's reasoning parser delegates to a DeepSeek parser, so one hypothesis was that Kimi uses different token strings for its thinking delimiters than the DeepSeek parser expects, making it effectively blind. This turned out to be wrong - the tokens do match - but checking it pointed us directly at the reasoning parser as the right component to investigate.
The Root Cause
Kimi K2.5 occasionally skips emitting its reasoning end token and jumps directly to emitting tool call tokens. This is a known model behavior - more likely in longer context sessions or under context pressure.
When this happens, vLLM's reasoning parser has no signal that the reasoning block has ended. Its job is to route tokens to either the thinking output or the main content stream, and without the close token it keeps routing everything - including tool call control tokens - into the thinking block. The tool call parser downstream never sees those tokens and produces tool_calls: null.
The IndexError crash is a separate but related issue: under certain streaming conditions, the tool call parser's internal state gets out of sync when a specific token arrives as a single-unit delta, causing a bounds check failure on the next delta. This can be triggered both organically and by the malformed token stream that results from the reasoning parser leak.
The Fix
The idea: when the model skips its reasoning end token and jumps directly to a tool call, the reasoning parser should recognize the tool call control tokens as an implicit signal that reasoning has ended. From that point on, all subsequent tokens should be routed to the content stream so the tool call parser can handle them correctly.
The implementation: firstly, the parser needs to know which tokens signal the start of a tool call section - these are resolved from the actual model tokenizer rather than hardcoded, keeping the fix correct across tokenizer updates. Secondly, once a tool call has started, that state needs to persist: subsequent tokens in the tool call body arrive without any special markers, so a stateless check would miss them.
This only activates when the reasoning end token has not been seen. If the model ends its reasoning block normally, the existing behavior is completely unchanged.
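The control flow of the fix can be sketched as a small stateful router. This is an illustration of the idea, not the actual vLLM patch: the tool-call section token is the one visible in the logs above, while the reasoning end token name here is a placeholder (the real fix resolves both from the model tokenizer rather than hardcoding them).

```python
# Illustrative sketch, not the actual vLLM patch.
REASONING_END = "<reasoning_end>"                    # placeholder name
TOOL_SECTION_BEGIN = "<|tool_calls_section_begin|>"  # token seen in the logs

class ReasoningRouter:
    """Routes streamed tokens into the thinking block or the content
    stream that the downstream tool call parser consumes."""

    def __init__(self) -> None:
        self.in_reasoning = True
        self.thinking: list[str] = []
        self.content: list[str] = []

    def feed(self, token: str) -> None:
        if not self.in_reasoning:
            self.content.append(token)     # state persists: later tool-call
            return                         # body tokens carry no markers
        if token == REASONING_END:
            self.in_reasoning = False      # normal close: behavior unchanged
        elif token == TOOL_SECTION_BEGIN:
            self.in_reasoning = False      # implicit close: the fix
            self.content.append(token)     # tool parser must still see it
        else:
            self.thinking.append(token)
```

Feeding a stream where the model skips its end token now lands the tool-call control tokens in the content stream, where the tool call parser can consume them, instead of leaking them into the thinking display.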
Background
The first challenge was a bug. The harder question was what came next: we had a patched model that needed to go back into production, on a live fleet, without interrupting running sessions. That's not a one-off problem - it's the continuous reality of running a managed inference service.
How do you maintain a competitive OSS model catalogue that's performant, relevant, and commercially viable, when every update - bug fix or otherwise - has to ship without downtime and land cleanly across the full quadrant of trade-offs?
Model Size & Precision
Memory footprint is the first constraint. A 70B model at FP16 needs ~140GB of VRAM for weights alone - nearly a full B200 before you've allocated anything for KV cache or activations. Quantization changes the calculus: FP8 gets you near-FP16 quality at half the memory cost and higher throughput. FP4 pushes further but degrades on reasoning-heavy tasks and long-context coherence.
The decision isn't just what precision can we run, but what precision should we run per model per use case. A coding assistant needs high precision on tool-call accuracy. A summarization endpoint is usually fine at INT8.
For example, a 7B parameter model needs roughly 14GB in half precision (FP16) or 7GB in INT8 quantization.
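That rule of thumb is just parameter count times bytes per parameter. A quick back-of-envelope helper for weights only (KV cache, activations, and runtime overhead come on top):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """VRAM for model weights alone: parameters x bytes per parameter.
    Ignores KV cache, activations, and runtime overhead."""
    return params_billion * 1e9 * bytes_per_param / 1e9

weight_vram_gb(7, 2)    # FP16 -> 14.0 GB
weight_vram_gb(7, 1)    # INT8 -> 7.0 GB
weight_vram_gb(70, 2)   # FP16 -> 140.0 GB
```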
Model Type and Configuration
KV cache, context window, and batch size are in direct tension. A 128K context model under high concurrency will exhaust VRAM fast - you're choosing between many short conversations or fewer long ones.
Speculative decoding can deliver 2-3x throughput gains using a small draft model, but adds memory overhead and latency variance. It works well for structured outputs (tool calls, JSON) and less well for high-temperature generation. We're increasingly thinking in terms of deployment profiles rather than a single config per model.
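The context-versus-concurrency tension is easy to quantify: KV cache grows linearly in both sequence length and batch size, storing keys and values for every layer and KV head per token. A sketch with an illustrative 70B-class GQA shape (the dimensions are assumptions, not a specific model's config):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    elements per token, times sequence length, times batch size."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * batch / 1e9

# Four concurrent 128K-token sessions on an assumed 70B-class GQA
# shape (80 layers, 8 KV heads, head_dim 128, FP16 cache):
kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=128_000, batch=4)
# -> ~168 GB: roughly a whole B200 spent on cache alone
```

Halve the context or the batch and the cache halves with it, which is exactly the "many short conversations or fewer long ones" trade-off.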
Model Performance and Benchmarking
Benchmarking inference is harder than it looks, and vLLM makes it harder. The project moves fast - sometimes too fast. New model support often lands in nightly builds weeks before it stabilises into a release. If you want to run the latest models, you're often on nightly.
The parameter space compounds this. Tensor parallelism, chunked prefill, prefix caching, speculative decoding draft model size - the interactions between these are non-obvious and not always well documented. Performance on a given configuration is part engineering, part intuition, and part luck. Everyone who runs inference at scale eventually develops their own set of rituals for benchmarking.
Rather than building a bespoke benchmarking harness, we instrumented an existing one. We use GuideLLM as our primary load testing tool, wrapped with Ansible to enforce reproducibility across runs - same environment, same parameters, same baseline every time. It's not the only approach, and the tooling landscape here moves quickly enough that what's best practice today may look dated in six months. But reproducibility matters more than novelty: the goal is a benchmark you can trust to compare apples to apples across vLLM versions and model configurations.
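One low-tech piece of that reproducibility: fingerprinting the exact parameter set of a run, so results are only compared when every knob matches. A sketch of the idea (the config keys are illustrative, not our actual schema):

```python
import hashlib
import json

def run_fingerprint(config: dict) -> str:
    """Hash the canonicalised benchmark parameters; two runs are
    comparable only when their fingerprints match."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

baseline = {"model": "openai/gpt-oss-120b", "max_tokens": 200, "concurrency": 32}
rerun = {"concurrency": 32, "model": "openai/gpt-oss-120b", "max_tokens": 200}
assert run_fingerprint(baseline) == run_fingerprint(rerun)  # key order is irrelevant
```

Tagging every GuideLLM result with such a fingerprint makes silent configuration drift between runs immediately visible.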
The other thing benchmarks consistently get wrong is the workload. Standard throughput and latency numbers are measured against synthetic request distributions that don't reflect how real customers use the system. In practice, production workloads are friendlier: repeated system prompts get absorbed by prefix caching, agentic sessions with structured outputs benefit from speculative decoding in ways synthetic benchmarks don't capture, and concurrent load patterns tend to be bursty rather than uniform.
GPU Hardware
This is where the axes converge. Fractional GPU allocation lets us co-locate smaller models on a single physical device, improving utilisation - but introduces memory bandwidth contention. You're slicing more than just memory: bandwidth and SM cores are partitioned too, so each tenant operates within hard resource ceilings, not just a VRAM budget. The win is real when load patterns are naturally staggered or when one workload is memory-bound and the other compute-bound.
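A toy admission check captures that kind of slicing, treating memory and compute as separate hard budgets. The tenant names and fractions are made up for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slice:
    """Hard resource ceiling for one tenant on a shared GPU."""
    tenant: str
    mem_fraction: float   # share of VRAM
    sm_fraction: float    # share of compute (SM cores)

def admit(existing: list[Slice], new: Slice) -> bool:
    """Admit a tenant only if both memory and compute budgets still fit."""
    mem = sum(s.mem_fraction for s in existing) + new.mem_fraction
    sm = sum(s.sm_fraction for s in existing) + new.sm_fraction
    return mem <= 1.0 and sm <= 1.0

# A memory-bound embedder pairs well with a compute-bound chat model:
tenants = [Slice("embedder", mem_fraction=0.6, sm_fraction=0.2)]
admit(tenants, Slice("chat-model", mem_fraction=0.3, sm_fraction=0.7))       # True
admit(tenants, Slice("second-embedder", mem_fraction=0.6, sm_fraction=0.2))  # False: VRAM over budget
```

The two-dimensional check is the point: a tenant can be rejected on bandwidth/compute grounds even when VRAM alone would fit.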
The inference space moves at a rapid pace. New models drop, vLLM nightlies introduce fixes and breakages alike, and customer workload patterns shift - all while the GPU fleet runs continuously and SLAs don't pause for maintenance windows.
The two challenges in this post can be read independently - one is a debugging story, one is a release strategy problem - but they're more entangled than they appear. The Kimi K2.5 parser bug didn't just cause intermittent failures: it made the model functionally broken for agentic workloads. And shipping the fix raised exactly the release questions from the second half: does the corrected token routing change latency profiles? Does it interact with prefix caching or speculative decoding in ways that shift throughput?
That's the core tension: every change to the inference stack is simultaneously a correctness decision, a performance decision, and a release risk. The quadrant of trade-offs isn't something you resolve once - it's something you navigate continuously, with each new model version, each vLLM upgrade. What we're building toward is a release process that treats all three dimensions together: functional correctness, performance under realistic load, and safe rollout.
Our 2026 roadmap focuses on three areas:
Multi-zone, multi-region deployment for 99.99% SLA. Today we run in Stockholm. We're expanding to Paris and Frankfurt, with automatic failover across regions.
Silicon agnosticism. NVIDIA dominates today, but we're architecting our orchestration layer to support alternative accelerators as they mature. AMD and European silicon initiatives are all on our radar.
Smarter orchestration. Dynamic workload scheduling, inference delivery networks that route requests to optimal endpoints, better bin-packing algorithms to maximise utilisation, and scaling for audio transcription workloads.
