Prompt caching in Alex: from 0% to 19% hit rate // DRAGOCZ

Alex (an AI assistant for auto shops on CoreSynth) builds a prompt of hundreds of thousands of tokens per request: static modules, tools, conversation history. At an average of ~259k input tokens per run, priced at $5/1M uncached vs. $0.50/1M cached, prompt caching decides whether the project is financially viable.
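To make the stakes concrete, here is a back-of-the-envelope sketch of the input cost per run. The prices and the ~259k average come from the paragraph above; the function name and the constant names are mine, and output tokens are ignored.

```typescript
// Prices in USD per 1M input tokens, as quoted above.
const UNCACHED_USD_PER_MILLION = 5;
const CACHED_USD_PER_MILLION = 0.5;

/** Input cost of one run, given how many of its input tokens were cache hits. */
function inputCostUsd(inputTokens: number, cachedTokens: number): number {
  if (cachedTokens < 0 || cachedTokens > inputTokens) {
    throw new RangeError("cachedTokens must be between 0 and inputTokens");
  }
  const uncachedTokens = inputTokens - cachedTokens;
  return (
    (uncachedTokens * UNCACHED_USD_PER_MILLION +
      cachedTokens * CACHED_USD_PER_MILLION) /
    1_000_000
  );
}

const avgRunTokens = 259_000;

// No cache hits at all: roughly $1.295 of input per run.
const costNoCache = inputCostUsd(avgRunTokens, 0);

// 19% of input tokens served from cache: roughly $1.074 per run.
const costAt19 = inputCostUsd(avgRunTokens, Math.round(avgRunTokens * 0.19));

console.log({ costNoCache, costAt19 });
```

Each cached token costs a tenth of an uncached one, so a hit rate of h lowers the input bill by 0.9 × h.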

OpenAI automatic prefix caching works simply: when the request prefix (tools → instructions → input) stays byte-identical, the API caches it and repeated sections are billed at a tenth of the cost. No special breakpoints, no configuration — just a stable prefix.

We had 0% cache hit rate. Across all 108 runs. Zero.

Diagnosis

I started by auditing the entire pipeline: how the system prompt is built, what changes between requests, what order OpenAI’s Responses API expects (tools → instructions → input).

The main culprit was on the first line:

buildSessionBanner() → "Now: 2026-06-16, 14:32"

The session banner with a timestamp sat at the very beginning of instructions, and buildSystemPrompt() was called again at every agent step with new Date(). Result: the moment the minute ticks over, the prefix changes → cache bust → full price for the entire input.
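A small illustration of why position matters. This is a character-level sketch only (the real cache works on tokens), and STATIC_MODULES, bannerFirst and bannerLast are made-up stand-ins, not the actual Alex code.

```typescript
/** Length of the shared leading run of two strings, in UTF-16 code units. */
function commonPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Hypothetical stand-in for the static .md modules.
const STATIC_MODULES = "You are Alex, an assistant for auto shops.\n".repeat(3);

// Banner first: the layout that caused the cache bust.
const bannerFirst = (now: string): string => `Now: ${now}\n${STATIC_MODULES}`;

// Banner last: same content, volatile part moved to the tail.
const bannerLast = (now: string): string => `${STATIC_MODULES}Now: ${now}\n`;

const t1 = "2026-06-16, 14:32";
const t2 = "2026-06-16, 14:33";

// Only "Now: 2026-06-16, 14:3" is shared: 21 characters.
const sharedBannerFirst = commonPrefixLength(bannerFirst(t1), bannerFirst(t2));

// All static modules are shared, plus the same 21 characters of the banner.
const sharedBannerLast = commonPrefixLength(bannerLast(t1), bannerLast(t2));

console.log({ sharedBannerFirst, sharedBannerLast });
```

With the banner in front, the shared prefix ends inside the timestamp no matter how large the static part is.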

Two secondary culprits:

Cart state: attached to instructions when non-empty, so it changed between steps.
customInstructions: per-user instructions in the prefix. Stable per user, but they blocked cross-user cache sharing.

Fix: freeze the prefix, move dynamics out

The solution is simple to describe, harder to implement: everything that changes must move out of the static prefix and into the current user turn, after it.

BEFORE (prefix changes):
  instructions[0]  = "Now: 14:32, Cart: 3 items, custom: ..."  ← BUST
  instructions[1+] = static .md modules
  tools            = deterministic
  input            = history

AFTER (prefix frozen):
  instructions[0]  = static .md modules                        ← STABLE
  tools            = deterministic
  input            = history + <current_time> + <cart_state> + <custom_instructions>

Volatile context is now assembled via buildVolatileContextXml(), which emits XML blocks at the end of the current user message. Session banner, cart state, custom instructions: all of it now comes after the prefix. The prefix is byte-identical across requests, steps, and users.
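Roughly what such a builder could look like. Only the function name buildVolatileContextXml and the three tag names come from this post; the VolatileContext shape, the escaping helper and withVolatileTail are assumptions made for the sketch, not the production code.

```typescript
// Hypothetical input shape; the real Alex types are not shown in this post.
interface VolatileContext {
  now: Date;
  cartItems: string[];
  customInstructions?: string;
}

/** Escape characters that would otherwise break the XML-like blocks. */
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

/** Render everything that changes between requests as XML blocks. */
function buildVolatileContextXml(ctx: VolatileContext): string {
  const blocks: string[] = [
    `<current_time>${ctx.now.toISOString()}</current_time>`,
  ];
  if (ctx.cartItems.length > 0) {
    const items = ctx.cartItems
      .map((item) => `<item>${escapeXml(item)}</item>`)
      .join("");
    blocks.push(`<cart_state>${items}</cart_state>`);
  }
  if (ctx.customInstructions) {
    blocks.push(
      `<custom_instructions>${escapeXml(ctx.customInstructions)}</custom_instructions>`,
    );
  }
  return blocks.join("\n");
}

/** Append the volatile blocks to the tail of the current user message. */
function withVolatileTail(userMessage: string, ctx: VolatileContext): string {
  return `${userMessage}\n\n${buildVolatileContextXml(ctx)}`;
}
```

Because the blocks are appended to the newest user message, nothing before that message changes when the clock, the cart, or the user changes.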

Results

BEFORE:  108 runs, 28M input tokens, 0 cached   → hit rate 0%
AFTER:   5 of 5 tested scenarios hit the cache  → hit rate 5–19%

The proxy passes cache usage through correctly, so the 0% baseline was caused by the volatile prefix, not by the proxy. After the fix, cached_tokens started appearing in all tested conversations.

What I learned

Prefix caching isn’t magic. OpenAI does it automatically, but only if you give it a stable prefix. A single timestamp at position 0 that changes between requests kills the cache for the entire request.

Measure before and after. Without instrumenting cache_read_input_tokens, I’d never know if the fix worked. I added tracking to the database (schema migration + logging) enabling baseline vs. post-fix comparison.
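A minimal sketch of the measurement side. The Responses API reports cached input tokens under usage.input_tokens_details.cached_tokens; storing that value under the name cache_read_input_tokens follows the metric name used in this post. The row shape and both function names are assumptions, and the actual Prisma schema is not shown here.

```typescript
// Subset of the usage object returned with a Responses API result.
interface ResponseUsage {
  input_tokens: number;
  input_tokens_details?: { cached_tokens?: number };
}

// Hypothetical per-request row; the real table layout is not shown in this post.
interface RunTokenRow {
  input_tokens: number;
  cache_read_input_tokens: number;
}

/** Map API usage to the stored row, treating a missing field as zero hits. */
function toRow(usage: ResponseUsage): RunTokenRow {
  return {
    input_tokens: usage.input_tokens,
    cache_read_input_tokens: usage.input_tokens_details?.cached_tokens ?? 0,
  };
}

/** Token-weighted hit rate: cached input tokens divided by all input tokens. */
function cacheHitRate(rows: readonly RunTokenRow[]): number {
  let input = 0;
  let cached = 0;
  for (const row of rows) {
    input += row.input_tokens;
    cached += row.cache_read_input_tokens;
  }
  return input === 0 ? 0 : cached / input;
}
```

Weighting by tokens rather than averaging per-request percentages keeps one small fully cached request from skewing the number.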

Dynamics belong at the end. The golden rule for LLM caching: everything static at the front, everything dynamic at the back. Not just timestamps — also cart state, per-user instructions, conversation context.
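One way to keep the rule from regressing is a guard that builds the prefix for several different contexts and reports where the outputs first diverge. This is a sketch under assumptions: firstDivergence, StepContext and the two example builders are hypothetical names, not part of the Alex codebase.

```typescript
/**
 * Build the prefix once per context and report the index where an output
 * first differs from the first one. Returns null when all are identical.
 */
function firstDivergence<C>(
  buildPrefix: (ctx: C) => string,
  contexts: readonly C[],
): number | null {
  const outputs = contexts.map((ctx) => buildPrefix(ctx));
  const reference = outputs[0];
  if (reference === undefined) return null;
  for (const candidate of outputs) {
    if (candidate === reference) continue;
    const limit = Math.min(reference.length, candidate.length);
    let i = 0;
    while (i < limit && reference.charCodeAt(i) === candidate.charCodeAt(i)) i++;
    return i;
  }
  return null;
}

// Hypothetical per-step context: different time, different user.
interface StepContext {
  time: string;
  userId: string;
}

const stepContexts: StepContext[] = [
  { time: "14:32", userId: "u1" },
  { time: "14:33", userId: "u2" },
];

const STATIC_PROMPT = "static modules go here";

// Frozen prefix: ignores the context entirely.
const frozenPrefix = (_ctx: StepContext): string => STATIC_PROMPT;

// Leaky prefix: a timestamp sneaks in at position 0.
const leakyPrefix = (ctx: StepContext): string => `Now: ${ctx.time}\n${STATIC_PROMPT}`;

const frozenDrift = firstDivergence(frozenPrefix, stepContexts); // null
const leakyDrift = firstDivergence(leakyPrefix, stepContexts); // 9

console.log({ frozenDrift, leakyDrift });
```

A check like this can sit in the test suite, so that any new dynamic value slipping into the prefix fails a test instead of silently raising the bill.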

Tests must pass. 322 tests, lint, TypeScript check — all clean. Prompt caching is a core feature, not a “quick optimization.”

Tech details

PATTERN   → OpenAI automatic prefix caching (Responses API)
FIX       → Volatile context from instructions to user message tail
METRICS   → cache_read_input_tokens (schema migration + DB)
TOOLS     → TypeScript, Prisma, OpenAI Responses API
RESULT    → 0% → 5–19% hit rate, ~90% cost reduction on cached tokens

The project runs on OpenAI GPT-5.5 through the CoreSynth proxy. The goal isn’t just to save money — it’s about making the AI agent financially sustainable even with a massive system prompt and high throughput.