Alex (an AI assistant for auto shops on CoreSynth) generates a system prompt of hundreds of thousands of tokens per request: static modules, tools, conversation history. With an average of ~259k input tokens per run and pricing at $5/1M uncached vs $0.5/1M cached, prompt caching decides whether the project is financially viable.
OpenAI's automatic prefix caching is simple: when the request prefix (tools → instructions → input) stays byte-identical, the API caches it and the repeated part is billed at a tenth of the price. No special breakpoints, no configuration, just a stable prefix.
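To make the pricing concrete, here is a minimal sketch of the input cost of one run at the two prices quoted above. The helper name inputCostUsd is hypothetical, and the fully cached case is only a theoretical bound, since the newest turn of a request is never in the cache.

```typescript
// Prices quoted in the article, in USD per 1M input tokens.
const UNCACHED_PER_MILLION = 5;
const CACHED_PER_MILLION = 0.5;

// Hypothetical helper: input cost of one run, given how many of its
// input tokens were served from the cache.
function inputCostUsd(inputTokens: number, cachedTokens: number): number {
  if (cachedTokens < 0 || cachedTokens > inputTokens) {
    throw new RangeError("cachedTokens must be between 0 and inputTokens");
  }
  const uncachedTokens = inputTokens - cachedTokens;
  return (
    (uncachedTokens * UNCACHED_PER_MILLION + cachedTokens * CACHED_PER_MILLION) /
    1_000_000
  );
}

const averageRunTokens = 259_000; // average input tokens per run, from the article

// No cache hits at all (the 0% baseline).
console.log(inputCostUsd(averageRunTokens, 0).toFixed(4));
// Theoretical lower bound: every input token cached.
console.log(inputCostUsd(averageRunTokens, averageRunTokens).toFixed(4));
```

The gap between those two numbers is the most caching could ever save per run; the real saving scales with the share of input tokens that actually hit the cache.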
We had a 0% cache hit rate. Across all 108 runs. Zero.
Diagnosis
I started by auditing the entire pipeline: how the system prompt is built, what changes between requests, what order OpenAI’s Responses API expects (tools → instructions → input).
The main culprit was on the first line:
buildSessionBanner() → "Now: 2026-06-16, 14:32"
The session banner with a timestamp was at the very beginning of instructions. And buildSystemPrompt() was called again at every agent step with new Date(). Result: the moment a minute ticks over, the prefix changes → cache bust → full price for the entire input.
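The effect of a leading timestamp can be illustrated with a small sketch that measures how much of two prompts, built one minute apart, is identical from the start. Everything here is a stand-in: staticModules, bannerFirst, and bannerLast are hypothetical, and the comparison is per character, whereas the API matches on tokens.

```typescript
// Length of the leading run of characters that two strings share.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Stand-in for the static .md modules (hypothetical content).
const staticModules = "You are Alex, an assistant for auto shops.\n".repeat(100);

// Banner-first layout, as in the original bug.
const bannerFirst = (now: string): string => `Now: ${now}\n${staticModules}`;
// Banner-last layout: volatile text follows the static part.
const bannerLast = (now: string): string => `${staticModules}Now: ${now}\n`;

const t1 = "2026-06-16, 14:32";
const t2 = "2026-06-16, 14:33";

// Banner first: the shared prefix ends inside the timestamp.
console.log(sharedPrefixLength(bannerFirst(t1), bannerFirst(t2)));
// Banner last: the shared prefix covers all of the static modules.
console.log(sharedPrefixLength(bannerLast(t1), bannerLast(t2)), staticModules.length);
```

With the banner first, the two prompts diverge after a couple of dozen characters, so nothing after that point can be reused, no matter how large the static part is.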
Two secondary culprits:
- Cart state: attached to instructions when non-empty. Changed between steps.
- customInstructions: per-user instructions in the prefix. Stable per user, but blocked cross-user cache sharing.
Fix: freeze the prefix, move dynamics out
The solution is simple to describe, harder to implement: everything that changes must move out of the static prefix and into the current user turn, after everything that stays the same.
BEFORE (prefix changes):
instructions[0] = "Now: 14:32, Cart: 3 items, custom: ..." ← BUST
instructions[1+] = static .md modules
tools = deterministic
input = history
AFTER (prefix frozen):
instructions[0] = static .md modules ← STABLE
tools = deterministic
input = history + <current_time> + <cart_state> + <custom_instructions>
Volatile context is now assembled via buildVolatileContextXml(), which emits XML blocks at the end of the current user message. Session banner, cart state, custom instructions: all of it now comes after the prefix. The prefix is byte-identical across requests, steps, and users.
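The article does not show the body of buildVolatileContextXml(), so the following is only a sketch of what such a function might look like. The VolatileContext shape, the escapeXml and withVolatileTail helpers, and the item element are assumptions; only the three outer tag names come from the diagram above.

```typescript
// Hypothetical shape of the per-request volatile data; the real type is not shown.
interface VolatileContext {
  now: Date;
  cartItems: string[];
  customInstructions?: string;
}

// Escape characters that would otherwise break the XML-like wrapper.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Assumed implementation: one block per volatile piece, empty pieces omitted.
function buildVolatileContextXml(ctx: VolatileContext): string {
  const blocks: string[] = [
    `<current_time>${ctx.now.toISOString()}</current_time>`,
  ];
  if (ctx.cartItems.length > 0) {
    const items = ctx.cartItems
      .map((item) => `<item>${escapeXml(item)}</item>`)
      .join("");
    blocks.push(`<cart_state>${items}</cart_state>`);
  }
  if (ctx.customInstructions) {
    blocks.push(
      `<custom_instructions>${escapeXml(ctx.customInstructions)}</custom_instructions>`,
    );
  }
  return blocks.join("\n");
}

// Append the volatile blocks to the end of the current user message,
// leaving instructions and earlier history untouched.
function withVolatileTail(userMessage: string, ctx: VolatileContext): string {
  return `${userMessage}\n\n${buildVolatileContextXml(ctx)}`;
}
```

Because the blocks are appended only to the newest message, earlier turns keep their exact bytes and stay eligible for cache reuse on the next step.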
Results
BEFORE: 108 runs, 28M input tokens, 0 cached → hit rate 0%
AFTER: all 5 tested scenarios showed cache hits → hit rate 5–19%
The proxy does not interfere with caching: the 0% baseline was caused by the volatile prefix, not by a proxy issue. After the fix, cached_tokens started appearing in all tested conversations.
What I learned
Prefix caching isn’t magic. OpenAI does it automatically, but only if you give it a stable prefix. A single timestamp at position 0 kills the cache for the entire request.
Measure before and after. Without instrumenting cache_read_input_tokens, I would never have known whether the fix worked. I added tracking to the database (schema migration + logging), which makes a baseline vs. post-fix comparison possible.
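A minimal sketch of that measurement, assuming the hit rate is defined as cached input tokens divided by total input tokens. The field names input_tokens and input_tokens_details.cached_tokens follow the usage object of the OpenAI Responses API; whether the proxy preserves that exact shape, and how it maps to the cache_read_input_tokens column, is an assumption. UsageLike, CacheStats, and summarizeCacheUsage are hypothetical names.

```typescript
// Subset of a usage object, using Responses API field names.
interface UsageLike {
  input_tokens: number;
  input_tokens_details?: { cached_tokens?: number };
}

interface CacheStats {
  inputTokens: number;
  cachedTokens: number;
  hitRate: number; // cachedTokens / inputTokens, in the range [0, 1]
}

// Aggregate usage across runs; a missing cached_tokens field counts as 0.
function summarizeCacheUsage(runs: UsageLike[]): CacheStats {
  let inputTokens = 0;
  let cachedTokens = 0;
  for (const usage of runs) {
    inputTokens += usage.input_tokens;
    cachedTokens += usage.input_tokens_details?.cached_tokens ?? 0;
  }
  const hitRate = inputTokens === 0 ? 0 : cachedTokens / inputTokens;
  return { inputTokens, cachedTokens, hitRate };
}

// Example with made-up numbers: one run with a partial hit, one with none.
const stats = summarizeCacheUsage([
  { input_tokens: 1000, input_tokens_details: { cached_tokens: 100 } },
  { input_tokens: 1000 },
]);
console.log(`${(stats.hitRate * 100).toFixed(1)}% of input tokens were cached`);
```

Running the same aggregation over the runs before and after a change gives the kind of baseline vs. post-fix comparison described above.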
Dynamics belong at the end. The golden rule for LLM caching: everything static at the front, everything dynamic at the back. Not just timestamps — also cart state, per-user instructions, conversation context.
Tests must pass. 322 tests, lint, TypeScript check — all clean. Prompt caching is a core feature, not a “quick optimization.”
Tech details
PATTERN → OpenAI automatic prefix caching (Responses API)
FIX → Volatile context from instructions to user message tail
METRICS → cache_read_input_tokens (schema migration + DB)
TOOLS → TypeScript, Prisma, OpenAI Responses API
RESULT → 0% → 5–19% hit rate, ~90% cost reduction on cached tokens
The project runs on OpenAI GPT-5.5 through the CoreSynth proxy. The goal isn’t just to save money — it’s about making the AI agent financially sustainable even with a massive system prompt and high throughput.