
Your CFO Will Find the AI Cost Problem Before Your CTO Does.

Your AI rollout has a cost problem. Prompt caching is one of the engineering primitives you skipped.

May 13, 2026 · 5 min read

You approved the AI rollout. Engineers shipped the first agentic workflow. Inference costs came in at three to five times what the budget modeled. Now your CFO is asking questions the product team cannot answer with a straight face. The problem is not the model you selected. The problem is that nobody treated the prompt as an engineering artifact.

What the data says

Mid-market production deployments are hitting the same ceiling in sequence. First, a pilot runs inside token budget because prompt volume is low and developers are hand-tuning inputs. Second, the workflow scales to real users or real data cadences. Third, inference costs spike, not linearly but multiplicatively, because every call is reconstructing the same context from scratch.

Three patterns hold across agentic builds:

Context reconstruction is the silent budget killer. In a well-instrumented agentic system, 60 to 80 percent of the tokens sent per call are static: system instructions, tool schemas, knowledge-base preambles, policy constraints, output format specifications. When none of that content is cached, you are paying full input pricing on material that does not change between calls. On an assistant that handles 10,000 requests per day with a 4,000-token system prompt, that is 40 million tokens of potentially avoidable spend — daily.
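The arithmetic is worth running against your own numbers. A minimal back-of-envelope sketch, using the figures above plus an assumed per-token price and cached-read discount (both illustrative, not any provider's published rates):

```python
# Back-of-envelope estimate of avoidable spend from uncached static context.
# Price and discount are illustrative assumptions, not provider rates.
REQUESTS_PER_DAY = 10_000
STATIC_PROMPT_TOKENS = 4_000        # system instructions, tool schemas, policies
PRICE_PER_M_INPUT_TOKENS = 3.00     # assumed USD per million input tokens
CACHED_READ_DISCOUNT = 0.90         # assumed: cached reads cost 10% of full price

static_tokens_per_day = REQUESTS_PER_DAY * STATIC_PROMPT_TOKENS      # 40,000,000
full_price = static_tokens_per_day / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
cached_price = full_price * (1 - CACHED_READ_DISCOUNT)

print(f"Static tokens per day:    {static_tokens_per_day:,}")
print(f"Daily cost, never cached: ${full_price:,.2f}")
print(f"Daily cost, cache hits:   ${cached_price:,.2f}")
```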

Caching configuration varies by platform — and most still require operator decisions. Inference platforms have diverged on their default behavior. Some apply prompt caching automatically once the system prompt crosses a token threshold; calls already in flight benefit from a meaningful discount on cache hits whether or not the team thought about it. Others require explicit opt-in — operators mark which prompt segments are cacheable and absorb the cold-start cost when invalidation happens. Even on auto-cached platforms, the cache hit rate is still operator-controlled through prompt structure. Interleaving dynamic content into a static prefix breaks cache eligibility regardless of which platform you are on. The discipline is not "turn caching on." The discipline is structuring the prompt so the cache that already exists, or the cache you opted into, actually fires.
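For the opt-in platforms, here is a minimal sketch of what marking the static prefix cacheable can look like. The field name cache_control follows one provider's convention; treat the payload shape as illustrative and confirm against your platform's documentation:

```python
# Sketch of an opt-in prompt-caching request. Field names vary by platform;
# "cache_control" follows one provider's convention and is illustrative here.
STATIC_SYSTEM_PROMPT = "System instructions, tool schemas, policy constraints..."
user_question = "Per-request user input goes here."

request = {
    "model": "example-model",                        # placeholder model name
    "system": [
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,            # the stable prefix
            "cache_control": {"type": "ephemeral"},  # mark it cacheable
        }
    ],
    "messages": [
        # Dynamic, per-request content stays after the cacheable prefix.
        {"role": "user", "content": user_question},
    ],
}
```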

Agentic systems compound the exposure. A single-turn chatbot has one prompt per session. An agentic system — one with memory retrieval, tool invocation, multi-step planning — may send eight to fifteen model calls to complete a single user task. Each call carries its own system context. If caching is absent at the orchestration layer, every sub-agent call reconstructs in full. The cost is not additive; it is multiplicative by the agent graph depth.
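A quick sketch of that multiplication, with an illustrative task volume and call count per task:

```python
# How agent graph depth multiplies uncached static context. Figures are illustrative.
STATIC_PROMPT_TOKENS = 4_000
TASKS_PER_DAY = 2_000
CALLS_PER_TASK = 12          # memory retrieval, tool invocations, planning steps

single_turn = TASKS_PER_DAY * STATIC_PROMPT_TOKENS                  # 8,000,000
agentic = TASKS_PER_DAY * CALLS_PER_TASK * STATIC_PROMPT_TOKENS     # 96,000,000

print(f"Single-turn static tokens per day: {single_turn:,}")
print(f"Agentic static tokens per day:     {agentic:,}")
```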

The intervention

The operator playbook is not a caching tutorial. It is a sequencing discipline applied before the first production call is written.

Step one: Classify your prompt content by volatility. Separate every component of your prompt into two bins — static (changes less than once per week) and dynamic (changes per request, per user, or per session). System instructions, persona definitions, output schemas, and tool specifications almost always land in static. User input, retrieved context, and session state land in dynamic. This classification takes two to four hours per workflow. It is not done in most organizations before engineering begins.
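The output of this step can be as simple as a small inventory checked into the repo. A sketch, with hypothetical component names:

```python
# Volatility classification for one workflow. Component names are hypothetical;
# the two bins are the actual deliverable of step one.
PROMPT_COMPONENTS = {
    "static": [              # changes less than once per week: the cacheable prefix
        "system_instructions",
        "persona_definition",
        "tool_schemas",
        "output_format_spec",
        "policy_constraints",
    ],
    "dynamic": [             # changes per request, user, or session: after the prefix
        "user_input",
        "retrieved_context",
        "session_state",
    ],
}
```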

Step two: Anchor the cache boundary at the static/dynamic seam. Structure your prompts so that all static content precedes all dynamic content. This is the structural requirement for cache hits. If your orchestration layer inserts dynamic context into the middle of a static system prompt — because that is what felt natural to write — you have broken cache eligibility for everything that follows. The fix is architectural, not cosmetic, and it is significantly more expensive to retrofit than to build correctly.
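A minimal sketch of the assembly rule, assuming the classification from step one:

```python
# Assemble the prompt so every static section precedes every dynamic section.
# One per-request value spliced into the static prefix breaks prefix matching
# for everything that follows it.
def build_prompt(static_sections: list[str], dynamic_sections: list[str]) -> str:
    return "\n\n".join(static_sections + dynamic_sections)

# Anti-pattern (illustrative): a timestamp inside the "static" prefix invalidates
# the cached prefix on every call.
# prompt = instructions + f"Today is {date.today()}" + tool_schemas + user_input
```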

Step three: Instrument before you optimize. Add cache hit rate and cache miss rate to your inference observability dashboard before you declare the workflow production-grade. If you do not measure it, it will not be optimized. A 70 percent cache hit rate on a 4,000-token system prompt is the difference between a sustainable AI operating model and a budget escalation in Q3.
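A sketch of the metric itself, assuming your logs expose per-call token counters; the field names input_tokens and cached_input_tokens are placeholders to map onto whatever your platform actually reports:

```python
# Cache hit rate as cached input tokens over total input tokens.
# Field names are placeholders; map them to your platform's usage metadata.
def cache_hit_rate(calls: list[dict]) -> float:
    cached = sum(c.get("cached_input_tokens", 0) for c in calls)
    total = sum(c.get("input_tokens", 0) for c in calls)
    return cached / total if total else 0.0

calls = [
    {"input_tokens": 4_600, "cached_input_tokens": 4_000},  # warm cache
    {"input_tokens": 4_600, "cached_input_tokens": 0},      # cold start after invalidation
]
print(f"Cache hit rate: {cache_hit_rate(calls):.0%}")       # ~43% across these two calls
```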

Step four: Align your batch window with the cache ceiling — and own the cold-start cost when you invalidate. Major inference platforms cap prompt-cache lifetimes in the five-minute-to-one-hour range. Some are provider-managed and short; some let operators request a longer window for an opt-in cost. Twenty-four-hour caches are not on the menu for prompt caching anywhere. The discipline is not "match caching to your prompt update cadence" — it is "fit your batch window inside the caching ceiling so a single cache write amortizes across the entire batch." If your engineering team iterates the system prompt weekly, accept that every revision triggers a cache cold-start on the next batch; the engineering decision is whether to time prompt updates against the batch cadence so the cold-start happens at a known boundary, or to absorb it as ambient cost. These are not set-and-forget values. They belong in your deployment configuration alongside every other infrastructure parameter.
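A sketch of the fit check, with illustrative batch size, per-call latency, and cache window:

```python
# Does the batch finish inside the cache window, so one cache write amortizes
# across the whole run? All figures are illustrative.
CACHE_WINDOW_MINUTES = 60      # assumed operator-requested extended window
BATCH_SIZE = 500               # calls in the batch
SECONDS_PER_CALL = 6           # average sequential latency per call

batch_minutes = BATCH_SIZE * SECONDS_PER_CALL / 60
fits = batch_minutes <= CACHE_WINDOW_MINUTES
print(f"Batch runtime: {batch_minutes:.0f} min, fits inside cache window: {fits}")
# If it does not fit: shrink the batch, parallelize, or accept a second cache
# write (a cold start) partway through the run.
```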

Where this breaks

Prompt caching does not help if your prompts are genuinely dynamic throughout — if every token sent to the model is specific to that request. Some narrow-context, high-personalization workflows fit this description. If your system prompt is under 1,000 tokens and changes with every call, caching yields minimal return.

Caching also does not substitute for prompt design discipline. A poorly structured prompt that is efficiently cached is still a poorly structured prompt. The optimization sequence matters: design for correctness first, then structure for cache eligibility, then instrument. Reversing that order produces fast, cheap, wrong outputs.

Finally, caching decisions made at the individual workflow level create governance debt at the program level. When ten teams are each making independent caching configuration decisions, you do not have an AI operating model — you have ten local optimizations with no aggregate visibility. The cost ceiling you are hitting is often a coordination failure, not a technical one.


The 30-day diagnostic is straightforward: pull your inference logs, calculate the ratio of cached to uncached tokens per workflow, and identify the three highest-volume agentic calls. That measurement will tell you whether you have an engineering problem, a design problem, or a governance problem — and the answer determines what you fix first. If your observability layer does not currently expose cache hit rate, that is the first week's work.
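A sketch of that diagnostic over exported logs. The row schema here (workflow, input_tokens, cached_input_tokens) is an assumption; adapt it to whatever your observability layer actually exports:

```python
# Per-workflow cached/uncached token ratio, plus the three highest-volume workflows.
# The log row schema is an assumption; adapt it to your own export format.
from collections import defaultdict

def diagnose(log_rows: list[dict]) -> None:
    totals = defaultdict(lambda: {"input": 0, "cached": 0, "calls": 0})
    for row in log_rows:
        t = totals[row["workflow"]]
        t["input"] += row["input_tokens"]
        t["cached"] += row["cached_input_tokens"]
        t["calls"] += 1
    top_three = sorted(totals.items(), key=lambda kv: kv[1]["input"], reverse=True)[:3]
    for name, t in top_three:
        ratio = t["cached"] / t["input"] if t["input"] else 0.0
        print(f"{name}: {t['calls']} calls, cached/total tokens = {ratio:.0%}")
```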


At Northbeam, our own content engine runs weekly batches for exactly this reason. Generating each scheduled brief at its individual publication date — rather than amortizing across a single batch window inside the extended cache ceiling — would increase our input-token spend by roughly eighty percent. The discipline is not abstract for us. We measured what the cold-cache cost looks like across our own publication pipeline and chose the batch cadence that matched the cache windows. That is what the engineering primitive looks like in production.

Bill Tennant is Founder & Principal of Northbeam Solutions. Northbeam embeds alongside business and technical teams, builds production-grade AI work with their people, and leaves behind the engineering rigor, governance, and capability that separate AI operational efficiency from AI experimentation.