If you have built any substantial application on Claude's API, you know the billing reality: every token counts. For applications that need to provide Claude with a large system prompt, a lengthy knowledge base, or a detailed context document on every API call, the input token costs accumulate fast. A 50,000-token system prompt sent 100 times a day is 5 million tokens daily — at Sonnet 4.5 prices, that is $15 per day just in system prompt tokens, before any user messages or outputs are counted. Prompt caching changes this math dramatically.
The feature, now generally available across all Claude models in the Anthropic API, allows developers to mark specific parts of their prompt as cacheable. When the same cacheable content appears in subsequent requests, it is served from cache rather than processed again. Cache reads are billed at 10% of the standard input token price. For applications where a large, stable context forms the majority of each prompt, this translates directly to a cost reduction of up to 90% on input tokens.
How Prompt Caching Works
The implementation is straightforward. In the API request, developers add a cache_control parameter with type: "ephemeral" to a content block; everything up to and including that block becomes the cacheable prefix. Anthropic's infrastructure stores a compressed representation of the processed key-value pairs for that prefix. On the next request whose prompt begins with the same prefix, the model skips reprocessing those tokens and reads directly from the cache. The cache persists for 5 minutes by default, refreshed on each cache hit, which makes it well suited to applications with regular traffic patterns.
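Concretely, with the official anthropic Python SDK, the marker goes on the last content block of the stable prefix. Here is a minimal sketch; the model id, file name, and prompt text are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The large, stable document to cache (file name is illustrative)
legal_template = open("legal_template.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # any model that supports prompt caching
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a contract-analysis assistant."},
        {
            "type": "text",
            "text": legal_template,
            # Everything up to and including this block is cached
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "Summarize the termination clauses."}
    ],
)
print(response.content[0].text)
```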
Prompt Caching: Quick Reference
Add to any content block in your API request:
{"type": "text", "text": "...your context...",
"cache_control": {"type": "ephemeral"}}
- Cache read price: 10% of the standard input rate
- Cache write price: 25% premium on the first write (125% of the input rate)
- Cache TTL (default): 5 minutes, refreshed on each hit
- Minimum cacheable tokens: 1,024
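It is worth verifying that the cache is actually being hit. Each response's usage block breaks out cache activity separately from fresh input tokens; a short sketch, reusing the response object from the example above:

```python
usage = response.usage
print(f"fresh input tokens: {usage.input_tokens}")
print(f"cache write tokens: {usage.cache_creation_input_tokens}")  # billed at 125%
print(f"cache read tokens:  {usage.cache_read_input_tokens}")      # billed at 10%
```

On the first request the write counter is populated; on subsequent requests within the TTL it drops to zero and the read counter takes over.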
"We had a document analysis pipeline that was sending a 40,000-token legal template on every request. Enabling prompt caching took our daily API bill from $420 to $47. That is a 90% reduction with literally three lines of code changed." — Developer, legal tech startup
Real-World Savings
The highest-value use cases are those where a large, stable document forms the majority of every prompt. Document analysis products — where Claude reads the same contract, policy document, or technical specification dozens of times to answer different questions — are the most obvious fit. Knowledge-base chatbots that inject the same FAQ document or product documentation into every conversation also see dramatic savings. Code analysis tools that load the same codebase context for every developer query are another strong candidate.
There is a nuance worth understanding: the first request that populates a cache entry is charged at 125% of the standard input token rate (a 25% premium for the write). Every subsequent cache hit costs only 10%. Breakeven arrives by the second request: one write at 125% plus one read at 10% comes to 135% of the base rate, versus 200% for two uncached requests. Any application with a cache hit rate above 50% (which is essentially every application with a stable, reusable context) comes out substantially ahead. Applications with 90%+ cache hit rates, common for customer support bots or document Q&A tools, see near-maximum savings.
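The same arithmetic extends to any hit rate. A back-of-the-envelope sketch (the function is ours, not part of any SDK; the multipliers are the documented 125% write and 10% read rates):

```python
def effective_input_cost(hit_rate: float) -> float:
    """Average input-token cost multiplier vs. the uncached rate.

    Cache misses pay the 1.25x write rate; hits pay the 0.10x read rate.
    """
    WRITE, READ = 1.25, 0.10
    return (1 - hit_rate) * WRITE + hit_rate * READ

for h in (0.50, 0.90, 0.99):
    print(f"{h:.0%} hit rate -> {effective_input_cost(h):.3f}x input rate")
# 50% hit rate -> 0.675x input rate
# 90% hit rate -> 0.215x input rate
# 99% hit rate -> 0.112x input rate
```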
Latency also improves. Cache reads are significantly faster than full token processing, meaning applications with large cached prefixes see 30–50% improvements in time-to-first-token on cache hits. For interactive applications where response latency matters for user experience, this is a meaningful secondary benefit beyond cost savings.
Prompt caching is available now on all Claude models via the Anthropic API. It is also supported on Amazon Bedrock for users running Claude through AWS infrastructure. The feature requires no configuration changes beyond adding the cache_control field to cacheable content blocks — the API handles all cache management automatically. Full documentation is available on the Anthropic developer portal.