Anthropic Prompt Caching in Rails: Cut Claude API Costs with the Anthropic Ruby SDK

Roger Heykoop
Ruby on Rails, AI
Anthropic prompt caching in Rails: cut Claude API costs up to 90% using the Anthropic Ruby SDK. Production patterns, traps and real numbers from the field.

A SaaS client called me on a Friday afternoon because their Anthropic bill had hit forty-two thousand dollars for the month and it was only the eighteenth. They had shipped a Claude-powered support assistant six weeks earlier, traffic had quadrupled, and every single request was sending the same eighteen-thousand-token system prompt and knowledge base preamble. We turned on Anthropic prompt caching that afternoon. The next bill was under five thousand.

After nineteen years of Rails I have done a lot of “this should be cheaper” engineering, and Anthropic prompt caching is the single highest-leverage optimisation I know of for production LLM apps. If you are using the Anthropic Ruby SDK and not caching, you are setting money on fire. This is the production playbook.

What Anthropic Prompt Caching Actually Is

Every Claude API call has an input prompt — system message, tool definitions, conversation history, retrieved documents — and an output completion. Most of that input is identical across requests. The system prompt does not change between users. The tool schemas do not change between calls. The thirty-page policy document you stuff into context does not change for a week.

Anthropic prompt caching lets you mark sections of your input as cacheable. The first request that includes those sections pays a write premium of 1.25x the base input price. Every subsequent request that hits the cache pays only 0.1x the base input price for those tokens. That is a 90% discount on cached input.

The default cache lifetime is five minutes from the last hit, refreshed on each read. There is also a one-hour cache option for prompts you know will stay hot. For a chatbot serving steady traffic, a five-minute TTL covers essentially every request after the first.

The mechanism is a single field on a content block:

{
  type: "text",
  text: long_system_prompt,
  cache_control: { type: "ephemeral" }
}

Anthropic hashes the cumulative prefix up to and including each cache_control marker. If the hash matches a live cache entry, you get a cache hit. If not, you write a new entry. The cache key includes the model, so changing models invalidates everything.

When Prompt Caching Pays Off (and When It Doesn’t)

The math is simple. Cache writes cost 1.25x the base input price, cache reads cost 0.1x. A cached prefix pays for itself after a single reuse: two uncached calls cost 2.0x, while one write plus one read costs 1.35x, and every further read within the TTL saves another 0.9x. Anything you reuse even a handful of times within five minutes is a candidate.
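
A back-of-the-envelope sketch of that break-even, using an illustrative Sonnet-class input price of $3 per million tokens (plug in your own model's pricing):

# Input cost for a prefix of `tokens` sent `n` times, with and without caching.
# The $3-per-million-token price is illustrative only; check current pricing.
base_per_token = 3.0 / 1_000_000

uncached = ->(tokens, n) { tokens * base_per_token * n }
cached   = ->(tokens, n) { tokens * base_per_token * (1.25 + 0.1 * (n - 1)) }

uncached.call(18_000, 100).round(2) # => 5.4
cached.call(18_000, 100).round(2)   # => 0.6, roughly a 90% saving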

In a Rails app these are the workloads that benefit:

  • Customer-facing chatbots with a fixed system prompt and tool definitions
  • RAG endpoints that pin a large reference document or retrieved context per session
  • Code review or analysis bots that include a repository snapshot
  • Multi-turn assistants where conversation history grows with each turn
  • Batch classification or extraction jobs that share the same instructions across thousands of items

Where it does not pay off: one-shot calls with unique prompts every time, very short prompts (the minimum cacheable size is 1024 tokens for Sonnet/Opus and 2048 for Haiku), and any workflow where the prefix changes meaningfully on every call. If you are below the minimum, caching does nothing — your cache_control markers are silently ignored.
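
If you are not sure whether a prompt clears the minimum, the token-counting endpoint gives a quick answer before you ship. A rough sketch, assuming your SDK version exposes messages.count_tokens:

# Pre-flight check that the system prompt clears the caching minimum.
# Assumes the SDK exposes the token-counting endpoint as messages.count_tokens;
# the 1024-token threshold applies to Sonnet/Opus, 2048 to Haiku.
count = ANTHROPIC.messages.count_tokens(
  model: "claude-sonnet-4-6",
  system: [{ type: "text", text: SYSTEM_PROMPT }],
  messages: [{ role: "user", content: "ping" }]
)

Rails.logger.warn("System prompt is below the cacheable minimum") if count.input_tokens < 1024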

Setting Up Anthropic Prompt Caching in Rails with the Anthropic Ruby SDK

The official Anthropic Ruby SDK supports prompt caching natively. Add it to your Gemfile:

# Gemfile
gem "anthropic", "~> 1.0"

Configure the client with credentials from Rails credentials, not environment variables in production code:

# config/initializers/anthropic.rb
ANTHROPIC = Anthropic::Client.new(
  api_key: Rails.application.credentials.dig(:anthropic, :api_key)
)

I covered the broader patterns of integrating Claude in Rails — streaming, tools, RAG — in earlier posts. If you want context, the Rails RAG with Claude and pgvector guide and the LLM function calling in Rails post pair well with this one.

Production Pattern 1: Caching the System Prompt

The single highest-leverage move. Wrap your system prompt in a cached block:

class Support::AssistantService
  SYSTEM_PROMPT = File.read(Rails.root.join("config/prompts/support_v3.md")).freeze

  def reply(conversation:, user_message:)
    ANTHROPIC.messages.create(
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      system: [
        {
          type: "text",
          text: SYSTEM_PROMPT,
          cache_control: { type: "ephemeral" }
        }
      ],
      messages: conversation_messages(conversation, user_message)
    )
  end

  private

  def conversation_messages(conversation, user_message)
    conversation.turns.map { |t| { role: t.role, content: t.content } } +
      [{ role: "user", content: user_message }]
  end
end

Two non-obvious points. First, the system field accepts an array of content blocks, not just a string — you have to use the array form to attach cache_control. Second, load and freeze the prompt once at boot. If you read the file inside reply, any edit, whitespace change or line-ending difference between deploys silently changes the prefix, and every one of those changes is a cache miss.

For a bot serving a thousand requests an hour with an eighteen-thousand-token system prompt, this single change cuts the input cost of that prompt from 18,000 tokens × $3 per million tokens = $0.054 per request to $0.0054 per request. Across a million requests that is roughly a $48,000 swing.

Production Pattern 2: Caching Long Documents in RAG

Retrieval-augmented generation pipelines often retrieve the same chunks repeatedly within a session — a user asking three questions about the same contract gets three identical retrievals. Cache the retrieved bundle:

class Knowledge::AssistantService
  # Loaded once and frozen so the cached prefix stays byte-identical between
  # calls (path is illustrative)
  BASE_SYSTEM_PROMPT = File.read(Rails.root.join("config/prompts/knowledge_v1.md")).freeze

  def answer(query:, session_id:)
    chunks = Embeddings::Retriever.new(query).top(8)
    context = chunks.map { |c| "[#{c.id}] #{c.content}" }.join("\n\n")

    ANTHROPIC.messages.create(
      model: "claude-sonnet-4-6",
      max_tokens: 800,
      system: [
        { type: "text", text: BASE_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
        { type: "text", text: context,           cache_control: { type: "ephemeral" } }
      ],
      messages: [{ role: "user", content: query }]
    )
  end
end

Two cache_control markers mean two cache breakpoints; Anthropic supports up to four per request. The base prompt stays cached across the whole app. The retrieved context stays cached across a session. The user query is not cached — it changes every call.

Caveat: if your retrieval is too good and returns different chunks every question, you will not get a cache hit on the context block. In practice I find sessions cluster around topics, so retrievals overlap heavily. Measure before assuming.
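
One way to tilt the odds is to pin the retrieved chunk set per session so follow-up questions reuse the same context block. A sketch, assuming Rails.cache, a hypothetical Chunk model and the Retriever from the example above:

# Pin the retrieved chunk ids for a session so the context block stays
# byte-identical across follow-up questions. Chunk is a hypothetical model;
# the 10-minute window and top(8) cutoff are illustrative.
def context_for(query, session_id)
  chunk_ids = Rails.cache.fetch("rag_chunks/#{session_id}", expires_in: 10.minutes) do
    Embeddings::Retriever.new(query).top(8).map(&:id)
  end

  Chunk.where(id: chunk_ids).order(:id).map { |c| "[#{c.id}] #{c.content}" }.join("\n\n")
end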

Production Pattern 3: Multi-Turn Conversations

For a chatbot that grows context with each turn, the trick is to mark the latest turn as the cache breakpoint. Every previous turn becomes part of the cached prefix on the next call:

def reply(conversation:, user_message:)
  history = conversation.turns.map { |t| { role: t.role, content: t.content } }

  # cache_control belongs on a content block, not on the message hash itself,
  # so wrap the last turn's text in an explicit block and mark it there
  if history.any?
    last = history.last
    last[:content] = [{ type: "text", text: last[:content], cache_control: { type: "ephemeral" } }]
  end

  ANTHROPIC.messages.create(
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }],
    messages: history + [{ role: "user", content: user_message }]
  )
end

After turn one, the system prompt is cached. After turn two, the system prompt and turn one are cached. After turn three, all of that plus turn two. By turn ten the cache is doing real work — you are paying full price only for the latest user message and the model’s reply. This is why long support conversations get progressively cheaper instead of progressively more expensive.

Production Pattern 4: Tool Definitions

If you are using tool use, your tool schemas can be hundreds or thousands of tokens. They almost never change between requests. Cache them:

TOOLS = JSON.parse(File.read(Rails.root.join("config/prompts/tools.json"))).freeze

ANTHROPIC.messages.create(
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: TOOLS,
  system: [{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }],
  messages: messages
)

The cache_control marker on the system prompt covers everything before it in the canonical request order, which includes the tools array. So a single marker on the system prompt caches both. If you have many tools or generate tool definitions dynamically, normalise them — same JSON output, same key order — or the hash will differ on every request and you will never hit cache.
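
If you do build tool schemas at runtime, one way to keep the serialised form byte-stable is to deep-sort keys and memoise the result. A sketch; build_tools is an illustrative placeholder for however you assemble them:

# Keep dynamically built tool definitions byte-stable between requests by
# deep-sorting keys and memoising, so the serialised JSON (and therefore the
# cache prefix hash) is identical on every call. build_tools is illustrative.
def stable_tools
  @stable_tools ||= deep_sort(build_tools).freeze
end

def deep_sort(value)
  case value
  when Hash  then value.keys.sort.to_h { |k| [k, deep_sort(value[k])] }
  when Array then value.map { |v| deep_sort(v) }
  else value
  end
end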

I dug deeper into wiring Claude tools into Rails in the LLM function calling guide; that post is a good companion if you are bolting caching onto an existing tool-using assistant.

Measuring Cache Hit Rate in Production

You cannot optimise what you do not measure. The Anthropic API returns cache statistics on every response:

response = ANTHROPIC.messages.create(...)

response.usage.input_tokens               # uncached input this call
response.usage.cache_creation_input_tokens # tokens you wrote to cache
response.usage.cache_read_input_tokens     # tokens served from cache
response.usage.output_tokens

Push these to your metrics pipeline on every call:

class Llm::Metered
  def self.call(label:, **kwargs)
    response = ANTHROPIC.messages.create(**kwargs)
    u = response.usage

    StatsD.increment("llm.calls", tags: ["label:#{label}"])
    StatsD.histogram("llm.cache_read_tokens",     u.cache_read_input_tokens,     tags: ["label:#{label}"])
    StatsD.histogram("llm.cache_creation_tokens", u.cache_creation_input_tokens, tags: ["label:#{label}"])
    StatsD.histogram("llm.input_tokens",          u.input_tokens,                tags: ["label:#{label}"])
    StatsD.histogram("llm.output_tokens",         u.output_tokens,               tags: ["label:#{label}"])

    response
  end
end

The metric you want on a dashboard is cache_read / (cache_read + cache_creation + input). A healthy production chatbot lands at 80–95% within a few minutes of warm traffic. If you are below 50% sustained, your cache key is changing too often — usually because something dynamic snuck into the cached prefix.
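
A per-call version of that ratio is easy to derive from the usage object; a small helper like this can feed a log line or a gauge alongside the Metered class:

# Cache hit rate for a single response: cache reads over all input tokens.
def cache_hit_rate(usage)
  read  = usage.cache_read_input_tokens.to_i
  write = usage.cache_creation_input_tokens.to_i
  fresh = usage.input_tokens.to_i
  total = read + write + fresh

  total.zero? ? 0.0 : (read.to_f / total).round(3)
end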

Traps Nobody Warns You About

Five traps I or my clients have hit since prompt caching went GA.

Whitespace and ordering changes invalidate the cache. Reading a Markdown file with File.read gives you a fresh string with whatever line endings the file has. Trim trailing whitespace at load time. Freeze. If you build the system prompt by interpolating, make sure the interpolated values are also cached or stripped — "You are #{Time.current}" is a 100% miss rate.
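
A small loading sketch that strips the usual sources of accidental misses: normalise once at boot, freeze, and keep anything dynamic out of the cached prefix.

# Normalise the prompt once at boot: unify line endings, drop trailing
# whitespace, freeze. Anything dynamic (timestamps, user names) belongs in
# the uncached part of the request, never in this string.
SYSTEM_PROMPT = File.read(Rails.root.join("config/prompts/support_v3.md"))
                    .gsub("\r\n", "\n")
                    .split("\n").map(&:rstrip).join("\n")
                    .strip
                    .freeze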

Cache markers are silently ignored below the minimum size. If your system prompt is 800 tokens, no caching happens, and you may not notice for weeks because the API does not error. Always check cache_creation_input_tokens on the first call after a deploy. If it is zero, you are below the threshold.
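
A cheap guard: check the first responses after a deploy and log loudly if neither a cache write nor a cache read shows up.

# Post-deploy sanity check: a call that should be cached must report either
# a cache write or a cache read. Zero for both usually means the prefix is
# below the minimum cacheable size, or cache_control was dropped.
usage = response.usage
if usage.cache_creation_input_tokens.to_i.zero? && usage.cache_read_input_tokens.to_i.zero?
  Rails.logger.warn("[llm] cache_control appears to be ignored for this request")
end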

Switching models invalidates everything. A cache entry is keyed by model. Rolling out Sonnet 4.6 to replace Sonnet 4 means a stampede of cache writes for the first five minutes. If that write premium matters at your volume, do model rollouts during low-traffic windows.

Beta headers can change. Some advanced caching modes, like the one-hour TTL, have at times required a beta header (anthropic-beta: extended-cache-ttl-2025-04-11 or similar — check the current docs). The Ruby SDK lets you pass extra headers; pin the version and read the changelog at upgrade time.
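
For reference, the documented request shape for the one-hour cache puts a ttl field on cache_control and adds the beta header to the request. The request_options / extra_headers form below is an assumption about the SDK, not confirmed here; check the gem's README before copying.

# One-hour cache sketch: ttl goes on cache_control, the beta header goes on
# the request. The request_options / extra_headers form is an assumption
# about the SDK; confirm the exact option name against the gem's README.
ANTHROPIC.messages.create(
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral", ttl: "1h" } }
  ],
  messages: messages,
  request_options: { extra_headers: { "anthropic-beta" => "extended-cache-ttl-2025-04-11" } }
)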

Cache hits do not count toward rate limits the same way as fresh tokens. This is a feature, not a bug — your effective throughput goes up. But it also means autoscaling logic based on input tokens will undercount load. Track cache reads separately if you scale workers based on token volume. The same way I advocated for tracking concurrency separately in the Puma tuning guide, here you want a dedicated metric.

When Not to Use Anthropic Prompt Caching

Three cases where it is the wrong tool.

Truly unique prompts. A code search tool that crafts a different system prompt per query has no shared prefix. Skip caching, focus on output token reduction instead.

Very low traffic. If you make twelve calls a day spread across business hours, your cache is cold every time. The 1.25x write penalty becomes a permanent tax.

Latency-sensitive single calls. Cache writes are slightly slower than uncached calls. For a one-shot extraction in a job that runs nightly, you do not benefit. For a synchronous user-facing chatbot, the latency hit on the first call is amortised across thousands of subsequent fast hits — worth it.

Frequently Asked Questions

How much can Anthropic prompt caching reduce Claude API costs?

In production Rails apps with a stable system prompt and steady traffic I see 70–90% reduction in input token cost. The variation depends on prompt length and traffic shape: longer cached prefixes and steadier traffic give bigger savings. Output tokens are not cached and do not change.

What is the minimum size for Anthropic prompt caching?

The minimum is 1024 tokens for Claude Sonnet and Opus models, and 2048 tokens for Haiku. Anything below that is silently not cached even if you pass cache_control. Check cache_creation_input_tokens on the first call to confirm caching is active.

Does prompt caching work with streaming responses in Rails?

Yes. Caching applies to the input prefix, which is identical whether you stream or not. The cache statistics arrive in the final usage event of the stream. If you are using ActionController::Live to stream Claude responses, see the streaming LLM responses guide — caching layers cleanly on top.

How long does the Anthropic prompt cache last?

The default ephemeral cache lasts five minutes from the last read, refreshing on each hit. There is also a one-hour TTL option for prompts you know will stay hot longer; it costs slightly more to write but extends the reuse window. For most chatbots the five-minute default is plenty because steady traffic keeps it warm continuously.


After nineteen years of Rails I have learned that the best optimisations are the ones that change a single configuration line and produce a graph that drops by an order of magnitude. Anthropic prompt caching is one of those. If you are running Claude in production through the Anthropic Ruby SDK, mark your prefixes today, watch the bill drop tomorrow.

Need help integrating Claude into a Rails system or auditing your LLM costs? TTB Software ships production-grade AI infrastructure on Rails. We have been doing this for nineteen years.

#anthropic-prompt-caching #claude-api-cost-reduction #anthropic-ruby-sdk #rails-llm-integration #prompt-caching-rails #claude-sdk-rails #ruby-on-rails

About the Author

Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.
