Anthropic Message Batches in Rails: Cut Claude API Costs 50% with Async Batch Processing
A founder I work with runs a content classification pipeline that pushes about forty thousand documents a day through Claude. They were paying just under twelve thousand dollars a week for it. The CFO wanted to know if there was a cheaper way before he renewed the budget. There was: Anthropic Message Batches. We moved the pipeline over a Friday afternoon, the next week’s bill was just under six thousand, and the only behavioural change anyone noticed was that results landed twenty minutes after upload instead of two seconds after each request.
After nineteen years of Rails I have built a lot of “send this to a third party API in the background and store the result” pipelines, and Anthropic Message Batches is the cleanest version of that pattern I have seen for LLM workloads. If you are running any kind of bulk classification, summarisation, extraction, or evaluation against Claude, and you do not need a synchronous answer, the Anthropic Message Batches API is the lever to pull. This is the production playbook.
What Anthropic Message Batches Actually Are
The Claude API has two delivery modes. The synchronous Messages API is what most Rails apps start with — you POST /v1/messages, you get a response in a few seconds, you write it to the database. That is fine for chat, low-volume agents, and anything user-facing where a human is waiting.
Anthropic Message Batches is the async cousin. You submit a batch of up to ten thousand requests or 256 MB of payload in a single call. Anthropic acknowledges the batch, processes it within twenty-four hours (usually much faster, minutes for small batches), and then exposes a results endpoint with one JSONL line per request. Every call inside the batch costs fifty percent of the synchronous price, including cached tokens. The 50% discount stacks with prompt caching, so a batched call hitting a warm cache costs five percent of the synchronous, uncached baseline.
There are three failure modes you need to design around: individual requests inside a batch can fail while the batch as a whole succeeds, the batch can be cancelled, and the batch can expire if Anthropic cannot finish it inside the window. None of these are exotic — they are the same operational concerns as any other async pipeline — but Rails apps usually start with synchronous calls, and the mental model has to shift.
The Anthropic Ruby SDK exposes batches as client.messages.batches. You create a batch, you poll for its status, you stream results when it finishes:
client = Anthropic::Client.new

batch = client.messages.batches.create(
  requests: [
    {
      custom_id: "doc-1",
      params: {
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Classify: ..." }]
      }
    }
  ]
)

batch.id                # => "msgbatch_01..."
batch.processing_status # => "in_progress"
The custom_id is the only thing you control inside the batch envelope, and it is the single most important thing in the design. It is how you reconcile results with the originating Rails records.
When Anthropic Message Batches Pay Off
The math on Anthropic Message Batches is honestly easier than prompt caching. Half price, no break-even, no warm-up. The only question is whether your workload tolerates async delivery.
These are the Rails workloads where it is a clear win. Nightly enrichment of records that came in during the day — classify, tag, embed metadata, or summarise. Re-processing historical data after a prompt change. Bulk evaluations and red-team runs against a model release. Generating alt text, descriptions, or SEO blurbs for a content library. Anything that ends with “…and then stash the result on the record.”
The places it does not fit are the obvious ones. User-facing chat where a human is waiting. Tool-using agents that need to react to model output and decide the next call. Streaming responses. Workloads where the upstream system needs the answer to make a synchronous decision. For these, stay on the regular Messages API and lean on prompt caching for cost.
Where it gets interesting is the middle ground. A “submit a job, get an email when it is done” workflow inside a SaaS app maps perfectly onto Anthropic Message Batches. So does a “process this CSV of leads through Claude and write back to HubSpot” import. If the user can wait minutes, you should be batching.
Building the Anthropic Message Batches Pipeline in Rails
Here is the production shape I keep returning to. One Rails model for the batch envelope, one for each individual request, a Solid Queue job for submission, a polling job for status, and an idempotent result handler. Nothing exotic.
# db/migrate/20260430000001_create_claude_batches.rb
class CreateClaudeBatches < ActiveRecord::Migration[8.0]
  def change
    create_table :claude_batches do |t|
      t.string :anthropic_id, index: { unique: true }
      t.string :status, null: false, default: "pending"
      t.integer :request_count, null: false, default: 0
      t.integer :succeeded_count, null: false, default: 0
      t.integer :errored_count, null: false, default: 0
      t.datetime :submitted_at
      t.datetime :ended_at
      t.timestamps
    end

    create_table :claude_batch_requests do |t|
      t.references :claude_batch, null: false, foreign_key: true
      t.references :subject, polymorphic: true, null: false
      t.string :custom_id, null: false
      t.jsonb :params, null: false, default: {}
      t.string :result_status
      t.jsonb :result_payload
      t.timestamps

      t.index [:claude_batch_id, :custom_id], unique: true
    end
  end
end
The polymorphic subject is the Rails record the request is about — a Document, a Lead, a Product, whatever. The custom_id is what we send to Anthropic, and we make it deterministic so retries are safe.
class ClaudeBatchRequest < ApplicationRecord
  belongs_to :claude_batch
  belongs_to :subject, polymorphic: true

  before_validation :assign_custom_id, on: :create

  private

  # Deterministic: the same subject always produces the same custom_id,
  # so a rebuilt or retried batch reconciles cleanly. Uniqueness is only
  # enforced per batch via the [claude_batch_id, custom_id] index.
  def assign_custom_id
    self.custom_id ||= "#{subject_type.underscore}-#{subject_id}"
  end
end
The submission job builds the JSONL payload, ships it, and stores the Anthropic batch id. I keep the body construction in a plain Ruby object rather than the job itself — easier to test, easier to swap models later.
class ClaudeBatchSubmitter
  def initialize(claude_batch)
    @claude_batch = claude_batch
    @client = Anthropic::Client.new
  end

  def call
    requests = @claude_batch.claude_batch_requests.map do |req|
      { custom_id: req.custom_id, params: req.params }
    end

    response = @client.messages.batches.create(requests: requests)

    @claude_batch.update!(
      anthropic_id: response.id,
      status: "in_progress",
      request_count: requests.size,
      submitted_at: Time.current
    )

    ClaudeBatchPollJob.set(wait: 30.seconds).perform_later(@claude_batch.id)
  end
end
Polling is the part everyone wants to over-engineer. The Anthropic API does not push webhooks for batches, so you have to poll. Solid Queue makes this cheap because re-enqueuing a job with wait: is a single insert into the database. I poll every thirty seconds for the first five minutes, then back off to a minute, then to five.
class ClaudeBatchPollJob < ApplicationJob
  queue_as :claude_batches

  def perform(claude_batch_id)
    batch = ClaudeBatch.find(claude_batch_id)
    return if batch.status == "ended"

    response = Anthropic::Client.new.messages.batches.retrieve(batch.anthropic_id)

    case response.processing_status
    when "in_progress"
      reschedule(batch)
    when "ended"
      ClaudeBatchResultIngestJob.perform_later(batch.id)
    when "canceling", "canceled", "expired"
      batch.update!(status: response.processing_status, ended_at: Time.current)
    end
  end

  private

  def reschedule(batch)
    age = Time.current - batch.submitted_at
    delay = case age
            when 0..5.minutes then 30.seconds
            when 5.minutes..30.minutes then 1.minute
            else 5.minutes
            end

    self.class.set(wait: delay).perform_later(batch.id)
  end
end
Result ingestion is where idempotency matters. Anthropic exposes results as a streaming JSONL endpoint. You read line by line, look up the request by custom_id, and write the outcome. If the job dies halfway through and re-runs, the unique index on [claude_batch_id, custom_id] plus an if request.result_status.nil? guard keeps you safe.
class ClaudeBatchResultIngestJob < ApplicationJob
  queue_as :claude_batches

  def perform(claude_batch_id)
    batch = ClaudeBatch.find(claude_batch_id)
    client = Anthropic::Client.new

    client.messages.batches.results(batch.anthropic_id).each do |entry|
      ingest_one(batch, entry)
    end

    batch.update!(
      status: "ended",
      ended_at: Time.current,
      succeeded_count: batch.claude_batch_requests.where(result_status: "succeeded").count,
      errored_count: batch.claude_batch_requests.where.not(result_status: "succeeded").count
    )
  end

  private

  def ingest_one(batch, entry)
    request = batch.claude_batch_requests.find_by(custom_id: entry.custom_id)
    return unless request
    return if request.result_status.present?

    request.update!(
      result_status: entry.result.type,
      result_payload: entry.result.to_h
    )

    ClaudeBatchRequestProcessor.new(request).call if entry.result.type == "succeeded"
  end
end
The downstream processor is application-specific — write the classification to the Document, attach the embedding, send the notification. Keep it boring. The whole win of Anthropic Message Batches is moving cost out of the synchronous path; do not give that win back by making the result handler complicated.
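For the classification case, the processor really can be a few lines. This is a minimal sketch rather than a prescription: the classification column on the subject and the exact nesting of result_payload (the stored hash of a succeeded result, with the Messages API response under its message key) are assumptions you would adjust to your own schema.
class ClaudeBatchRequestProcessor
  def initialize(claude_batch_request)
    @request = claude_batch_request
  end

  def call
    # Assumes result_payload holds the succeeded result hash, with the
    # model's reply under "message" -> "content" -> first text block.
    text = @request.result_payload.dig("message", "content", 0, "text")
    return if text.blank?

    # Application-specific from here: this assumes a hypothetical
    # `classification` column on the subject record (a Document, say).
    @request.subject.update!(classification: text.strip)
  end
end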
Anthropic Message Batches and Prompt Caching Together
This is the part most teams miss. Anthropic Message Batches pricing stacks with prompt caching. Cached input tokens inside a batch are billed at 0.05x the synchronous uncached rate: the batch discount halves the cache-read price, which is already a tenth of base input. If your batch shares a system prompt or a large preamble across thousands of requests, structure it so the prefix is identical and add cache_control: { type: "ephemeral" } on the last shared block.
shared_system = [
  { type: "text", text: long_system_prompt,
    cache_control: { type: "ephemeral" } }
]

requests = documents.map do |doc|
  {
    custom_id: "doc-#{doc.id}",
    params: {
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      system: shared_system,
      messages: [{ role: "user", content: "Classify: #{doc.body}" }]
    }
  }
end
The order matters. Anthropic hashes the prefix up to each cache marker, so anything that varies between requests has to come after the marker. If you put the document body inside the cached system, you have just turned every request into a cache miss and burned the discount.
For a forty-thousand-document daily run with an eight-thousand-token shared system prompt, the difference between cached batches and uncached batches is roughly an order of magnitude on the input bill. I covered the cache mechanics in detail in Anthropic Prompt Caching in Rails — the same patterns apply inside batches with the additional 50% discount on top.
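To make that concrete, here is a back-of-envelope sketch for that daily run. The numbers are assumptions for illustration (a Sonnet-class base input price of $3 per million tokens and roughly 500 tokens of document body per request); check your own model's price sheet before trusting the output.
# Back-of-envelope input cost for the 40,000-document nightly run.
# Assumed numbers: $3 per million input tokens, 8,000-token shared
# system prompt, ~500 tokens of document body per request.
base_per_mtok = 3.0
docs          = 40_000
prefix_tok    = 8_000
body_tok      = 500

uncached_batched = docs * (prefix_tok + body_tok) / 1_000_000.0 * base_per_mtok * 0.5
cached_batched   = docs * prefix_tok / 1_000_000.0 * base_per_mtok * 0.05 +
                   docs * body_tok   / 1_000_000.0 * base_per_mtok * 0.5

uncached_batched.round # => 510 (dollars of input per day, batch discount only)
cached_batched.round   # => 78  (dollars per day with the cached shared prefix)
# Roughly 6-7x here; the shorter the per-document suffix, the closer you
# get to the full 10x saving on the shared prefix itself.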
Operational Gotchas with Anthropic Message Batches
Five things have bitten me in production. None of them are subtle once you know to look for them.
Batch size limits matter. You cannot stuff a hundred thousand requests into one batch. The cap is ten thousand requests or 256 MB. Above that you have to split, and the splitting has to be deterministic so retries do not double-process. I shard by subject_id % batch_count and store the shard on the ClaudeBatch row.
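A minimal version of that split, assuming a Document relation, a hypothetical params_for(doc) helper, and a shard integer column on claude_batches (not shown in the migration above), might look like this:
MAX_PER_BATCH = 10_000
batch_count   = (documents.count / MAX_PER_BATCH.to_f).ceil

batch_count.times do |shard|
  # Deterministic: the same document always falls in the same shard, so a
  # re-run of the splitter never assigns a record to two different batches.
  batch = ClaudeBatch.create!(status: "pending", shard: shard)

  documents.where("id % ? = ?", batch_count, shard).find_each do |doc|
    batch.claude_batch_requests.create!(subject: doc, params: params_for(doc))
  end

  ClaudeBatchSubmitter.new(batch).call
end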
Rate limits on batch creation are separate from synchronous rate limits. You can have plenty of synchronous tokens left and still get throttled when submitting batches. Wrap the create call in retry-with-backoff and treat 429s as a signal to slow down submission, not to fail the job.
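A sketch of that wrapper. The rescue targets the SDK's rate-limit error; the exact class name used here (Anthropic::Errors::RateLimitError) is an assumption to verify against the SDK version you are running.
def create_batch_with_backoff(client, requests, attempts: 5)
  attempt = 0
  begin
    client.messages.batches.create(requests: requests)
  rescue Anthropic::Errors::RateLimitError
    attempt += 1
    raise if attempt >= attempts

    # Exponential backoff with a little jitter so parallel submitters
    # do not all retry in lockstep: roughly 2s, 4s, 8s, 16s.
    sleep((2**attempt) + rand)
    retry
  end
end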
Individual request errors do not fail the batch. A batch can land with five thousand succeeded and five thousand errored and Anthropic will report the batch itself as ended cleanly. You have to inspect each result. The most common per-request error I see is overloaded_error — usually safe to re-batch the failures with a delay.
The 24-hour expiry is a real expiry. If Anthropic does not finish your batch inside the window, the requests that did not complete come back as expired and you do not get charged for them. But you also do not get a result. Always plan for partial completion and have a re-submission path.
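One shape for that re-submission path, built on the tables above. Treating expired requests plus overloaded errors as retryable is my reading of the failure types, not something the API enforces, and the dig path into the stored error payload is an assumption about how your ingested result hash is nested.
class ClaudeBatchResubmitter
  def initialize(claude_batch)
    @claude_batch = claude_batch
  end

  def call
    retryable = @claude_batch.claude_batch_requests.select do |req|
      req.result_status == "expired" || overloaded?(req)
    end
    return if retryable.empty?

    # Fresh envelope, same subjects and params. custom_ids are deterministic
    # per subject, so downstream reconciliation works exactly as before.
    retry_batch = ClaudeBatch.create!(status: "pending")
    retryable.each do |req|
      retry_batch.claude_batch_requests.create!(subject: req.subject, params: req.params)
    end

    # In practice, enqueue this with a delay rather than submitting immediately.
    ClaudeBatchSubmitter.new(retry_batch).call
  end

  private

  def overloaded?(req)
    # Adjust the dig path to however your stored payload nests the error type.
    req.result_status == "errored" &&
      req.result_payload.dig("error", "error", "type") == "overloaded_error"
  end
end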
Cost reporting lags. The Anthropic dashboard’s per-batch cost takes longer to appear than synchronous spend. Do not size your savings off the live dashboard the day you ship — wait a week, look at the invoice.
Frequently Asked Questions
How much does the Anthropic Message Batches API actually save versus synchronous Claude API calls?
Anthropic Message Batches are billed at fifty percent of the synchronous Messages API rate for both input and output tokens. The discount applies to every model and stacks with prompt caching, so a batched request hitting a warm cache costs roughly five percent of the synchronous, uncached baseline. For workloads that already use prompt caching, batching is the single biggest remaining cost lever.
What is the maximum batch size for Anthropic Message Batches?
A single Anthropic Message Batches submission is capped at ten thousand requests or 256 MB of payload, whichever you hit first. Above that you have to split into multiple batches. The processing window is up to twenty-four hours from creation, though small batches typically complete in minutes.
How do I handle individual request failures inside an Anthropic Message Batches result set?
Inspect the result.type of each entry in the streamed JSONL response. Possible values are succeeded, errored, canceled, and expired. The batch itself is marked ended even when individual requests fail, so you must iterate every line and decide per-request whether to re-submit. The most common transient failure is overloaded_error, which is safe to re-batch with a backoff.
Can I use Anthropic Message Batches with prompt caching and the Anthropic Ruby SDK at the same time?
Yes. Add cache_control: { type: "ephemeral" } on the shared prefix of each request inside the batch, exactly as you would for synchronous calls. Cached input tokens inside a batch are billed at 0.05x the synchronous uncached rate. The Anthropic Ruby SDK passes the cache control field through unchanged, so the pattern is identical to non-batched code.
When should I not use Anthropic Message Batches?
Anything user-facing where a human is waiting on the response, any agentic loop where the next call depends on the model output, and any streaming workload. Stay on the synchronous Messages API for those and use prompt caching for cost. Batching is for “submit ten thousand jobs, come back later, write the results to the database” work — not interactive traffic.
Need help cutting LLM costs in production Rails? TTB Software specializes in reliable AI integrations for Rails apps — we build the batch pipelines, caching, and async infrastructure that makes Claude affordable at scale. We have been doing Rails for nineteen years.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.