Rails LLM Cost Tracking: Per-Tenant Spend, Budget Caps, and Real-Time Quota Enforcement
Rails LLM cost tracking that survives a $40k surprise. Per-tenant token accounting, budget caps, quota enforcement, and finance-ready reports for your AI features.
A founder messaged me on a Sunday morning to say his OpenAI invoice for the previous week was forty-one thousand dollars. The product was a B2B SaaS with a chat-with-your-docs feature, three hundred paying tenants, and a generous free trial. One trial account had spun up a script that called the chat endpoint in a loop for four days straight. Nobody noticed because the bill arrived weekly and nothing in the Rails app knew, at request time, what any single tenant had spent. Rails LLM cost tracking is one of those problems that nobody builds for until they get burned once, and the burn is always more expensive than the week of work it would have taken to ship it properly.
After nineteen years of Rails and the last three of those wiring up production AI features for clients, I have come to treat per-tenant LLM accounting the same way I treat database connection pooling — boring infrastructure that you build before you need it because retrofitting it under fire is miserable. This post is the cost tracking stack I now ship by default on any Rails app that calls an LLM in production: the data model, the client wrapper, the real-time quota enforcer, and the finance-ready reports.
Why Rails LLM Cost Tracking Belongs In Your App, Not Your Provider Dashboard
OpenAI’s usage dashboard tells you what your organisation spent. Anthropic’s tells you the same. Neither tells you which of your customers caused that spend, which of your features burned the budget, or which tenant is twenty minutes away from costing more than they pay you. Provider dashboards are accounting; Rails LLM cost tracking is operations.
Three things break when you rely on the provider dashboard alone. The first is per-tenant economics — you cannot price an AI feature if you do not know the gross margin per customer. The second is abuse detection — by the time a runaway loop shows up in tomorrow’s billing CSV, you have already paid for it. The third is product control — you cannot show a “you have used 80% of your monthly tokens” banner in the UI without an internal counter. None of those problems get solved by waiting for an invoice.
The fix is to record every LLM call inside your Rails app, attribute it to a tenant and a feature, convert tokens to dollars at request time, and check that number against a per-tenant budget before you make the next call. The whole pattern is maybe four hundred lines of Ruby and saves you the kind of weekend I described in the opening.
The Data Model For Per-Tenant LLM Accounting
Start with a single table that records one row per LLM call. Do not try to be clever with aggregations until you have the raw data — Postgres aggregates fast enough on millions of rows that you can backfill summaries later.
# db/migrate/20260627000001_create_llm_usage_events.rb
class CreateLlmUsageEvents < ActiveRecord::Migration[8.0]
  def change
    create_table :llm_usage_events do |t|
      t.references :account, null: false, foreign_key: true
      t.references :user, foreign_key: true
      t.string :feature, null: false   # "chat", "summarize", "embed"
      t.string :provider, null: false  # "openai", "anthropic", "voyage"
      t.string :model, null: false     # "gpt-4o", "claude-opus-4-7"
      t.integer :prompt_tokens, null: false, default: 0
      t.integer :completion_tokens, null: false, default: 0
      t.integer :cached_tokens, null: false, default: 0
      t.decimal :cost_usd, precision: 12, scale: 6, null: false, default: 0
      t.integer :latency_ms
      t.string :request_id
      t.string :error_code
      t.jsonb :metadata, null: false, default: {}
      t.datetime :created_at, null: false
    end

    add_index :llm_usage_events, [:account_id, :created_at]
    add_index :llm_usage_events, [:account_id, :feature, :created_at]
    add_index :llm_usage_events, :request_id, unique: true, where: "request_id IS NOT NULL"
  end
end
A few choices worth flagging. cost_usd is a decimal because floats are not for money, and at six decimal places you can record a fraction of a cent without rounding noise. cached_tokens is tracked separately so you can prove the value of Anthropic prompt caching to the CFO. request_id carries the provider’s request identifier so you can join your logs against support tickets when a customer complains about a specific answer. The partial unique index on request_id means a retried recording fails with ActiveRecord::RecordNotUnique instead of silently inserting a duplicate row, so rescue that error wherever you retry.
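To make the decimal argument concrete, here is a minimal sketch of the per-call arithmetic in plain Ruby. The RATES constant and the call_cost helper are hypothetical names for illustration; the rates are per-token equivalents of $2.50, $10.00, and $1.25 per million tokens, and the sketch assumes cached tokens are counted inside prompt tokens.

```ruby
require "bigdecimal"

# Hypothetical per-token USD rates ($2.50 / $10.00 / $1.25 per million tokens).
RATES = {
  input:  BigDecimal("0.0000025"),
  output: BigDecimal("0.00001"),
  cached: BigDecimal("0.00000125")
}.freeze

# Cost of one call. Assumes cached tokens are a subset of prompt tokens,
# so they are subtracted before the full input rate is applied.
def call_cost(prompt:, completion:, cached:)
  billed_input = prompt - cached
  (billed_input * RATES[:input]) + (completion * RATES[:output]) + (cached * RATES[:cached])
end

cost = call_cost(prompt: 1_200, completion: 300, cached: 200)
puts cost.to_s("F") # 0.00575

# The float pitfall that the decimal column sidesteps:
puts 0.1 + 0.2 == 0.3                                           # false
puts BigDecimal("0.1") + BigDecimal("0.2") == BigDecimal("0.3") # true
```

That single call costs a little over half a cent, which is exactly the kind of value a two-decimal money column would round away.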
A second, much smaller table holds the per-tenant budget and the running total for the current period. Keep this denormalised on purpose — at quota-enforcement time you want a single indexed lookup, not a SUM across a million rows.
# db/migrate/20260627000002_create_llm_budgets.rb
class CreateLlmBudgets < ActiveRecord::Migration[8.0]
  def change
    create_table :llm_budgets do |t|
      t.references :account, null: false, foreign_key: true, index: { unique: true }
      t.decimal :monthly_limit_usd, precision: 10, scale: 2, null: false, default: 0
      t.decimal :current_spend_usd, precision: 12, scale: 6, null: false, default: 0
      t.date :period_start, null: false
      t.string :status, null: false, default: "active" # active, throttled, suspended
      t.timestamps
    end
  end
end
Wrapping Your LLM Client With Token Accounting
The wrapper is the heart of the system. Every call to OpenAI, Anthropic, or your embedding provider goes through one method, and that method writes a usage event on the way out. If you have client code that calls OpenAI::Client.new.chat(...) scattered across controllers and jobs, your first refactor is to funnel it through a single service.
# app/services/llm/client.rb
class Llm::Client
  # USD per token. BigDecimal keeps the arithmetic exact until it reaches the decimal column.
  PRICING = {
    "gpt-4o"            => { input: BigDecimal("0.0000025"),  output: BigDecimal("0.00001"),   cached: BigDecimal("0.00000125") },
    "gpt-4o-mini"       => { input: BigDecimal("0.00000015"), output: BigDecimal("0.0000006"), cached: BigDecimal("0.000000075") },
    "claude-opus-4-7"   => { input: BigDecimal("0.000015"),   output: BigDecimal("0.000075"),  cached: BigDecimal("0.0000015") },
    "claude-sonnet-4-6" => { input: BigDecimal("0.000003"),   output: BigDecimal("0.000015"),  cached: BigDecimal("0.0000003") }
  }.freeze

  def initialize(account:, feature:, user: nil)
    @account = account
    @feature = feature
    @user = user
  end

  def chat(model:, messages:, **options)
    # Enforce the budget before spending anything.
    Llm::QuotaGuard.new(@account).check!
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    # provider_for and provider_name_for (model-to-adapter lookups) are not shown here.
    response = provider_for(model).chat(model: model, messages: messages, **options)
    latency_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at) * 1000).to_i
    record_usage!(model: model, response: response, latency_ms: latency_ms, options: options)
    response
  rescue Llm::ProviderError => e
    # Failed calls are recorded too, with zero tokens and the provider's error code.
    record_usage!(model: model, response: nil, latency_ms: nil, options: options, error: e)
    raise
  end

  private

  def record_usage!(model:, response:, latency_ms:, options: {}, error: nil)
    # Assumes the adapter returns an OpenAI-shaped usage hash.
    usage = response&.dig("usage") || {}
    prompt = usage["prompt_tokens"].to_i
    completion = usage["completion_tokens"].to_i
    cached = usage.dig("prompt_tokens_details", "cached_tokens").to_i
    cost = calculate_cost(model: model, prompt: prompt, completion: completion, cached: cached)

    event = LlmUsageEvent.create!(
      account: @account, user: @user, feature: @feature,
      provider: provider_name_for(model), model: model,
      prompt_tokens: prompt, completion_tokens: completion, cached_tokens: cached,
      cost_usd: cost, latency_ms: latency_ms,
      request_id: response&.dig("id"),
      error_code: error&.code,
      # Temperature is a request option; the response body does not echo it back.
      metadata: { temperature: options[:temperature] }.compact
    )
    Llm::BudgetUpdater.call(account: @account, delta_usd: cost)
    event
  end

  def calculate_cost(model:, prompt:, completion:, cached:)
    rates = PRICING.fetch(model) { raise "Unknown model: #{model}" }
    # Cached tokens are part of prompt_tokens, so bill them once at the cached rate.
    billed_input = prompt - cached
    (billed_input * rates[:input]) + (completion * rates[:output]) + (cached * rates[:cached])
  end
end
The single most important line in there is Llm::QuotaGuard.new(@account).check!. Quota enforcement happens before the API call, not after. If a tenant has blown their budget, you do not want to send the request, eat the cost, and then notice on the way back. The provider does not give refunds for “I changed my mind.”
The second most important detail is that record_usage! runs even when the provider call errors. On some providers a failed call still costs you tokens up to the point of failure, and even where it does not, you want the error rate per tenant in your dashboard. Wrap the whole thing in ensure if you want belt-and-braces, but in practice the explicit rescue is more readable.
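For comparison, this is roughly what the ensure variant looks like. It is a sketch in plain Ruby with in-memory stand-ins: USAGE_LOG, ProviderError, and with_usage_recording are hypothetical names, and the response shape is assumed rather than taken from any specific SDK.

```ruby
# In-memory stand-ins so the sketch runs without Rails or a provider SDK.
USAGE_LOG = []
class ProviderError < StandardError; end

# Runs the block (the provider call) and records exactly one usage row,
# whether the block returns normally or raises.
def with_usage_recording(feature:)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = nil
  error = nil
  response = yield
rescue StandardError => e
  error = e
  raise
ensure
  latency_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at) * 1000).to_i
  USAGE_LOG << {
    feature: feature,
    tokens: response ? response.dig("usage", "total_tokens").to_i : 0,
    latency_ms: latency_ms,
    error: error&.class&.name
  }
end

fake_response = { "usage" => { "total_tokens" => 42 } }
ok = with_usage_recording(feature: "chat") { fake_response }

begin
  with_usage_recording(feature: "chat") { raise ProviderError, "rate limited" }
rescue ProviderError
  # The caller still sees the error; the usage row was written first.
end

puts USAGE_LOG.size # 2
```

The trade-off is the one noted above: ensure guarantees a row on every path, but the success and failure branches are harder to tell apart at a glance than with two explicit record_usage! calls.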
Real-Time Quota Enforcement Without Killing Throughput
A naive quota check does a SUM across llm_usage_events for the current period on every request. That works at one tenant and ten requests per minute. It collapses at five hundred tenants and ten thousand requests per minute. The fix is to maintain a running counter in llm_budgets.current_spend_usd, increment it atomically, and read it with a single indexed lookup.
# app/services/llm/quota_guard.rb
class Llm::QuotaGuard
  class QuotaExceeded < StandardError; end

  def initialize(account)
    @account = account
    @budget = account.llm_budget || account.create_llm_budget!(
      monthly_limit_usd: account.plan.monthly_llm_limit_usd,
      period_start: Date.current.beginning_of_month
    )
  end

  def check!
    rollover_if_new_period!
    # Check suspension first so it also applies to accounts with an unlimited budget.
    raise QuotaExceeded, "Account #{@account.id} suspended" if @budget.status == "suspended"
    return if @budget.monthly_limit_usd.zero? # 0 = unlimited (use carefully)

    if @budget.current_spend_usd >= @budget.monthly_limit_usd
      @budget.update!(status: "throttled")
      raise QuotaExceeded,
        "Account #{@account.id} over monthly limit (#{@budget.current_spend_usd} / #{@budget.monthly_limit_usd} USD)"
    end
  end

  private

  def rollover_if_new_period!
    return if @budget.period_start == Date.current.beginning_of_month

    # with_lock reloads the row, so re-check inside the lock in case another request already rolled over.
    @budget.with_lock do
      next if @budget.period_start == Date.current.beginning_of_month
      @budget.update!(
        period_start: Date.current.beginning_of_month,
        current_spend_usd: 0,
        status: "active"
      )
    end
  end
end
The companion updater uses an atomic UPDATE so two concurrent requests cannot overwrite each other's increments and lose spend.
# app/services/llm/budget_updater.rb
class Llm::BudgetUpdater
  def self.call(account:, delta_usd:)
    LlmBudget.where(account_id: account.id).update_all([
      "current_spend_usd = current_spend_usd + ?, updated_at = NOW()", delta_usd
    ])
  end
end
That update_all issues a single atomic increment in Postgres. No lost updates, no read-modify-write, no with_lock. The guard's check and this increment are still separate steps, so requests already in flight when a tenant crosses the line will complete; treat the limit as a soft cap with a small overshoot. For most production workloads this is enough. If you push above a few hundred LLM calls per second per tenant you will want to move the counter to Redis with a periodic flush to Postgres, but that is a problem you will not have for a while, and prematurely solving it makes debugging harder. We covered the same pattern in Rails rate limiting with Rack::Attack, where the trade-off between accuracy and latency comes up in identical form.
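Because the guard reads the counter and the updater increments it in separate statements, a strict cap needs the check and the increment folded into one step, using an estimated cost reserved before the call. The sketch below shows that idea with an in-memory stand-in; BudgetCounter and reserve are hypothetical names, and the Mutex plays the role a row lock would play inside a single conditional UPDATE. It is an illustration of the technique, not part of the stack described above.

```ruby
require "bigdecimal"

# In-memory stand-in for a budget row. The Mutex stands in for the
# row-level lock a database takes during one UPDATE statement.
class BudgetCounter
  attr_reader :spend

  def initialize(limit)
    @limit = BigDecimal(limit)
    @spend = BigDecimal("0")
    @lock  = Mutex.new
  end

  # Check and increment in one critical section. Returns true and adds the
  # amount if it fits under the limit; otherwise returns false and changes nothing.
  def reserve(amount)
    amount = BigDecimal(amount)
    @lock.synchronize do
      return false if @spend + amount > @limit
      @spend += amount
      true
    end
  end
end

counter = BudgetCounter.new("50")
results = Array.new(100) { Thread.new { counter.reserve("1") } }.map(&:value)

puts results.count(true)     # 50
puts counter.spend.to_s("F") # 50.0
```

One hundred concurrent attempts against a limit of fifty succeed exactly fifty times, whatever order the threads run in. The cost is that you must estimate spend up front and reconcile it against actual usage afterwards, which is why the softer check-then-increment approach is a reasonable default.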
Surfacing Cost To The Right People
Internal dashboards are the easy part. The harder choice is what you show to whom. After many client engagements, the pattern I keep landing on is three audiences with three different views.
Developers see per-feature spend in their staging environment, broken down by model and by tenant, with a daily Slack digest of the top five callers. The intent is to catch the new RAG endpoint that uses ten times more tokens than expected before it ships to prod.
Tenants see their own usage in the app — a meter, a forecast, and a “you are at 78% of your monthly allowance” warning. The exact wording matters here. A meter that turns red at 95% changes behaviour. A bill at the end of the month does not. We treat this UI as a product feature, not an admin afterthought.
Finance gets a monthly CSV with account_id, monthly_spend_usd, prompt_tokens, completion_tokens, and cached_tokens. The same query feeds the gross margin spreadsheet and the customer success “which accounts are at risk” report. One query, three downstream uses.
# app/queries/llm/monthly_spend_query.rb
class Llm::MonthlySpendQuery
  # Use a Time range: a Date range is cast to midnight and would drop
  # almost everything recorded on the last day of the month.
  def self.call(period: Time.current.all_month)
    LlmUsageEvent
      .where(created_at: period)
      .group(:account_id)
      .select(
        "account_id",
        "SUM(prompt_tokens) AS prompt_tokens",
        "SUM(completion_tokens) AS completion_tokens",
        "SUM(cached_tokens) AS cached_tokens",
        "SUM(cost_usd) AS spend_usd"
      )
      .order("spend_usd DESC")
  end
end
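A Postgres index on (account_id, created_at) keeps the per-account version of this query under a hundred milliseconds well into the tens of millions of rows. The all-accounts report filters on created_at alone, so check its plan with EXPLAIN ANALYZE and add an index on created_at if the planner falls back to a sequential scan. Once you outgrow that, the same query becomes a nightly materialised view; we wrote about that pattern in pg_stat_statements for finding slow queries in production.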
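The tenant-facing meter described above reduces to a small pure function over the two budget columns. This is a sketch under assumptions: usage_meter, WARN_AT, and RED_AT are hypothetical names, the 80% and 95% thresholds are illustrative, and a limit of zero is treated as unlimited to match the guard's convention.

```ruby
require "bigdecimal"

# Hypothetical thresholds: warn at 80%, turn the meter red at 95%.
WARN_AT = 80
RED_AT  = 95

# Returns the whole-number percentage used and a display level for the meter.
def usage_meter(spend_usd:, limit_usd:)
  limit = BigDecimal(limit_usd.to_s)
  return { percent: 0, level: :unlimited } if limit.zero?

  percent = (BigDecimal(spend_usd.to_s) / limit * 100).floor
  level =
    if percent >= RED_AT then :red
    elsif percent >= WARN_AT then :warning
    else :ok
    end
  { percent: percent, level: level }
end

calm   = usage_meter(spend_usd: "39.00", limit_usd: "50.00") # 78%, level :ok
warned = usage_meter(spend_usd: "40.00", limit_usd: "50.00") # 80%, level :warning
red    = usage_meter(spend_usd: "47.50", limit_usd: "50.00") # 95%, level :red
```

Keeping this free of ActiveRecord makes it easy to unit test and to reuse in both the in-app banner and the warning email.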
What To Do When You Catch An Abuser
The first time the quota guard fires on a paying customer, you will be tempted to raise their limit. Do not, at least not before you understand what happened. Half the time it is genuine growth and you want to upsell. The other half it is a buggy integration on their side that is pounding your endpoint in a loop, and raising the limit just makes the bug more expensive.
The pattern that works is a three-step playbook. First, cut off the LLM features for the account programmatically; the guard already does this by refusing calls once spend reaches the limit and marking the budget throttled. Second, send a single, plain-English email to the account owner: “Your account hit its monthly AI usage limit. Here is what we saw, here is what it cost, please let us know if this looks expected.” Third, open a Slack channel internally with the offending account’s recent events, top features, and top calling users so support, sales, and engineering can decide together.
If it is abuse, suspend, refund nothing, and document. If it is growth, lift the cap, log the new limit, and queue an upsell conversation. The point of the system is not to punish customers; it is to give you the information to make either decision quickly.
FAQ
How accurate is Rails LLM cost tracking against the provider’s invoice?
In practice, within a few percent. The two sources of drift are taxes and discounts that the provider applies at billing time, and the rare race where a request was charged but the response never reached your code. I reconcile monthly and have never found a delta under 2% that was worth chasing.
Should I store the full prompt and completion text?
Not by default. It explodes your database, complicates GDPR, and is rarely worth it. Store the request_id and a hash of the prompt. If you need replay for evaluations, mirror prompts to a separate, access-controlled bucket — we cover this in Rails LLM evals for testing prompts in CI.
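A prompt hash is cheap to compute and lets you spot repeated prompts without keeping the text. A minimal sketch, assuming messages are hashes with :role and :content keys; prompt_fingerprint is a hypothetical helper, and the whitespace trimming is a choice you may not want.

```ruby
require "digest"

# Builds a stable SHA-256 fingerprint for a list of chat messages.
# Trimming whitespace before hashing is an assumption; drop the strip call
# if byte-for-byte matching matters for your evaluations.
def prompt_fingerprint(messages)
  canonical = messages.map { |m| "#{m[:role]}:#{m[:content].strip}" }.join("\n")
  Digest::SHA256.hexdigest(canonical)
end

a = prompt_fingerprint([{ role: "user", content: "Summarise this contract." }])
b = prompt_fingerprint([{ role: "user", content: "Summarise this contract.  " }])

puts a == b   # true
puts a.length # 64
```

The resulting 64-character string fits in the metadata column on llm_usage_events, so no schema change is needed to start collecting it.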
What about streaming responses where token counts arrive at the end?
Most providers can return a final usage chunk in the SSE stream; OpenAI, for example, sends one when you set stream_options: { include_usage: true }. Have the wrapper capture it on stream close and record the event then. If the stream is cut off mid-flight, record what you have with an error_code so you can spot patterns of disconnects.
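The streaming case can be sketched as a small consumer that accumulates text and keeps whatever usage it sees. The chunk shape here is an assumption modelled on OpenAI-style chat completion chunks, already parsed into hashes; consume_stream and the "stream_interrupted" code are hypothetical names.

```ruby
# Consumes an enumerable of parsed stream chunks. Returns the accumulated
# text, the usage hash if one arrived, and an error_code when it did not.
def consume_stream(chunks)
  text = +""
  usage = nil
  chunks.each do |chunk|
    delta = chunk.dig("choices", 0, "delta", "content")
    text << delta if delta
    usage = chunk["usage"] if chunk["usage"]
  end
  { text: text, usage: usage, error_code: usage ? nil : "stream_interrupted" }
end

complete = consume_stream([
  { "choices" => [{ "delta" => { "content" => "Hello" } }] },
  { "choices" => [{ "delta" => { "content" => ", world" } }] },
  { "choices" => [], "usage" => { "prompt_tokens" => 12, "completion_tokens" => 3 } }
])

# Simulates a stream that was cut off before the usage chunk arrived.
truncated = consume_stream([
  { "choices" => [{ "delta" => { "content" => "Hel" } }] }
])

puts complete[:text]        # Hello, world
puts truncated[:error_code] # stream_interrupted
```

When the usage chunk never arrives, you still have the partial text, so a rough token estimate from its length is better than recording the call as free.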
How do I track LLM cost when the call happens inside a background job?
Same wrapper. Pass account and feature to the job and instantiate Llm::Client.new(account: account, feature: "summarize") inside perform. Do not try to record from the worker outside the wrapper — you will end up with two code paths for the same accounting logic, and they will diverge within three sprints.
Need help shipping production-grade Rails LLM cost tracking — per-tenant accounting, real-time quotas, finance-ready reports? TTB Software specializes in Rails AI infrastructure for SaaS teams who need the AI feature to be profitable, not just impressive. We’ve been doing this for nineteen years.