Rails LLM Evals: Testing Prompts in CI Before They Break Production
Rails LLM evals catch prompt regressions before they ship to production. Build golden datasets, score CI runs, and track token costs for Claude and OpenAI.
A Rails team I advise shipped a one-line change to a customer-support classifier prompt on a Wednesday afternoon. The prompt had been quietly handling 18,000 inbound tickets a day for nine months. By Thursday morning, urgent escalations were being routed to the billing inbox, billing disputes were going to the engineering on-call rotation, and the support lead was sitting at her desk wondering whether she had lost her mind. The change was three words. They thought it would make the prompt more concise. It silently flipped the meaning of one of the categories the model had to predict. There was no test. There was no eval. There was just the diff, the deploy, and the wreckage. We spent a Friday writing Rails LLM evals the way we should have written them on day one, and they have not had a silent regression since.
After nineteen years of Rails I have stopped being surprised by what production code does. After three years of running LLMs in production, the only thing that still surprises me is how casually teams treat prompt changes. A prompt is a function. It has inputs, it has outputs, it has a behavior contract, and that contract can be broken by a comma. We test our Ruby. We test our SQL. We have to test our prompts.
Why Rails LLM Evals Are Not Optional Anymore
A traditional test asserts that a function returns a deterministic value. An LLM call does not return a deterministic value. Run the same prompt twice, get two different strings. That is the property that makes people throw up their hands and skip testing entirely. It is also the property that makes evals essential. The whole point of a Rails LLM eval is to assert on the distribution of outputs across a curated dataset, not on a single golden string.
If you have any of the following in production, you owe yourself evals: a classification prompt routing to humans, a summarization prompt presented to a customer, a structured-extraction prompt feeding a database write, a tool-using agent making side-effectful calls, or any prompt where a silent quality drop would harm a user before anyone noticed. That is most production Rails apps that touch AI.
The cost of evals is small. The cost of a quiet regression that runs for two weeks before someone notices is enormous. Last quarter a fintech I work with had a Claude-powered transaction categorizer that silently regressed accuracy from 94 percent to 71 percent after a model upgrade. They caught it on day eleven. The cleanup took a month.
The Anatomy of a Rails LLM Eval
A useful eval has four parts: a dataset of inputs with known good outputs, a runner that calls your prompt against each input, a scorer that produces a per-example judgment, and a reporter that turns a pile of judgments into a pass-fail signal for CI. The Ruby community does not have a dominant framework for this yet, which is good news, because rolling your own is about 200 lines of code and gives you exactly the right ergonomics for your codebase.
I keep evals in test/evals/ or spec/evals/ so they are visible to engineers without being part of the unit-test run. They are a separate Rake task in CI, gated on changes to prompt files or model-configuration files, and they post their summary back to the pull request.
Golden Datasets: Where Eval Quality Lives Or Dies
The single most underrated part of evals is the dataset. A bad dataset will pass a regressed prompt. A good dataset will catch a regression a human reviewer would miss. I aim for 50 to 200 examples per prompt, sampled with intent.
# test/evals/datasets/ticket_classifier.yml
- id: billing_dispute_polite
  input: "Hi, I noticed I was charged twice for my March subscription. Can you help?"
  expected_category: billing
  expected_urgency: medium
  notes: "Polite billing dispute, common path"
- id: angry_outage_caps
  input: "EVERYTHING IS DOWN AND I HAVE A DEMO IN 20 MINUTES, FIX THIS NOW"
  expected_category: outage
  expected_urgency: critical
  notes: "Adversarial caps, critical urgency cue"
- id: feature_request_disguised
  input: "Is there a way to export my data as CSV? I really need this for my board meeting."
  expected_category: feature_request
  expected_urgency: low
  notes: "Sounds urgent but is a feature request"
Where do the examples come from? Three sources. First, the bug reports — every time a customer or an internal reviewer flags a misclassification, that example goes into the dataset before the fix ships. Second, manually constructed adversarial examples for every category boundary you care about. Third, a sampled slice of real production traffic with hand-labeled expected outputs. Treat the dataset as a long-lived asset. Version it. Review it in pull requests. Do not let it rot.
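A dataset that gets reviewed in pull requests benefits from a cheap structural check before any tokens are spent. The sketch below is one way to do that, assuming the field names from the ticket-classifier example above; `DatasetLint` and `min_per_category` are hypothetical names of mine, not part of any library.

```ruby
require "yaml"

# Hypothetical helper: lints a golden dataset before it is used in an eval run.
module DatasetLint
  REQUIRED_KEYS = %w[id input expected_category expected_urgency].freeze

  # Returns an array of human-readable problems; an empty array means the dataset looks sane.
  def self.problems(examples, min_per_category: 1)
    issues = []

    # Every example needs the fields the scorer will read.
    examples.each_with_index do |example, index|
      missing = REQUIRED_KEYS - example.keys
      issues << "example #{index} is missing #{missing.join(', ')}" unless missing.empty?
    end

    # Duplicate ids make failure reports ambiguous.
    examples.map { |e| e["id"] }.compact.tally.each do |id, count|
      issues << "duplicate id #{id}" if count > 1
    end

    # Thin categories are where regressions hide.
    examples.map { |e| e["expected_category"] }.compact.tally.each do |category, count|
      issues << "category #{category} has only #{count} example(s)" if count < min_per_category
    end

    issues
  end
end

sample = YAML.safe_load(<<~DATASET)
  - id: a
    input: "charged twice"
    expected_category: billing
    expected_urgency: medium
  - id: a
    input: "everything is down"
    expected_category: outage
DATASET

puts DatasetLint.problems(sample)
```

Running a check like this as the first step of the eval task keeps a malformed example from surfacing as a confusing `KeyError` halfway through a run.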
A Minimal Eval Runner in Pure Ruby
Here is the runner I use as a starting point on every Rails project. It is plain Ruby, parallel via threads (LLM calls are I/O bound), and produces a structured result you can pipe anywhere.
# lib/llm_evals/runner.rb
require "yaml"
require "concurrent" # provided by the concurrent-ruby gem

module LlmEvals
  class Runner
    def initialize(dataset_path:, prompt:, scorer:, concurrency: 8)
      # safe_load_file only deserializes plain data plus the permitted classes.
      @dataset = YAML.safe_load_file(dataset_path, permitted_classes: [Symbol])
      @prompt = prompt
      @scorer = scorer
      @pool = Concurrent::FixedThreadPool.new(concurrency)
    end

    # The pool is shut down when the run finishes, so build a new Runner per run.
    def run
      futures = @dataset.map do |example|
        Concurrent::Promises.future_on(@pool) do
          response = @prompt.call(example.fetch("input"))
          score = @scorer.call(example, response)
          {
            id: example.fetch("id"),
            input: example.fetch("input"),
            expected: example.except("id", "input", "notes"),
            actual: response.body,
            score: score,
            tokens: response.usage,
            latency_ms: response.latency_ms
          }
        end
      end
      # value! re-raises the first failure, so an API error aborts the run rather than being scored.
      futures.map(&:value!)
    ensure
      @pool.shutdown
      @pool.wait_for_termination
    end
  end
end
The prompt and scorer are plain objects with a call method, which means you can swap them per test and you do not need any DSL. The only contract is that the prompt returns something that responds to body, usage, and latency_ms, because the runner reads all three. Eight concurrent calls is a sensible default: high enough to keep wall-clock time low, low enough that you are unlikely to be rate-limited even on a small Anthropic tier.
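To make that contract concrete, here is a minimal sketch of a prompt object the runner could call. Everything in it is illustrative: `EvalResponse`, `TicketClassifierPrompt`, the system prompt text, and the shape of the injected `client` are assumptions of mine, so adapt them to whichever SDK or HTTP wrapper your app already uses.

```ruby
# Hypothetical value object exposing what the runner reads: body, usage, latency_ms.
EvalResponse = Struct.new(:body, :usage, :latency_ms, keyword_init: true)

# Hypothetical prompt wrapper. `client` is any callable that accepts
# `system:` and `user:` and returns a hash with :text and :usage.
class TicketClassifierPrompt
  # Placeholder system prompt, not the one from the incident described earlier.
  SYSTEM = "Classify the support ticket. Reply with JSON containing category and urgency."

  def initialize(client:)
    @client = client
  end

  def call(input)
    # A monotonic clock avoids skew if the wall clock is adjusted mid-call.
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raw = @client.call(system: SYSTEM, user: input)
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round

    EvalResponse.new(body: raw.fetch(:text), usage: raw.fetch(:usage), latency_ms: elapsed_ms)
  end
end

# A stub client keeps the example runnable without network access.
stub_client = lambda do |system:, user:|
  { text: '{"category":"billing","urgency":"medium"}',
    usage: { "input_tokens" => 42, "output_tokens" => 12 } }
end

response = TicketClassifierPrompt.new(client: stub_client).call("I was charged twice")
puts response.body
```

Injecting the client is a deliberate choice here: the same prompt object can run against a stub in unit tests and against the real API in the eval task.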
Two Scorers That Cover 80 Percent of Real Cases
Most production prompts fall into one of three buckets: classification, structured extraction, and free-form generation. The first two are scored with simple deterministic comparisons. The third needs an LLM judge — which sounds expensive and circular but works well in practice.
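# lib/llm_evals/scorers/exact_match.rb
require "json"

module LlmEvals
  module Scorers
    class ExactMatch
      def initialize(fields:)
        @fields = fields
      end

      # Compares each configured field of the model's JSON output with expected_<field>.
      def call(example, response)
        parsed = JSON.parse(response.body)
        mismatches = @fields.each_with_object({}) do |field, acc|
          expected = example.fetch("expected_#{field}")
          actual = parsed[field.to_s]
          acc[field] = { expected: expected, actual: actual } if expected != actual
        end
        { pass: mismatches.empty?, mismatches: mismatches }
      rescue JSON::ParserError => e
        # Malformed JSON counts as a failed example instead of crashing the run.
        { pass: false, mismatches: {}, error: e.message }
      end
    end
  end
end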
For free-form generation, an LLM judge with a tight rubric is the move. Use a cheaper model than the one under test, give it a short rubric with a small numeric scale per criterion (here, three criteria scored 0 to 2 each), and have it return a structured score. I use Claude Haiku to judge prompts that run on Claude Sonnet: same family, different cost class, predictable agreement with human reviewers.
# lib/llm_evals/scorers/llm_judge.rb
require "json"

module LlmEvals
  module Scorers
    class LlmJudge
      RUBRIC = <<~PROMPT
        You are grading a customer-support reply. Return JSON:
        { "factual": 0..2, "tone": 0..2, "actionable": 0..2, "notes": "..." }
        - factual: 2 if every claim is supported by the input, 0 if any hallucination.
        - tone: 2 if professional and empathetic, 0 if dismissive or robotic.
        - actionable: 2 if the customer can act on this immediately, 0 if not.
        Customer message:
        ---
        %{input}
        ---
        Proposed reply:
        ---
        %{reply}
        ---
      PROMPT

      def initialize(judge_client:)
        @judge = judge_client
      end

      def call(example, response)
        rendered = RUBRIC % { input: example.fetch("input"), reply: response.body }
        judgment = JSON.parse(@judge.complete(rendered).body)
        total = judgment.values_at("factual", "tone", "actionable").sum
        # The maximum is 6, so a threshold of 5 tolerates one lost point.
        { pass: total >= 5, judgment: judgment, total: total }
      rescue JSON::ParserError => e
        # A judge reply that is not valid JSON is recorded as a failure, not a crash.
        { pass: false, judgment: nil, total: 0, error: e.message }
      end
    end
  end
end
The judge is not infallible. Spot-check 10 percent of its grades against your own judgment when you first wire it up, then again every time you change the rubric. If the judge agrees with you 90 percent of the time, it is good enough to catch regressions.
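The spot-check is easier to keep honest if the agreement rate is computed rather than eyeballed. A minimal sketch, assuming you record the judge's pass/fail verdicts and your own verdicts keyed by example id; `judge_agreement` is a hypothetical helper name.

```ruby
# Hypothetical helper: percentage of spot-checked examples where the judge
# and the human reviewer reached the same pass/fail verdict.
def judge_agreement(judge_passes, human_passes)
  # Only ids graded by both sides can be compared.
  shared = judge_passes.keys & human_passes.keys
  return nil if shared.empty?

  agreeing = shared.count { |id| judge_passes[id] == human_passes[id] }
  (agreeing.fdiv(shared.size) * 100).round(1)
end

judge = { "reply_1" => true, "reply_2" => false, "reply_3" => true, "reply_4" => true }
human = { "reply_1" => true, "reply_2" => false, "reply_3" => false, "reply_4" => true }

rate = judge_agreement(judge, human)
puts "Judge agrees with human review on #{rate}% of spot-checked examples"
```

The examples where the two disagree are worth reading one by one; in my experience they usually point at an ambiguous rubric line rather than a bad judge.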
Wiring Evals Into CI
The eval suite runs on every pull request that touches a prompt file, a model configuration, or the eval code itself. The job posts back to the PR with three numbers: pass rate against the dataset, change in pass rate versus main, and total token cost of the run.
# .github/workflows/llm_evals.yml
name: llm-evals
on:
  pull_request:
    paths:
      - "app/prompts/**"
      - "config/llm.yml"
      - "lib/llm_evals/**"
      - "test/evals/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with: { bundler-cache: true }
      - run: bundle exec rake llm_evals:all > eval_report.json
        env:
          # The secret name is an assumption; use whatever your repository defines.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/github-script@v7
        with:
          script: |
            const report = require('./eval_report.json');
            const body = `## LLM Eval Results\n` +
              `- Pass rate: **${report.pass_rate}%** (was ${report.baseline}%)\n` +
              `- Failures: ${report.failures.length}\n` +
              `- Cost this run: $${report.cost.toFixed(2)}\n`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
Fail the build when the pass rate drops by more than two percentage points from main, or when the total cost rises by more than 50 percent. The cost guard is the one that catches you adding an inadvertent retry loop or a runaway tool-use loop before it hits production. If you are not familiar with how those loops can explode token bills, the Rails AI agents post covers the failure modes in more detail.
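Those two thresholds are simple enough to encode directly. The sketch below shows one way the Rake task could decide whether to fail the job; `EvalGate` and its constants are hypothetical names, and the report hashes are assumed to carry `:pass_rate` as a percentage and `:cost` in dollars.

```ruby
# Hypothetical gate mirroring the thresholds in the text:
# more than 2 points of pass-rate drop, or more than 50 percent cost increase.
module EvalGate
  MAX_PASS_RATE_DROP = 2.0 # percentage points
  MAX_COST_INCREASE = 0.5  # fraction of the baseline cost

  # Returns a list of violations; an empty list means the build may pass.
  def self.violations(current:, baseline:)
    problems = []

    drop = baseline.fetch(:pass_rate) - current.fetch(:pass_rate)
    problems << "pass rate dropped #{drop.round(1)} points" if drop > MAX_PASS_RATE_DROP

    base_cost = baseline.fetch(:cost).to_f
    if base_cost.positive?
      increase = (current.fetch(:cost) - base_cost) / base_cost
      problems << "cost rose #{(increase * 100).round}%" if increase > MAX_COST_INCREASE
    end

    problems
  end
end

problems = EvalGate.violations(
  current: { pass_rate: 91.0, cost: 1.10 },
  baseline: { pass_rate: 94.0, cost: 1.00 }
)
puts problems
```

Inside the Rake task, something like `abort(problems.join("\n")) unless problems.empty?` exits non-zero and fails the CI step. Where the baseline numbers come from (a stored artifact from main, or a fresh run against main) is a separate decision this sketch does not make.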
Tracking Cost and Latency, Not Just Accuracy
A prompt that is 1 percent more accurate and 4x more expensive is rarely worth shipping. Every eval run records token usage and latency per example, and the report surfaces the p50 and p95 of both. I have killed plenty of “better” prompts during review because their latency at p95 broke our SLO.
def summarize(results)
  costs = results.map { |r| token_cost(r[:tokens]) }
  latencies = results.map { |r| r[:latency_ms] }.sort
  {
    pass_rate: (results.count { |r| r[:score][:pass] }.fdiv(results.size) * 100).round(1),
    failures: results.reject { |r| r[:score][:pass] },
    cost: costs.sum.round(2),
    p50_latency_ms: latencies[latencies.size / 2],
    p95_latency_ms: latencies[(latencies.size * 0.95).floor]
  }
end
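The `token_cost` helper is not shown above, so here is a hedged sketch of what it might look like. It assumes the usage hash uses the `input_tokens` and `output_tokens` keys that the Anthropic Messages API returns; other providers name these fields differently. The rates are placeholders, not real prices, and should come from your provider's current price list.

```ruby
# Placeholder rates in dollars per million tokens. These are NOT real prices.
PLACEHOLDER_RATES = { input: 1.0, output: 5.0 }.freeze

# Hypothetical implementation of the token_cost helper that summarize calls.
# Returns the dollar cost of one call given its token usage.
def token_cost(usage, rates: PLACEHOLDER_RATES)
  input_tokens = usage.fetch("input_tokens", 0)
  output_tokens = usage.fetch("output_tokens", 0)
  (input_tokens * rates.fetch(:input) + output_tokens * rates.fetch(:output)) / 1_000_000.0
end

cost = token_cost({ "input_tokens" => 1200, "output_tokens" => 300 })
puts format("One call cost $%.4f at the placeholder rates", cost)
```

Keeping the rates in one constant (or in `config/llm.yml`) means a price change is a one-line diff, and the cost guard in CI keeps comparing like with like.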
Pair this with Anthropic prompt caching on the test runs themselves — a stable system prompt across 200 eval examples is exactly the workload caching was designed for, and it cuts the eval cost by 60 to 80 percent.
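As a rough illustration, this is the request shape that marks a system prompt as cacheable in the Anthropic Messages API: the system prompt is sent as a content block carrying `cache_control`. The sketch only builds the request body and sends nothing; `cached_eval_request` is a hypothetical helper name, and the model id is a placeholder you would replace with the model under test.

```ruby
require "json"

# Hypothetical helper: builds a Messages API request body whose system prompt
# is marked cacheable, so repeated eval examples can reuse the cached prefix.
def cached_eval_request(system_prompt:, input:, model:, max_tokens: 256)
  {
    model: model,
    max_tokens: max_tokens,
    temperature: 0, # evals want the most repeatable output available
    system: [
      # cache_control on the block asks the API to cache the prompt up to this point.
      { type: "text", text: system_prompt, cache_control: { type: "ephemeral" } }
    ],
    messages: [
      { role: "user", content: input } # only this part varies per example
    ]
  }
end

request = cached_eval_request(
  system_prompt: "Classify the support ticket. Reply with JSON.",
  input: "I was charged twice for March",
  model: "your-model-id" # placeholder, not a real model name
)
puts JSON.pretty_generate(request)
```

Caching only helps when the shared prefix is byte-for-byte identical across calls, and the provider documents a minimum prompt length for caching to apply, so a very short system prompt may see no benefit. Check the usage fields in the response to confirm cache reads are actually happening.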
Handling Non-Determinism Without Going Insane
Two strategies. First, set temperature: 0 on the model when running evals, so the same input produces the same output to the limit the provider can guarantee. This makes regression detection sharp. Second, for prompts that genuinely need higher temperature in production, run each example three times and require at least two of three to pass. It triples the eval cost but tells you whether the prompt is robust or just lucky.
def call_with_majority(example)
  responses = 3.times.map { @prompt.call(example.fetch("input")) }
  scores = responses.map { |r| @scorer.call(example, r) }
  pass_count = scores.count { |s| s[:pass] }
  { pass: pass_count >= 2, individual: scores }
end
I run cheap classification evals on every PR. I run the more expensive judge-based evals on a nightly schedule and post the trend to a Slack channel. That cadence catches the regressions that matter without burning the API budget.
What Evals Will Not Catch
Evals catch behavioral regressions on your dataset. They will not catch a prompt being misused on inputs you did not anticipate. They will not catch a model provider silently retraining the underlying model and changing its defaults. They will not catch user-perception drift where the same accuracy somehow lands differently in production. For those you need production-side monitoring — sampled human review of live outputs, customer-facing thumbs-up/thumbs-down feedback that flows back into the dataset, and a weekly eyeball pass over a slice of real outputs by someone who knows the product.
Evals are necessary. They are not sufficient. They are the floor that lets you ship prompt changes faster than once a quarter.
FAQ
How many examples does a Rails LLM eval dataset need?
For classification and structured-extraction prompts, 50 examples is a usable starting floor and 150 to 200 is comfortable. For free-form generation graded by an LLM judge, start at 30 examples because each run is expensive. The metric to watch is whether the pass rate moves meaningfully when you intentionally regress the prompt during testing. If a known-bad prompt still passes, the dataset is too small or too easy.
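That check can be automated. The sketch below compares the pass rate of the current prompt with a deliberately broken one and reports whether the dataset notices. All names are hypothetical, the 10-point `min_drop` is an arbitrary assumption, and the stub prompts return plain strings rather than full response objects to keep the example short.

```ruby
# Percentage of examples that pass for a given prompt and scorer.
def pass_rate(examples, prompt, scorer)
  passes = examples.count do |example|
    scorer.call(example, prompt.call(example.fetch("input")))[:pass]
  end
  (passes.fdiv(examples.size) * 100).round(1)
end

# Hypothetical sanity check: true when a known-bad prompt scores at least
# min_drop percentage points below the good prompt on this dataset.
def dataset_sensitive?(examples, good_prompt:, bad_prompt:, scorer:, min_drop: 10.0)
  good = pass_rate(examples, good_prompt, scorer)
  bad = pass_rate(examples, bad_prompt, scorer)
  good - bad >= min_drop
end

examples = [
  { "input" => "charged twice", "expected_category" => "billing" },
  { "input" => "site is down", "expected_category" => "outage" }
]
scorer = ->(example, response) { { pass: response == example["expected_category"] } }
good_prompt = ->(input) { input.include?("down") ? "outage" : "billing" }
bad_prompt = ->(_input) { "billing" } # simulates a regression that collapses categories

puts dataset_sensitive?(examples, good_prompt: good_prompt, bad_prompt: bad_prompt, scorer: scorer)
```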
Should Rails LLM evals run on every commit or only on prompt changes?
Gate them on changes to prompt files, model configuration, and the eval code itself, plus a nightly full run. Running them on every commit burns the API budget for changes that cannot affect outcomes. The nightly run catches drift from the model provider’s side — Anthropic and OpenAI both ship silent improvements to their models, and your evals will tell you when one of those improvements is actually a regression for your use case.
How do you handle Rails LLM evals when the model output is non-deterministic?
Two approaches that compose well. Set temperature: 0 for evals so the model is as deterministic as the provider allows. For prompts that need higher temperature in production, run each example three times and require a majority pass. The latter triples the eval cost but tells you whether the prompt is genuinely robust or whether it occasionally produces a good output by luck.
Are Rails LLM evals worth it for low-volume internal prompts?
Yes, with a smaller dataset. A 20-example eval for an internal prompt that is called 200 times a day still catches the kind of silent regression that quietly erodes trust in the AI feature among the people who use it. Internal users are quieter about quality drops than paying customers, which makes the regression harder to detect through normal feedback loops.
Building or maintaining LLM features in a Rails app and want a structured eval setup that catches regressions before users do? TTB Software builds Rails LLM eval infrastructure for production AI systems. Nineteen years of Rails, three years of LLMs in production, fixed-fee delivery.