Rails LLM Evals: Testing Prompts in CI Before They Break Production

Rails LLM evals catch prompt regressions before they ship to production. Build golden datasets, score CI runs, and track token costs for Claude and OpenAI.

A Rails team I advise shipped a one-line change to a customer-support classifier prompt on a Wednesday afternoon. The prompt had been quietly handling 18,000 inbound tickets a day for nine months. By Thursday morning, urgent escalations were being routed to the billing inbox, billing disputes were going to the engineering on-call rotation, and the support lead was sitting at her desk wondering whether she had lost her mind. The change was three words. They thought it would make the prompt more concise. It silently flipped the meaning of one of the categories the model had to predict. There was no test. There was no eval. There was just the diff, the deploy, and the wreckage. We spent a Friday writing Rails LLM evals the way we should have written them on day one, and they have not had a silent regression since.

After nineteen years of Rails I have stopped being surprised by what production code does. After three years of running LLMs in production, the only thing that still surprises me is how casually teams treat prompt changes. A prompt is a function. It has inputs, it has outputs, it has a behavior contract, and that contract can be broken by a comma. We test our Ruby. We test our SQL. We have to test our prompts.

Why Rails LLM Evals Are Not Optional Anymore

A traditional test asserts that a function returns a deterministic value. An LLM call does not return a deterministic value. Run the same prompt twice, get two different strings. That is the property that makes people throw up their hands and skip testing entirely. It is also the property that makes evals essential. The whole point of a Rails LLM eval is to assert on the distribution of outputs across a curated dataset, not on a single golden string.

If you have any of the following in production, you owe yourself evals: a classification prompt routing to humans, a summarization prompt presented to a customer, a structured-extraction prompt feeding a database write, a tool-using agent making side-effectful calls, or any prompt where a silent quality drop would harm a user before anyone noticed. That is most production Rails apps that touch AI.

The cost of evals is small. The cost of a quiet regression that runs for two weeks before someone notices is enormous. Last quarter a fintech I work with had a Claude-powered transaction categorizer whose accuracy silently dropped from 94 percent to 71 percent after a model upgrade. They caught it on day eleven. The cleanup took a month.

The Anatomy of a Rails LLM Eval

A useful eval has four parts: a dataset of inputs with known good outputs, a runner that calls your prompt against each input, a scorer that produces a per-example judgment, and a reporter that turns a pile of judgments into a pass-fail signal for CI. The Ruby community does not have a dominant framework for this yet, which is good news, because rolling your own is about 200 lines of code and gives you exactly the right ergonomics for your codebase.

I keep evals in test/evals/ or spec/evals/ so they are visible to engineers without being part of the unit-test run. They are a separate Rake task in CI, gated on changes to prompt files or model-configuration files, and they post their summary back to the pull request.

Golden Datasets: Where Eval Quality Lives Or Dies

The single most underrated part of evals is the dataset. A bad dataset will pass a regressed prompt. A good dataset will catch a regression a human reviewer would miss. I aim for 50 to 200 examples per prompt, sampled with intent.

# test/evals/datasets/ticket_classifier.yml
- id: billing_dispute_polite
  input: "Hi, I noticed I was charged twice for my March subscription. Can you help?"
  expected_category: billing
  expected_urgency: medium
  notes: "Polite billing dispute, common path"

- id: angry_outage_caps
  input: "EVERYTHING IS DOWN AND I HAVE A DEMO IN 20 MINUTES, FIX THIS NOW"
  expected_category: outage
  expected_urgency: critical
  notes: "Adversarial caps, critical urgency cue"

- id: feature_request_disguised
  input: "Is there a way to export my data as CSV? I really need this for my board meeting."
  expected_category: feature_request
  expected_urgency: low
  notes: "Sounds urgent but is a feature request"

Where do the examples come from? Three sources. First, the bug reports — every time a customer or an internal reviewer flags a misclassification, that example goes into the dataset before the fix ships. Second, manually constructed adversarial examples for every category boundary you care about. Third, a sampled slice of real production traffic with hand-labeled expected outputs. Treat the dataset as a long-lived asset. Version it. Review it in pull requests. Do not let it rot.
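
Because the dataset is reviewed in pull requests, a small structural check can run before any API call is made. The following is a minimal sketch, not part of the original setup: the `dataset_errors` helper, the `REQUIRED_KEYS` list, and the `allowed_categories` parameter are hypothetical names, and the key names simply mirror the YAML shown above.

```ruby
# Hypothetical validator for a golden dataset. Key names mirror the
# ticket_classifier.yml example; adjust them for other prompts.
REQUIRED_KEYS = %w[id input expected_category expected_urgency].freeze

# Returns an array of human-readable problems; empty means the dataset
# is structurally sound.
def dataset_errors(examples, allowed_categories:)
  errors = []
  seen_ids = Hash.new(0)

  examples.each_with_index do |example, index|
    label = example["id"] || "example ##{index}"

    missing = REQUIRED_KEYS.reject { |key| example.key?(key) }
    errors << "#{label}: missing #{missing.join(', ')}" unless missing.empty?

    category = example["expected_category"]
    if category && !allowed_categories.include?(category)
      errors << "#{label}: unknown category #{category.inspect}"
    end

    seen_ids[example["id"]] += 1 if example["id"]
  end

  seen_ids.each do |id, count|
    errors << "#{id}: duplicate id (#{count} entries)" if count > 1
  end
  errors
end
```

One way to use it is to load the YAML file, pass the parsed array to `dataset_errors` with the categories your classifier is allowed to emit, and abort the eval run if the result is not empty. A typo in an expected label otherwise shows up as a prompt failure, which is a confusing way to find it.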

A Minimal Eval Runner in Pure Ruby

Here is the runner I use as a starting point on every Rails project. It is plain Ruby, parallel via threads (LLM calls are I/O bound), and produces a structured result you can pipe anywhere.

# lib/llm_evals/runner.rb
require "yaml"
require "concurrent"

module LlmEvals
  class Runner
    def initialize(dataset_path:, prompt:, scorer:, concurrency: 8)
      @dataset = YAML.safe_load_file(dataset_path, permitted_classes: [Symbol])
      @prompt = prompt
      @scorer = scorer
      @pool = Concurrent::FixedThreadPool.new(concurrency)
    end

    def run
      futures = @dataset.map do |example|
        Concurrent::Promises.future_on(@pool) do
          response = @prompt.call(example.fetch("input"))
          score = @scorer.call(example, response)
          {
            id: example.fetch("id"),
            input: example.fetch("input"),
            expected: example.except("id", "input", "notes"),
            actual: response.body,
            score: score,
            tokens: response.usage,
            latency_ms: response.latency_ms
          }
        end
      end
      futures.map(&:value!)
    ensure
      @pool.shutdown
      @pool.wait_for_termination
    end
  end
end

The prompt and scorer are plain objects with a call method, which means you can swap them per test and you do not need any DSL. Eight concurrent calls is a sensible default: high enough to keep wall-clock time low, and low enough that you are unlikely to be rate-limited on a small Anthropic tier, though the exact limits depend on your account.
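
The runner reads three things from whatever the prompt returns: `body`, `usage`, and `latency_ms`. Here is a rough sketch of a prompt object that satisfies that contract. Everything in it is an assumption made for illustration: `LlmResponse`, `TicketClassifierPrompt`, `FakeClient`, and especially the `complete(system:, user:)` client interface, which is not a real SDK signature.

```ruby
# Hypothetical response shape matching what the runner reads.
LlmResponse = Struct.new(:body, :usage, :latency_ms, keyword_init: true)

# Hypothetical prompt wrapper. `client` is any object responding to
# complete(system:, user:) and returning a hash with :text and :usage.
# That interface is an assumption, not a real SDK call.
class TicketClassifierPrompt
  SYSTEM = "Classify the support ticket. Reply with JSON: " \
           '{"category": "...", "urgency": "..."}'

  def initialize(client:)
    @client = client
  end

  def call(input)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raw = @client.complete(system: SYSTEM, user: input)
    finished = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    LlmResponse.new(
      body: raw.fetch(:text),
      usage: raw.fetch(:usage),
      latency_ms: ((finished - started) * 1000).round
    )
  end
end

# Offline stand-in for a real client, useful for testing the harness itself.
class FakeClient
  def complete(system:, user:)
    { text: '{"category": "billing", "urgency": "medium"}',
      usage: { input_tokens: 42, output_tokens: 12 } }
  end
end
```

Keeping the provider call behind a small client object means the eval harness can be exercised in ordinary unit tests with `FakeClient`, and only the CI eval job needs a real API key.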

Two Scorers That Cover 80 Percent of Real Cases

Most production prompts fall into one of three buckets: classification, structured extraction, and free-form generation. The first two are scored with simple deterministic comparisons. The third needs an LLM judge — which sounds expensive and circular but works well in practice.

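# lib/llm_evals/scorers/exact_match.rb
require "json"

module LlmEvals
  module Scorers
    # Compares selected fields of a JSON response with the dataset's
    # expected_<field> values.
    class ExactMatch
      def initialize(fields:)
        @fields = fields
      end

      def call(example, response)
        parsed = JSON.parse(response.body)
        mismatches = @fields.each_with_object({}) do |field, acc|
          expected = example.fetch("expected_#{field}")
          actual = parsed[field.to_s]
          acc[field] = { expected: expected, actual: actual } if expected != actual
        end
        { pass: mismatches.empty?, mismatches: mismatches }
      rescue JSON::ParserError => e
        # Malformed JSON counts as a failure rather than crashing the whole run.
        { pass: false, mismatches: {}, error: e.message }
      end
    end
  end
end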

For free-form generation, an LLM judge with a tight rubric is the move. Use a cheaper model than the one under test, give it a small rubric (three criteria scored 0 to 2 each in the example that follows), and have it return a structured score. I use Claude Haiku to judge prompts that run on Claude Sonnet: same family, different cost class, predictable agreement with human reviewers.

# lib/llm_evals/scorers/llm_judge.rb
require "json"

module LlmEvals
  module Scorers
    class LlmJudge
      RUBRIC = <<~PROMPT
        You are grading a customer-support reply. Return JSON:
        { "factual": 0..2, "tone": 0..2, "actionable": 0..2, "notes": "..." }

        - factual: 2 if every claim is supported by the input, 0 if any hallucination.
        - tone: 2 if professional and empathetic, 0 if dismissive or robotic.
        - actionable: 2 if the customer can act on this immediately, 0 if not.

        Customer message:
        ---
        %{input}
        ---

        Proposed reply:
        ---
        %{reply}
        ---
      PROMPT

      def initialize(judge_client:)
        @judge = judge_client
      end

      def call(example, response)
        rendered = RUBRIC % { input: example.fetch("input"), reply: response.body }
        judgment = JSON.parse(@judge.complete(rendered).body)
        total = judgment.values_at("factual", "tone", "actionable").sum
        { pass: total >= 5, judgment: judgment, total: total }
      end
    end
  end
end

The judge is not infallible. Spot-check 10 percent of its grades against your own judgment when you first wire it up, then again every time you change the rubric. If the judge agrees with you 90 percent of the time, it is good enough to catch regressions.
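
The spot-check can be reduced to a number. This is a minimal sketch that assumes you have recorded a pass or fail verdict per example id for both the judge and a human reviewer. The `judge_agreement` helper and its input shape are hypothetical.

```ruby
# Hypothetical spot-check: compare the judge's pass/fail verdicts with
# human verdicts, keyed by example id. Only ids present in both are scored.
def judge_agreement(judge_verdicts, human_verdicts)
  shared_ids = judge_verdicts.keys & human_verdicts.keys
  return nil if shared_ids.empty?

  disagreements = shared_ids.reject { |id| judge_verdicts[id] == human_verdicts[id] }
  agreed = shared_ids.size - disagreements.size

  {
    rate: (agreed.fdiv(shared_ids.size) * 100).round(1),
    disagreements: disagreements # ids worth reading by hand
  }
end
```

The list of disagreeing ids is usually more useful than the rate itself: reading those examples tends to show whether the rubric is ambiguous or the judge is simply wrong.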

Wiring Evals Into CI

The eval suite runs on every pull request that touches a prompt file, a model configuration, or the eval code itself. The job posts back to the PR with three numbers: pass rate against the dataset, change in pass rate versus main, and total token cost of the run.

# .github/workflows/llm_evals.yml
name: llm-evals
on:
  pull_request:
    paths:
      - "app/prompts/**"
      - "config/llm.yml"
      - "lib/llm_evals/**"
      - "test/evals/**"

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with: { bundler-cache: true }
      - run: bundle exec rake llm_evals:all > eval_report.json
        env:
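          # Secret name assumed; the original expression was lost in rendering.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}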
      - uses: actions/github-script@v7
        with:
          script: |
            const report = require('./eval_report.json');
            const body = `## LLM Eval Results\n` +
              `- Pass rate: **${report.pass_rate}%** (was ${report.baseline}%)\n` +
              `- Failures: ${report.failures.length}\n` +
              `- Cost this run: $${report.cost.toFixed(2)}\n`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });

Fail the build when the pass rate drops by more than two percentage points from main, or when the total cost rises by more than 50 percent. The cost guard is the one that catches you adding an inadvertent retry loop or a runaway tool-use loop before it hits production. If you are not familiar with how those loops can explode token bills, the Rails AI agents post covers the failure modes in more detail.
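
Those two thresholds fit in a small function that the Rake task can call before it exits. The following is a sketch under stated assumptions: `eval_gate` is a hypothetical name, and both arguments are assumed to be hashes with `:pass_rate` (a percentage) and `:cost` (in dollars), with the baseline coming from a stored run on main.

```ruby
# Hypothetical CI gate for the two thresholds described above.
# Fails when the pass rate drops by more than max_pass_drop percentage
# points, or when cost rises by more than max_cost_increase (0.5 = 50%).
def eval_gate(current:, baseline:, max_pass_drop: 2.0, max_cost_increase: 0.5)
  reasons = []

  drop = baseline.fetch(:pass_rate) - current.fetch(:pass_rate)
  reasons << format("pass rate dropped %.1f points", drop) if drop > max_pass_drop

  base_cost = baseline.fetch(:cost)
  if base_cost.positive?
    increase = (current.fetch(:cost) - base_cost) / base_cost.to_f
    reasons << format("cost rose %.0f%%", increase * 100) if increase > max_cost_increase
  end

  { ok: reasons.empty?, reasons: reasons }
end
```

In the Rake task, printing the reasons and calling `exit(1)` when `ok` is false is enough to turn the CI job red. Returning the reasons instead of raising keeps the function easy to unit test.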

Tracking Cost and Latency, Not Just Accuracy

A prompt that is 1 percent more accurate and 4x more expensive is rarely worth shipping. Every eval run records token usage and latency per example, and the report surfaces total cost alongside p50 and p95 latency. I have killed plenty of “better” prompts during review because their latency at p95 broke our SLO.

def summarize(results)
  costs = results.map { |r| token_cost(r[:tokens]) }
  latencies = results.map { |r| r[:latency_ms] }.sort
  {
    pass_rate: (results.count { |r| r[:score][:pass] }.fdiv(results.size) * 100).round(1),
    failures: results.reject { |r| r[:score][:pass] },
    cost: costs.sum.round(2),
    p50_latency_ms: latencies[latencies.size / 2],
    p95_latency_ms: latencies[(latencies.size * 0.95).floor]
  }
end

Pair this with Anthropic prompt caching on the test runs themselves — a stable system prompt across 200 eval examples is exactly the workload caching was designed for, and it cuts the eval cost by 60 to 80 percent.
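
The summarize method above calls a token_cost helper that is not shown. One possible shape for it is sketched below. The prices are placeholders, not real provider rates, and the usage hash keys (`input_tokens`, `output_tokens`) are an assumption about what your client returns.

```ruby
# Placeholder per-million-token prices in USD. These numbers are NOT real
# provider prices; replace them with the current rates for your model.
PRICE_PER_MILLION = { input_tokens: 1.00, output_tokens: 5.00 }.freeze

# Hypothetical helper matching the token_cost call in summarize.
# Assumes usage looks like { input_tokens: 1200, output_tokens: 300 }.
# Token kinds missing from the usage hash are treated as zero.
def token_cost(usage, prices: PRICE_PER_MILLION)
  prices.sum do |kind, price|
    usage.fetch(kind, 0) * price / 1_000_000.0
  end
end
```

If you use prompt caching, cached reads are typically billed at a different rate from regular input tokens, so they would need their own entry in the price table for the cost figure to stay honest.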

Handling Non-Determinism Without Going Insane

Two strategies. First, set temperature: 0 on the model when running evals, so the same input produces the same output to the limit the provider can guarantee. This makes regression detection sharp. Second, for prompts that genuinely need higher temperature in production, run each example three times and require at least two of three to pass. It triples the eval cost but tells you whether the prompt is robust or just lucky.

def call_with_majority(example)
  responses = 3.times.map { @prompt.call(example.fetch("input")) }
  scores = responses.map { |r| @scorer.call(example, r) }
  pass_count = scores.count { |s| s[:pass] }
  { pass: pass_count >= 2, individual: scores }
end

I run cheap classification evals on every PR. I run the more expensive judge-based evals on a nightly schedule and post the trend to a Slack channel. That cadence catches the regressions that matter without burning the API budget.

What Evals Will Not Catch

Evals catch behavioral regressions on your dataset. They will not catch a prompt being misused on inputs you did not anticipate. They will not catch a model provider silently retraining the underlying model and changing its defaults. They will not catch user-perception drift where the same accuracy somehow lands differently in production. For those you need production-side monitoring — sampled human review of live outputs, customer-facing thumbs-up/thumbs-down feedback that flows back into the dataset, and a weekly eyeball pass over a slice of real outputs by someone who knows the product.
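
Getting flagged production outputs back into the dataset is mostly a formatting exercise. Here is a minimal sketch, assuming feedback arrives as a hash with the keys shown. The key names and both helper methods are hypothetical, and the resulting entry should still go through human review in a pull request.

```ruby
require "yaml"

# Hypothetical conversion of a flagged production output into a dataset
# entry. The feedback keys are assumptions for illustration.
def dataset_entry_from_feedback(feedback)
  {
    "id" => "prod_#{feedback.fetch(:ticket_id)}",
    "input" => feedback.fetch(:input),
    "expected_category" => feedback.fetch(:corrected_category),
    "expected_urgency" => feedback.fetch(:corrected_urgency),
    "notes" => "From production feedback; model predicted " \
               "#{feedback.fetch(:predicted_category)}"
  }
end

# Returns the dataset YAML with the entry appended, or the text unchanged
# when an entry with the same id already exists.
def append_to_dataset(yaml_text, entry)
  examples = YAML.safe_load(yaml_text) || []
  return yaml_text if examples.any? { |example| example["id"] == entry["id"] }

  (examples + [entry]).to_yaml
end
```

Deriving the id from the ticket keeps repeated reports of the same ticket from creating duplicates.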

Evals are necessary. They are not sufficient. They are the floor that lets you ship prompt changes faster than once a quarter.

FAQ

How many examples does a Rails LLM eval dataset need?

For classification and structured-extraction prompts, 50 examples is a usable starting floor and 150 to 200 is comfortable. For free-form generation graded by an LLM judge, start at 30 examples because each run is expensive. The metric to watch is whether the pass rate moves meaningfully when you intentionally regress the prompt during testing. If a known-bad prompt still passes, the dataset is too small or too easy.
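
The known-bad-prompt check can be automated. This is a sketch rather than a fixed recipe: `sensitivity_check`, `pass_rate_for`, and the 10-point `min_gap` default are hypothetical, and prompts are modeled as callables that return a category string so the example stays self-contained.

```ruby
# Percentage of examples where the prompt's output equals the expected
# category. Prompts here are simple callables returning a category string.
def pass_rate_for(prompt, examples)
  return 0.0 if examples.empty?

  passed = examples.count do |example|
    prompt.call(example.fetch("input")) == example.fetch("expected_category")
  end
  (passed.fdiv(examples.size) * 100).round(1)
end

# Hypothetical sensitivity check: a deliberately broken prompt should score
# at least min_gap percentage points below the real one. If it does not,
# the dataset is too small or too easy.
def sensitivity_check(good_prompt:, broken_prompt:, examples:, min_gap: 10.0)
  good = pass_rate_for(good_prompt, examples)
  broken = pass_rate_for(broken_prompt, examples)
  { good: good, broken: broken, sensitive: (good - broken) >= min_gap }
end
```

A reasonable broken prompt is one that reintroduces a regression you have already seen, such as swapping two category definitions.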

Should Rails LLM evals run on every commit or only on prompt changes?

Gate them on changes to prompt files, model configuration, and the eval code itself, plus a nightly full run. Running them on every commit burns the API budget for changes that cannot affect outcomes. The nightly run catches drift from the model provider’s side — Anthropic and OpenAI both ship silent improvements to their models, and your evals will tell you when one of those improvements is actually a regression for your use case.

How do you handle Rails LLM evals when the model output is non-deterministic?

Two approaches that compose well. Set temperature: 0 for evals so the model is as deterministic as the provider allows. For prompts that need higher temperature in production, run each example three times and require a majority pass. The latter triples the eval cost but tells you whether the prompt is genuinely robust or whether it occasionally produces a good output by luck.

Are Rails LLM evals worth it for low-volume internal prompts?

Yes, with a smaller dataset. A 20-example eval for an internal prompt that is called 200 times a day still catches the kind of silent regression that quietly erodes trust in the AI feature among the people who use it. Internal users are quieter about quality drops than paying customers, which makes the regression harder to detect through normal feedback loops.

Building or maintaining LLM features in a Rails app and want a structured eval setup that catches regressions before users do? TTB Software builds Rails LLM eval infrastructure for production AI systems. Nineteen years of Rails, three years of LLMs in production, fixed-fee delivery.
