Rails RAG: Build a Production Retrieval Augmented Generation System with Claude and pgvector

Roger Heykoop
AI in Rails, Ruby on Rails
Rails RAG guide: build a production retrieval augmented generation pipeline with pgvector, Claude and streaming. Real code, real chunking, real tradeoffs.

A fintech client called me in January because their support team was drowning. They had six years of policy PDFs, internal runbooks, compliance memos, and Confluence pages — and every new hire spent three weeks learning where anything lived. “Can ChatGPT just answer questions against our docs?” they asked. That is the sentence that launches every RAG project. And after nineteen years of Rails, I can tell you: building Rails RAG that actually works in production is less about the model and more about the plumbing around it.

This post is the playbook I wish I had when I built that first system. We will wire up a full Rails RAG pipeline with pgvector for storage, Claude for generation, and the kind of production concerns nobody warns you about until you get paged at two in the morning.

What Rails RAG Actually Is

RAG — Retrieval Augmented Generation — is the pattern where you do not fine-tune a model on your documents. Instead, when a user asks a question, you retrieve the relevant passages from your own data and feed them into the prompt so the model can answer grounded in your content.

A Rails RAG system has three stages:

  1. Ingestion. Chunk your documents, embed each chunk, store the vectors.
  2. Retrieval. When a question comes in, embed it, look up the nearest chunks, assemble a context.
  3. Generation. Pass the question plus retrieved context to an LLM like Claude and stream the answer.

Every single one of those steps has a production trap. Let's walk through them.
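Before diving in, the three stages can be sketched as one pipeline object. The interfaces here are illustrative — the concrete classes later in this post take slightly different arguments — but the shape is the point: retrieval and generation are injected, so each stage stays swappable.

```ruby
# Hypothetical sketch: the three RAG stages composed into one pipeline.
# The retriever and answerer are any objects responding to #call, so
# you can swap a reranking retriever in without touching this class.
class RagPipeline
  def initialize(retriever:, answerer:)
    @retriever = retriever
    @answerer = answerer
  end

  def answer(question, &stream)
    # Stages 1-2: embed the question, fetch the nearest chunks.
    chunks = @retriever.call(question)
    # Stage 3: generate an answer grounded in those chunks,
    # streaming tokens to the caller's block if one is given.
    @answerer.call(question, chunks, &stream)
  end
end
```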

Why Rails Is a Great Fit for Retrieval Augmented Generation

Most RAG tutorials are written in Python with LangChain and assume you have a data science team. Rails teams tend to see that stack and assume RAG is not for them. That is backwards. Postgres with pgvector handles embeddings beautifully, Active Job handles ingestion, Active Record handles chunk metadata, and ActionController::Live handles streaming. You do not need a new service. You need a new table.

I have built three production RAG systems in Rails now. None of them introduced a new runtime. All three ran on the existing Postgres the app already had.

Ingestion: Chunk, Embed, Store

The single highest-leverage decision in Rails RAG is how you chunk your documents. Chunks that are too big dilute the relevant signal. Chunks that are too small lose context. My default: 800 tokens with 100 tokens of overlap, measured by a real tokenizer, not character counts.

Here is the schema. Assumes you have already set up pgvector — if not, start with the pgvector and semantic search guide.

class CreateDocumentChunks < ActiveRecord::Migration[8.0]
  def change
    create_table :document_chunks do |t|
      t.references :document, null: false, foreign_key: true
      t.text :content, null: false
      t.integer :position, null: false
      t.integer :token_count, null: false
      t.jsonb :metadata, default: {}
      t.column :embedding, :vector, limit: 1536
      t.timestamps
    end

    add_index :document_chunks,
              :embedding,
              using: :hnsw,
              opclass: :vector_cosine_ops,
              name: "index_chunks_on_embedding_hnsw"
  end
end

The chunker. I use the tiktoken_ruby gem because OpenAI’s text-embedding-3-small is still the best price-per-quality embedding on the market, and its tokenizer is the one we need to respect.

class Chunker
  CHUNK_TOKENS = 800
  OVERLAP_TOKENS = 100

  def initialize(text)
    @text = text
    @encoder = Tiktoken.encoding_for_model("text-embedding-3-small")
  end

  def call
    tokens = @encoder.encode(@text)
    stride = CHUNK_TOKENS - OVERLAP_TOKENS
    chunks = []

    (0...tokens.length).step(stride) do |start|
      window = tokens[start, CHUNK_TOKENS]
      break if window.nil? || window.empty?
      chunks << {
        content: @encoder.decode(window),
        token_count: window.length,
        position: chunks.length
      }
      break if start + CHUNK_TOKENS >= tokens.length
    end

    chunks
  end
end

Do not ingest in the web request. Embedding API calls are slow, rate-limited, and occasionally flaky. Push ingestion into a background job — Solid Queue or Sidekiq both work fine.

class IngestDocumentJob < ApplicationJob
  queue_as :ingest

  def perform(document_id)
    document = Document.find(document_id)
    chunks = Chunker.new(document.body).call

    embeddings = EmbeddingClient.new.embed_batch(chunks.map { _1[:content] })

    DocumentChunk.transaction do
      document.document_chunks.delete_all
      chunks.each_with_index do |chunk, i|
        document.document_chunks.create!(
          chunk.merge(embedding: embeddings[i])
        )
      end
    end
  end
end

Two production notes that matter. First: batch your embedding calls. OpenAI accepts up to 2048 inputs per request. Calling the API once per chunk will burn your rate limit and your budget. Second: wrap the delete-and-recreate in a transaction, otherwise a mid-ingest crash leaves you with half a document and you will not notice until a user asks about the missing half.
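The EmbeddingClient referenced in the job could look like the sketch below. The class name and interface are this post's own, not an official SDK; the request shape follows OpenAI's embeddings endpoint, and the batching logic is split into a pure method so the 2048-input limit is enforced in one place.

```ruby
require "net/http"
require "json"

# Minimal sketch of an embedding client (illustrative, not an official SDK).
# Batches inputs so no single request exceeds OpenAI's per-request limit.
class EmbeddingClient
  BATCH_SIZE = 2048  # OpenAI's maximum inputs per embeddings request
  ENDPOINT = URI("https://api.openai.com/v1/embeddings")

  def initialize(api_key: ENV.fetch("OPENAI_API_KEY", nil))
    @api_key = api_key
  end

  # Pure: split texts into API-sized batches. Testable without network.
  def self.batches(texts)
    texts.each_slice(BATCH_SIZE).to_a
  end

  def embed_batch(texts)
    self.class.batches(texts).flat_map { |batch| request(batch) }
  end

  def embed(text)
    embed_batch([text]).first
  end

  private

  def request(inputs)
    http = Net::HTTP.new(ENDPOINT.host, ENDPOINT.port)
    http.use_ssl = true
    req = Net::HTTP::Post.new(
      ENDPOINT,
      "Content-Type" => "application/json",
      "Authorization" => "Bearer #{@api_key}"
    )
    req.body = { model: "text-embedding-3-small", input: inputs }.to_json
    body = JSON.parse(http.request(req).body)
    # The API returns embeddings in input order under "data".
    body.fetch("data").map { |d| d["embedding"] }
  end
end
```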

Retrieval: The Query Pipeline

Retrieval is where naive RAG implementations fall apart. Embedding the user’s raw question and looking up the top five chunks works fine for demos. It fails the moment a real user types “what about the refund stuff we changed last year” because that question has no semantic overlap with the actual policy text.

The pattern that holds up in production is: rewrite, retrieve, rerank.

class RagRetriever
  def initialize(question, user:)
    @question = question
    @user = user
  end

  def call
    rewritten = rewrite_question
    candidates = vector_search(rewritten, limit: 20)
    rerank(candidates, original: @question)
  end

  private

  def rewrite_question
    ClaudeClient.new.complete(
      system: "Rewrite the user question as a standalone search query. No preamble.",
      user: @question,
      max_tokens: 120
    )
  end

  def vector_search(text, limit:)
    vector = EmbeddingClient.new.embed(text)
    # Quote the vector literal instead of interpolating it raw into SQL.
    ordered = ActiveRecord::Base.sanitize_sql(["embedding <=> ?", vector.to_s])
    DocumentChunk
      .joins(:document)
      .where(documents: { tenant_id: @user.tenant_id })
      .order(Arel.sql(ordered))
      .limit(limit)
  end

  def rerank(candidates, original:)
    pairs = candidates.map { |c| [original, c.content] }
    scores = RerankerClient.new.score(pairs)
    candidates.zip(scores).sort_by { -_2 }.first(6).map(&:first)
  end
end

Three things deserve emphasis here. The tenant scope in vector_search is not optional — every production RAG bug I have debugged eventually traced back to someone retrieving another tenant’s chunks. The reranker (Cohere, Jina, or a small local cross-encoder) nearly doubles the quality of the top three results and is worth the extra 80ms. And the question rewrite step turns conversational follow-ups into real search queries, which is the single biggest win over naive retrieval.
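The ClaudeClient that the rewrite step leans on is thin. A minimal non-streaming sketch is below — the class name and interface are this post's own, the request shape follows Anthropic's Messages API, and the payload builder is split out as a pure method so it can be exercised without network access.

```ruby
require "net/http"
require "json"

# Minimal sketch of the ClaudeClient#complete used for question rewriting.
# Illustrative, not an official SDK; pin the model id you actually use.
class ClaudeClient
  ENDPOINT = URI("https://api.anthropic.com/v1/messages")
  MODEL = "claude-sonnet-4-5"  # illustrative model id

  def initialize(api_key: ENV.fetch("ANTHROPIC_API_KEY", nil))
    @api_key = api_key
  end

  # Pure payload construction: system prompt plus a single user turn.
  def build_payload(system:, user:, max_tokens:)
    {
      model: MODEL,
      max_tokens: max_tokens,
      system: system,
      messages: [{ role: "user", content: user }]
    }
  end

  def complete(system:, user:, max_tokens: 1024)
    http = Net::HTTP.new(ENDPOINT.host, ENDPOINT.port)
    http.use_ssl = true
    req = Net::HTTP::Post.new(
      ENDPOINT,
      "content-type" => "application/json",
      "x-api-key" => @api_key,
      "anthropic-version" => "2023-06-01"
    )
    req.body = build_payload(
      system: system, user: user, max_tokens: max_tokens
    ).to_json
    body = JSON.parse(http.request(req).body)
    # The Messages API returns content blocks; join the text ones.
    body.fetch("content")
        .select { |b| b["type"] == "text" }
        .map { |b| b["text"] }
        .join
  end
end
```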

Generation: Feeding Context to Claude

I default to Claude Sonnet 4.6 for the generation step in Rails RAG. It is the best-behaved model I have used when you tell it “answer only from the provided context.” Opus is overkill for most RAG. Haiku is fast but hallucinates more when the context is ambiguous.

class RagAnswerer
  SYSTEM_PROMPT = <<~PROMPT
    You are a support assistant. Answer ONLY using the provided context.
    If the context does not contain the answer, say "I don't have that in
    my knowledge base" and stop. Cite the source document ID for every claim
    using the format [doc:123].
  PROMPT

  def initialize(question:, chunks:)
    @question = question
    @chunks = chunks
  end

  def call(&block)
    context = @chunks.map { |c|
      "[doc:#{c.document_id}] #{c.content}"
    }.join("\n\n---\n\n")

    ClaudeClient.new.stream(
      system: SYSTEM_PROMPT,
      user: "Context:\n\n#{context}\n\nQuestion: #{@question}",
      max_tokens: 1024,
      &block
    )
  end
end

Stream the output straight to the browser. I covered this pattern in detail in streaming LLM responses with ActionController::Live, but the short version is: RAG without streaming feels broken because the model thinks for three seconds before the first token. With streaming, the user sees an answer forming in under 400ms.
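The wire format you write to response.stream is plain Server-Sent Events framing. A small writer that targets any IO-like object keeps the framing testable outside Rails; the controller shape it slots into is sketched in the comment (RagRetriever and RagAnswerer are the classes from this post).

```ruby
require "json"

# Minimal SSE framing for streaming tokens to the browser. Works against
# any IO-like object, so it can be exercised with StringIO in tests.
class SseWriter
  def initialize(stream)
    @stream = stream
  end

  # Each SSE event is "data: <payload>\n\n"; the blank line ends the event.
  def write(payload)
    @stream.write("data: #{payload.to_json}\n\n")
  end

  def close
    @stream.close
  end
end

# Inside ActionController::Live this would be used roughly like:
#
#   class RagAnswersController < ApplicationController
#     include ActionController::Live
#
#     def create
#       response.headers["Content-Type"] = "text/event-stream"
#       sse = SseWriter.new(response.stream)
#       chunks = RagRetriever.new(params[:question], user: current_user).call
#       RagAnswerer.new(question: params[:question], chunks: chunks).call do |token|
#         sse.write(token: token)
#       end
#     ensure
#       sse&.close
#     end
#   end
```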

Controlling Hallucinations in Rails RAG

The single most dangerous failure mode of retrieval augmented generation in Rails is confident wrong answers. The model will happily make up a policy that does not exist if your retrieval returned irrelevant chunks. Three things keep this under control.

First: lower the temperature. I use 0.2 for RAG. Creativity is not a feature here.

Second: force citations. Notice the [doc:123] format in the system prompt above. In production I post-process the streamed response and verify every cited document ID actually appears in the retrieved chunks. If the model invents a citation, I log it and flag the answer.

Third: have a refusal answer. The “I don’t have that in my knowledge base” response is worth more than any clever prompt tweak. Users tolerate “I don’t know.” They do not tolerate a fabricated refund policy.

class HallucinationGuard
  CITATION_PATTERN = /\[doc:(\d+)\]/

  def initialize(answer:, retrieved_ids:)
    @answer = answer
    @retrieved_ids = retrieved_ids.to_set
  end

  def verified?
    cited = @answer.scan(CITATION_PATTERN).flatten.map(&:to_i).to_set
    cited.subset?(@retrieved_ids)
  end
end

Caching: The Part Nobody Tells You About

A production Rails RAG pipeline makes three LLM calls per answer — rewrite, embed, generate — plus a reranker round-trip. At scale that gets expensive fast. Three caches change the economics:

  1. Embedding cache keyed on a SHA256 of the text. Embeddings are deterministic for a given model; cache them forever.
  2. Rewrite cache keyed on the question. Short TTL — maybe an hour — because people phrase the same question many ways.
  3. Answer cache keyed on question plus retrieved chunk IDs. Also short TTL. This is the one that pays for itself within a week.

All three fit comfortably in Rails.cache, backed by Solid Cache or Redis.
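Key construction is where these caches go wrong in practice, so here is a sketch of the three key builders. Names are illustrative; the values would be stored via Rails.cache.fetch in the real app. Note the answer key hashes the question and includes the sorted chunk IDs, so a re-ingest that changes retrieval invalidates cached answers automatically.

```ruby
require "digest"

# Illustrative cache-key builders for the three RAG caches.
# Pure functions: deterministic, no Rails dependency.
module RagCacheKeys
  module_function

  # Embeddings are deterministic per model, so key on model + content hash
  # and cache forever.
  def embedding(text, model: "text-embedding-3-small")
    "rag:emb:#{model}:#{Digest::SHA256.hexdigest(text)}"
  end

  # Rewrites drift with conversation context; use a short TTL at fetch time.
  def rewrite(question)
    "rag:rw:#{Digest::SHA256.hexdigest(question)}"
  end

  # Answer cache keys on the question AND the exact retrieved chunk set.
  def answer(question, chunk_ids)
    "rag:ans:#{Digest::SHA256.hexdigest(question)}:#{chunk_ids.sort.join('-')}"
  end
end
```

Usage would look like `Rails.cache.fetch(RagCacheKeys.embedding(text)) { client.embed(text) }`, adding `expires_in:` for the rewrite and answer keys.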

Observability for RAG

You cannot debug a RAG system without logs. Every request should record the rewritten query, the retrieved chunk IDs, the reranker scores, the final prompt token count, and the output. I store all of this in a rag_traces table and build a tiny internal dashboard on top of it. When a user complains “the bot gave me the wrong answer,” I need to see exactly what retrieval returned. Without that table, you are guessing.
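A minimal migration for that traces table, assuming the fields listed above (column names are illustrative — adapt them to whatever your dashboard queries need):

```ruby
class CreateRagTraces < ActiveRecord::Migration[8.0]
  def change
    create_table :rag_traces do |t|
      t.references :user, foreign_key: true
      t.text :question, null: false
      t.text :rewritten_query
      t.jsonb :retrieved_chunk_ids, default: []
      t.jsonb :reranker_scores, default: []
      t.integer :prompt_token_count
      t.text :answer
      t.timestamps
    end
  end
end
```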

FAQ

What is the difference between RAG and fine-tuning for a Rails app?

Fine-tuning bakes knowledge into the model weights and is expensive, slow to update, and locks you to a model version. Rails RAG keeps knowledge in your database where you can change it instantly, scope it per tenant, and swap the underlying model without retraining. For 95% of business use cases, retrieval augmented generation is the correct choice.

Do I need a vector database separate from Postgres?

No. pgvector on Postgres handles tens of millions of chunks comfortably with an HNSW index. I have never hit a scale where pulling in Pinecone, Weaviate, or Qdrant was justified for a Rails app. Keep it in Postgres until you have measured proof you need more.

How big should my chunks be for Rails RAG?

Start at 800 tokens with 100 tokens of overlap and measure. Shorter chunks (200–400 tokens) improve precision for lookup-style questions. Longer chunks (1200–1500 tokens) work better for explanatory questions where context matters. If you only pick one number, 800 is a good default.

Which LLM should I use for generation in a Rails RAG pipeline?

Claude Sonnet 4.6 is my default because it follows the “answer only from context” instruction more reliably than any other model I have tested. GPT-4o is close. If latency matters more than accuracy, Claude Haiku 4.5 is cheaper and faster but needs tighter prompting to avoid hallucination.


Need help shipping a production RAG system in Rails? TTB Software specializes in AI-augmented Rails platforms — embeddings, retrieval pipelines, and Claude integrations that hold up under real user load. We have been building Rails systems for nineteen years.

#rails-rag #retrieval-augmented-generation-rails #rails-claude-api #pgvector-rag #anthropic-claude-ruby #llm-rails-production #ai-rails

About the Author

Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.
