Rails RAG: Build a Production Retrieval Augmented Generation System with Claude and pgvector
A fintech client called me in January because their support team was drowning. They had six years of policy PDFs, internal runbooks, compliance memos, and Confluence pages — and every new hire spent three weeks learning where anything lived. “Can ChatGPT just answer questions against our docs?” they asked. That is the sentence that launches every RAG project. And after nineteen years of Rails, I can tell you: building Rails RAG that actually works in production is less about the model and more about the plumbing around it.
This post is the playbook I wish I had when I built that first system. We will wire up a full Rails RAG pipeline with pgvector for storage, Claude for generation, and the kind of production concerns nobody warns you about until you get paged at two in the morning.
What Rails RAG Actually Is
RAG — Retrieval Augmented Generation — is the pattern where you do not fine-tune a model on your documents. Instead, when a user asks a question, you retrieve the relevant passages from your own data and feed them into the prompt so the model can answer grounded in your content.
A Rails RAG system has three stages:
- Ingestion. Chunk your documents, embed each chunk, store the vectors.
- Retrieval. When a question comes in, embed it, look up the nearest chunks, assemble a context.
- Generation. Pass the question plus retrieved context to an LLM like Claude and stream the answer.
Every single one of those steps has a production trap. Let’s walk through them.
Why Rails Is a Great Fit for Retrieval Augmented Generation
Most RAG tutorials are written in Python with LangChain and assume you have a data science team. Rails teams tend to see that stack and assume RAG is not for them. That is backwards. Postgres with pgvector handles embeddings beautifully, Active Job handles ingestion, Active Record handles chunk metadata, and ActionController::Live handles streaming. You do not need a new service. You need a new table.
I have built three production RAG systems in Rails now. None of them introduced a new runtime. All three ran on the existing Postgres the app already had.
Ingestion: Chunk, Embed, Store
The single highest-leverage decision in Rails RAG is how you chunk your documents. Chunks that are too big dilute the relevant signal. Chunks that are too small lose context. My default: 800 tokens with 100 tokens of overlap, measured by a real tokenizer, not character counts.
Here is the schema. Assumes you have already set up pgvector — if not, start with the pgvector and semantic search guide.
class CreateDocumentChunks < ActiveRecord::Migration[8.0]
  def change
    create_table :document_chunks do |t|
      t.references :document, null: false, foreign_key: true
      t.text :content, null: false
      t.integer :position, null: false
      t.integer :token_count, null: false
      t.jsonb :metadata, default: {}
      t.column :embedding, :vector, limit: 1536
      t.timestamps
    end

    add_index :document_chunks,
      :embedding,
      using: :hnsw,
      opclass: :vector_cosine_ops,
      name: "index_chunks_on_embedding_hnsw"
  end
end
The chunker. I use the tiktoken_ruby gem because OpenAI’s text-embedding-3-small is still the best price-per-quality embedding on the market, and its tokenizer is the one we need to respect.
class Chunker
  CHUNK_TOKENS = 800
  OVERLAP_TOKENS = 100

  def initialize(text)
    @text = text
    @encoder = Tiktoken.encoding_for_model("text-embedding-3-small")
  end

  def call
    tokens = @encoder.encode(@text)
    stride = CHUNK_TOKENS - OVERLAP_TOKENS
    chunks = []

    (0...tokens.length).step(stride) do |start|
      window = tokens[start, CHUNK_TOKENS]
      break if window.nil? || window.empty?

      chunks << {
        content: @encoder.decode(window),
        token_count: window.length,
        position: chunks.length
      }

      break if start + CHUNK_TOKENS >= tokens.length
    end

    chunks
  end
end
Do not ingest in the web request. Embedding API calls are slow, rate-limited, and occasionally flaky. Push ingestion into a background job — Solid Queue or Sidekiq both work fine.
class IngestDocumentJob < ApplicationJob
  queue_as :ingest

  def perform(document_id)
    document = Document.find(document_id)
    chunks = Chunker.new(document.body).call
    embeddings = EmbeddingClient.new.embed_batch(chunks.map { _1[:content] })

    DocumentChunk.transaction do
      document.document_chunks.delete_all
      chunks.each_with_index do |chunk, i|
        document.document_chunks.create!(
          chunk.merge(embedding: embeddings[i])
        )
      end
    end
  end
end
Two production notes that matter. First: batch your embedding calls. OpenAI accepts up to 2048 inputs per request. Calling the API once per chunk will burn your rate limit and your budget. Second: wrap the delete-and-recreate in a transaction, otherwise a mid-ingest crash leaves you with half a document and you will not notice until a user asks about the missing half.
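To make the batching concrete, here is a minimal sketch of an embedding client. The post never shows this class, so everything here — the class name, the env var, the error handling level — is an assumption; it only illustrates slicing inputs into the 2048-per-request limit and preserving input order.

```ruby
require "net/http"
require "json"

# Hypothetical sketch of the EmbeddingClient referenced above.
class EmbeddingClient
  MAX_BATCH = 2048 # OpenAI's per-request input limit for embeddings
  ENDPOINT = URI("https://api.openai.com/v1/embeddings")

  def embed_batch(texts)
    # One API round-trip per 2048 inputs instead of one per chunk
    texts.each_slice(MAX_BATCH).flat_map { |batch| request_embeddings(batch) }
  end

  private

  def request_embeddings(batch)
    req = Net::HTTP::Post.new(
      ENDPOINT,
      "Content-Type" => "application/json",
      "Authorization" => "Bearer #{ENV['OPENAI_API_KEY']}"
    )
    req.body = { model: "text-embedding-3-small", input: batch }.to_json

    res = Net::HTTP.start(ENDPOINT.host, ENDPOINT.port, use_ssl: true) do |http|
      http.request(req)
    end

    # Each result carries an index; sort so embeddings match input order
    JSON.parse(res.body).fetch("data").sort_by { _1["index"] }.map { _1["embedding"] }
  end
end
```

In production you would also want retries with backoff around the HTTP call, since embedding endpoints are exactly the "occasionally flaky" dependency the job queue exists to absorb.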
Retrieval: The Query Pipeline
Retrieval is where naive RAG implementations fall apart. Embedding the user’s raw question and looking up the top five chunks works fine for demos. It fails the moment a real user types “what about the refund stuff we changed last year” because that question has no semantic overlap with the actual policy text.
The pattern that holds up in production is: rewrite, retrieve, rerank.
class RagRetriever
  def initialize(question, user:)
    @question = question
    @user = user
  end

  def call
    rewritten = rewrite_question
    candidates = vector_search(rewritten, limit: 20)
    rerank(candidates, original: @question)
  end

  private

  def rewrite_question
    ClaudeClient.new.complete(
      system: "Rewrite the user question as a standalone search query. No preamble.",
      user: @question,
      max_tokens: 120
    )
  end

  def vector_search(text, limit:)
    vector = EmbeddingClient.new.embed(text)
    # pgvector expects a '[0.1,0.2,...]' literal. Quote it properly rather
    # than interpolating the raw Ruby array into SQL.
    literal = DocumentChunk.connection.quote("[#{vector.join(',')}]")

    DocumentChunk
      .joins(:document)
      .where(documents: { tenant_id: @user.tenant_id })
      .order(Arel.sql("embedding <=> #{literal}"))
      .limit(limit)
  end

  def rerank(candidates, original:)
    pairs = candidates.map { |c| [original, c.content] }
    scores = RerankerClient.new.score(pairs)
    candidates.zip(scores).sort_by { -_2 }.first(6).map(&:first)
  end
end
Three things deserve emphasis here. The tenant scope in vector_search is not optional — every production RAG bug I have debugged eventually traced back to someone retrieving another tenant’s chunks. The reranker (Cohere, Jina, or a small local cross-encoder) nearly doubles the quality of the top three results and is worth the extra 80ms. And the question rewrite step turns conversational follow-ups into real search queries, which is the single biggest win over naive retrieval.
Generation: Feeding Context to Claude
I default to Claude Sonnet 4.6 for the generation step in Rails RAG. It is the best-behaved model I have used when you tell it “answer only from the provided context.” Opus is overkill for most RAG. Haiku is fast but hallucinates more when the context is ambiguous.
class RagAnswerer
  SYSTEM_PROMPT = <<~PROMPT
    You are a support assistant. Answer ONLY using the provided context.
    If the context does not contain the answer, say "I don't have that in
    my knowledge base" and stop. Cite the source document ID for every claim
    using the format [doc:123].
  PROMPT

  def initialize(question:, chunks:)
    @question = question
    @chunks = chunks
  end

  def call(&block)
    context = @chunks.map { |c|
      "[doc:#{c.document_id}] #{c.content}"
    }.join("\n\n---\n\n")

    ClaudeClient.new.stream(
      system: SYSTEM_PROMPT,
      user: "Context:\n\n#{context}\n\nQuestion: #{@question}",
      max_tokens: 1024,
      &block
    )
  end
end
Stream the output straight to the browser. I covered this pattern in detail in streaming LLM responses with ActionController::Live, but the short version is: RAG without streaming feels broken because the model thinks for three seconds before the first token. With streaming, the user sees an answer forming in under 400ms.
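The wire format itself is simple Server-Sent Events. Here is a sketch of just the frame helper — the helper name is mine, and this assumes an SSE transport as in the linked post; in a real controller each frame is written to response.stream inside an ActionController::Live action with the Content-Type header set to text/event-stream.

```ruby
require "json"

# Hypothetical helper: wrap each streamed token in an SSE frame.
# A controller would call response.stream.write(sse_frame(token))
# for every token the Claude stream yields.
def sse_frame(token)
  "data: #{JSON.generate({ token: token })}\n\n"
end
```

Each frame is a `data:` line followed by a blank line, which is all the browser's EventSource API needs to fire a message event per token.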
Controlling Hallucinations in Rails RAG
The single most dangerous failure mode of retrieval augmented generation in Rails is confident wrong answers. The model will happily make up a policy that does not exist if your retrieval returned irrelevant chunks. Three things keep this under control.
First: lower the temperature. I use 0.2 for RAG. Creativity is not a feature here.
Second: force citations. Notice the [doc:123] format in the system prompt above. In production I post-process the streamed response and verify every cited document ID actually appears in the retrieved chunks. If the model invents a citation, I log it and flag the answer.
Third: have a refusal answer. The “I don’t have that in my knowledge base” response is worth more than any clever prompt tweak. Users tolerate “I don’t know.” They do not tolerate a fabricated refund policy.
class HallucinationGuard
  CITATION_PATTERN = /\[doc:(\d+)\]/

  def initialize(answer:, retrieved_ids:)
    @answer = answer
    @retrieved_ids = retrieved_ids.to_set
  end

  def verified?
    cited = @answer.scan(CITATION_PATTERN).flatten.map(&:to_i).to_set
    cited.subset?(@retrieved_ids)
  end
end
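A quick illustration of the check, with the guard's logic inlined (the answer text and IDs here are invented sample data):

```ruby
require "set"

# Same check HallucinationGuard#verified? performs, inlined for illustration
answer = "Standard refunds take 5 business days [doc:12]. " \
         "Gift cards are final sale [doc:99]."
retrieved_ids = Set[12, 34, 56]

cited = answer.scan(/\[doc:(\d+)\]/).flatten.map(&:to_i).to_set
cited.subset?(retrieved_ids) # => false: [doc:99] was never retrieved, so flag the answer
```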
Caching: The Part Nobody Tells You About
A production Rails RAG pipeline makes three model calls per answer — rewrite, embed, generate — plus a reranker round-trip. At scale that gets expensive fast. Three caches change the economics:
- Embedding cache keyed on a SHA256 of the text. Embeddings are deterministic for a given model; cache them forever.
- Rewrite cache keyed on the question. Short TTL — maybe an hour — because people phrase the same question many ways.
- Answer cache keyed on question plus retrieved chunk IDs. Also short TTL. This is the one that pays for itself within a week.
All three fit comfortably in Rails.cache, backed by Solid Cache or Redis.
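A sketch of the key builders for the first and third cache — the module and method names are mine, not from the post, and the Rails.cache calls are shown as comments because they need a running app:

```ruby
require "digest"

# Hypothetical cache-key builders for the embedding and answer caches.
module RagCacheKeys
  module_function

  # Embeddings are deterministic per model: key on model + content hash, no TTL
  def embedding_key(text, model: "text-embedding-3-small")
    "rag:emb:#{model}:#{Digest::SHA256.hexdigest(text)}"
  end

  # Same question + same retrieved chunks => same answer. Sort the IDs so the
  # retrieval set hashes identically regardless of ranking order.
  def answer_key(question, chunk_ids)
    "rag:ans:#{Digest::SHA256.hexdigest(question)}:#{chunk_ids.sort.join('-')}"
  end
end

# Rails.cache.fetch(RagCacheKeys.embedding_key(text)) { client.embed(text) }
# Rails.cache.fetch(RagCacheKeys.answer_key(q, ids), expires_in: 1.hour) { answer }
```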
Observability for RAG
You cannot debug a RAG system without logs. Every request should record the rewritten query, the retrieved chunk IDs, the reranker scores, the final prompt token count, and the output. I store all of this in a rag_traces table and build a tiny internal dashboard on top of it. When a user complains “the bot gave me the wrong answer,” I need to see exactly what retrieval returned. Without that table, you are guessing.
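For reference, a plausible shape for that table — the column names here are my guess at the fields described above, not the post's actual schema:

class CreateRagTraces < ActiveRecord::Migration[8.0]
  def change
    create_table :rag_traces do |t|
      t.references :user, foreign_key: true
      t.text :question, null: false
      t.text :rewritten_query
      t.jsonb :retrieved_chunk_ids, default: []
      t.jsonb :reranker_scores, default: []
      t.integer :prompt_tokens
      t.text :answer
      t.boolean :citations_verified
      t.timestamps
    end
  end
end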
FAQ
What is the difference between RAG and fine-tuning for a Rails app?
Fine-tuning bakes knowledge into the model weights and is expensive, slow to update, and locks you to a model version. Rails RAG keeps knowledge in your database where you can change it instantly, scope it per tenant, and swap the underlying model without retraining. For 95% of business use cases, retrieval augmented generation is the correct choice.
Do I need a vector database separate from Postgres?
No. pgvector on Postgres handles tens of millions of chunks comfortably with an HNSW index. I have never hit a scale where pulling in Pinecone, Weaviate, or Qdrant was justified for a Rails app. Keep it in Postgres until you have measured proof you need more.
How big should my chunks be for Rails RAG?
Start at 800 tokens with 100 tokens of overlap and measure. Shorter chunks (200–400 tokens) improve precision for lookup-style questions. Longer chunks (1200–1500 tokens) work better for explanatory questions where context matters. If you only pick one number, 800 is a good default.
Which LLM should I use for generation in a Rails RAG pipeline?
Claude Sonnet 4.6 is my default because it follows the “answer only from context” instruction more reliably than any other model I have tested. GPT-4o is close. If latency matters more than accuracy, Claude Haiku 4.5 is cheaper and faster but needs tighter prompting to avoid hallucination.
Need help shipping a production RAG system in Rails? TTB Software specializes in AI-augmented Rails platforms — embeddings, retrieval pipelines, and Claude integrations that hold up under real user load. We have been building Rails systems for nineteen years.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.