Semantic Search in Rails with pgvector: From Zero to Production
A client came to me last year with a support ticket queue problem. They had four years of resolved tickets in their Rails app — over 80,000 of them — and their support team spent twenty minutes per new ticket just searching for similar past cases. The search was keyword-based. A ticket about “app won’t open” returned zero results when similar past tickets said “application fails to launch.” Same problem, different words, useless search.
Semantic search solved it in an afternoon. Not because I’m clever, but because pgvector and OpenAI embeddings have gotten genuinely simple to integrate into a Rails stack. Here’s exactly what I built, and how you can do the same.
What Vector Embeddings Actually Are
Forget the math for now. An embedding is a list of numbers — a vector — that represents the meaning of a piece of text. Two sentences that mean the same thing will have vectors that are close together in space, even if they share no words. “App won’t open” and “application fails to launch” end up nearly identical vectors. “Database migration guide” ends up far away.
You generate these vectors by sending text to an embedding model (OpenAI’s text-embedding-3-small is fast and cheap). You store the vectors in Postgres using the pgvector extension. You query for similarity using a dot product or cosine distance. That’s it.
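The distance math is worth seeing once, even though pgvector does it for you in SQL. A toy sketch in plain Ruby — the vectors are 4-dimensional and the values are made up for readability; real embeddings from text-embedding-3-small have 1536 dimensions:

```ruby
# Cosine similarity: dot product divided by the product of magnitudes.
# Close to 1.0 means similar meaning; close to 0.0 means unrelated.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

wont_open      = [0.8, 0.6, 0.1, 0.0] # "app won't open" (made-up values)
fails_launch   = [0.7, 0.7, 0.2, 0.1] # "application fails to launch"
migration_docs = [0.0, 0.1, 0.9, 0.8] # "database migration guide"

cosine_similarity(wont_open, fails_launch)   # high — similar meaning
cosine_similarity(wont_open, migration_docs) # low — unrelated
```

pgvector's cosine *distance* is simply 1 minus this similarity, which is why smaller distances mean closer meaning.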
Setting Up pgvector
First, the extension. If you’re on managed Postgres (RDS, Supabase, Render), pgvector is likely already available. For a fresh install on Debian/Ubuntu:
sudo apt install postgresql-16-pgvector
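Before writing the migration, you can confirm the extension is actually available to your Postgres server from psql:

```sql
-- Lists the vector extension if the server can enable it
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'vector';
```

If this returns no rows, the migration below will fail — install the package (or pick a managed plan that ships pgvector) first.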
Then enable it in a Rails migration:
class EnablePgvector < ActiveRecord::Migration[8.0]
  def up
    execute "CREATE EXTENSION IF NOT EXISTS vector"
  end

  def down
    execute "DROP EXTENSION IF EXISTS vector"
  end
end
Add the neighbor gem to your Gemfile. It provides the ActiveRecord integration for pgvector — the vector column type plus the has_neighbors API used below:

# Gemfile
gem "neighbor"

Then add a vector column to whichever table you want to make searchable. For the support ticket example:

class AddEmbeddingToTickets < ActiveRecord::Migration[8.0]
  def change
    add_column :tickets, :embedding, :vector, limit: 1536
  end
end

The limit: 1536 matches the dimensionality of OpenAI's text-embedding-3-small. If you use a different model, adjust accordingly (text-embedding-3-large produces 3072-dimension vectors).

Generating and Storing Embeddings

Wire up the model. With the neighbor gem, has_neighbors is available on any ActiveRecord model — no include needed:

# app/models/ticket.rb
class Ticket < ApplicationRecord
  has_neighbors :embedding

  after_create_commit :generate_embedding, if: :embeddable?

  def embeddable?
    subject.present? && body.present?
  end

  private

  def generate_embedding
    GenerateEmbeddingJob.perform_later(self)
  end
end
The job that calls OpenAI:
# app/jobs/generate_embedding_job.rb
class GenerateEmbeddingJob < ApplicationJob
  queue_as :embeddings

  def perform(record)
    text = [record.subject, record.body].compact.join("\n\n")
    vector = EmbeddingService.generate(text)
    record.update_columns(embedding: vector)
  end
end
And the service wrapper around the OpenAI client. OpenAI::Client here comes from the ruby-openai gem, so add that to your Gemfile too:

# app/services/embedding_service.rb
class EmbeddingService
  MODEL = "text-embedding-3-small"

  def self.generate(text)
    response = client.embeddings(
      parameters: {
        model: MODEL,
        input: text.truncate(8000) # rough character-based guard against the 8,191-token input limit
      }
    )
    response.dig("data", 0, "embedding")
  end

  def self.client
    @client ||= OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)
  end
end
update_columns bypasses callbacks and validations intentionally — the embedding is a derived value, and you don't want a routine vector write firing update callbacks or touching updated_at.
Querying: Find Similar Records
With the neighbor gem’s has_neighbors, you get a nearest_neighbors scope for free:
# Find the 10 tickets most similar to a given ticket
Ticket.nearest_neighbors(:embedding, ticket.embedding, distance: "cosine").limit(10)
For a search box where you have a raw query string, generate an embedding for the query first:
# app/services/ticket_search.rb
class TicketSearch
  def self.call(query, limit: 10)
    return Ticket.none if query.blank?

    query_vector = EmbeddingService.generate(query)

    Ticket
      .where.not(embedding: nil)
      .nearest_neighbors(:embedding, query_vector, distance: "cosine")
      .limit(limit)
  end
end
In the controller:
# app/controllers/tickets_controller.rb
def index
  @tickets = if params[:q].present?
    TicketSearch.call(params[:q])
  else
    Ticket.order(created_at: :desc).limit(50)
  end
end
That’s the core. Ninety lines of code and your search understands meaning rather than matching keywords.
Adding an HNSW Index for Production
An exact nearest-neighbor search scans every row in the table. Fine for 10,000 records; painful at 500,000. pgvector supports two approximate nearest-neighbor index types: IVFFlat and HNSW. HNSW is better for most use cases — faster queries at the cost of slightly more index build time.
class AddHnswIndexToTicketsEmbedding < ActiveRecord::Migration[8.0]
  def up
    execute <<~SQL
      CREATE INDEX tickets_embedding_hnsw_idx
      ON tickets
      USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64)
    SQL
  end

  def down
    remove_index :tickets, name: :tickets_embedding_hnsw_idx
  end
end
The parameters m and ef_construction control the quality/speed tradeoff. For most production workloads, m = 16 and ef_construction = 64 are fine starting points. Raise ef_construction if recall quality matters more than index build time.
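Query-time recall has its own knob, separate from the build-time parameters: hnsw.ef_search, which defaults to 40. It is set per session:

```sql
-- Session-level recall knob for HNSW queries (default 40).
-- Higher values scan more of the graph: better recall, slower queries.
SET hnsw.ef_search = 100;
```

From Rails you'd issue this through the connection (e.g. ActiveRecord::Base.connection.execute) before the search query, within the same session.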
Run VACUUM ANALYZE tickets after building the index to update query planner statistics.
Backfilling Existing Records
You probably have records that existed before you added the embedding column. Don’t try to backfill them all in a single migration — the API calls will time out and you’ll hold locks. Use a background job with in_batches:
# lib/tasks/embeddings.rake
namespace :embeddings do
  desc "Backfill embeddings for tickets missing them"
  task backfill: :environment do
    Ticket.where(embedding: nil).in_batches(of: 100) do |batch|
      batch.each do |ticket|
        GenerateEmbeddingJob.perform_later(ticket)
      end
      sleep 0.5 # paces enqueueing; throttle the actual API calls via worker concurrency
    end
  end
end
Run this with bundle exec rails embeddings:backfill. For 80,000 tickets at text-embedding-3-small pricing and an average of ~200 tokens per ticket, you’re looking at roughly $0.32 in API costs total. Cheap.
A Minimal RAG Pipeline
Once you have semantic search working, you’re halfway to RAG (Retrieval-Augmented Generation) — the pattern where you pull relevant context from your database before sending a question to the LLM. Here’s what it looks like added to the ticket system:
# app/services/support_answer.rb
class SupportAnswer
  SYSTEM_PROMPT = <<~PROMPT
    You are a support assistant. Use the provided past ticket resolutions to suggest
    an answer. Be specific and practical. If the past tickets don't cover the question,
    say so rather than guessing.
  PROMPT

  def self.call(question)
    similar = TicketSearch.call(question, limit: 5)
    context = similar.map { |t| "Q: #{t.subject}\nA: #{t.resolution}" }.join("\n\n---\n\n")

    client.chat(
      parameters: {
        model: "gpt-4o",
        messages: [
          { role: "system", content: SYSTEM_PROMPT },
          { role: "user", content: "Past similar tickets:\n\n#{context}\n\nNew question: #{question}" }
        ],
        temperature: 0.3
      }
    )
  end

  def self.client
    @client ||= OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)
  end
end
temperature: 0.3 keeps the answers grounded. You’re not looking for creativity — you want the model to synthesize past resolutions, not invent new ones.
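Note that the chat endpoint returns a nested response hash, not a string — the generated text sits a few keys deep. A plain-Ruby sketch, using a hash shaped like the API's response:

```ruby
# The answer text lives at choices[0].message.content in the
# chat response. The hash below mimics that shape for illustration.
response = {
  "choices" => [
    { "message" => { "role" => "assistant", "content" => "Try clearing the app cache first." } }
  ]
}

answer = response.dig("choices", 0, "message", "content")
```

In a controller you'd dig out this string before rendering, rather than handing the raw hash to the view.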
Production Gotchas
Nil embeddings will pollute your results. Records whose embedding column is still null have no distance to rank by and surface unpredictably in the sort. Scope them out: .where.not(embedding: nil).
Cosine vs. L2 distance. Cosine distance is the right choice for text — it ignores vector magnitude and focuses on direction (meaning). OpenAI's embeddings are normalized to unit length, so cosine and inner-product rankings coincide for them, but cosine stays correct even if you later mix in a model that doesn't normalize. L2 distance is more appropriate for images or numeric features. Stick with cosine for language models.
Keep the embedding model consistent. If you generate some embeddings with text-embedding-3-small and later switch to text-embedding-ada-002, comparisons between old and new vectors are meaningless. Pick a model and stick with it. If you do switch, full backfill required.
Async or bust. Never generate embeddings synchronously in a web request. The OpenAI API adds 100-400ms latency. Always use a background job with a dedicated queue. At high volume, use OpenAI’s batch embedding endpoint — it’s 50% cheaper and designed for bulk workloads.
Chunking for long documents. If you’re embedding documents longer than ~400 words, consider splitting them into overlapping chunks (e.g., 300-word chunks with 50-word overlap) and storing each chunk as a separate vector. Retrieve chunks, deduplicate by parent document, return documents. This is the standard chunking strategy for RAG.
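The sliding-window chunking described above is a few lines of Ruby. A sketch — the method name and parameters are illustrative, not from any library:

```ruby
# Word-based chunking with overlap: 300-word windows that advance
# (size - overlap) = 250 words at a time, so consecutive chunks
# share their boundary 50 words.
def chunk_words(text, size: 300, overlap: 50)
  words = text.split
  step = size - overlap
  chunks = []
  index = 0
  while index < words.length
    chunks << words[index, size].join(" ")
    break if index + size >= words.length
    index += step
  end
  chunks
end

# A 700-word document yields 3 chunks: words 0-299, 250-549, 500-699.
chunks = chunk_words(("word " * 700).strip)
```

Each chunk then gets its own row (with a parent_id back to the document) and its own embedding.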
What This Doesn’t Replace
Semantic search is not a replacement for full-text search — it’s a complement. Exact keyword matches, phrase searches, and faceted filtering still work better with pg_search or Postgres’s native tsvector. The right architecture is often a hybrid: run semantic search and keyword search in parallel, merge results with a scoring function, present the union to the user.
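One common scoring function for that merge step is Reciprocal Rank Fusion. A minimal sketch in plain Ruby, operating on ordered lists of record IDs (one from each search); the k = 60 damping constant is the conventional default, not something this article's code prescribes:

```ruby
# Reciprocal Rank Fusion: each result earns 1/(k + rank) per list it
# appears in, so items ranked well by BOTH searches float to the top.
def fuse(semantic_ids, keyword_ids, k: 60)
  scores = Hash.new(0.0)
  [semantic_ids, keyword_ids].each do |ids|
    ids.each_with_index { |id, rank| scores[id] += 1.0 / (k + rank + 1) }
  end
  scores.sort_by { |_, score| -score }.map(&:first)
end

# Ticket 42 ranks well in both lists, so it rises above 7, 13, and 99.
fuse([42, 7, 99], [13, 42, 7])
```

The appeal of RRF is that it needs only ranks, never the incomparable raw scores of cosine distance versus ts_rank.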
Eighteen years into Rails, I’m still surprised by how cleanly the ecosystem absorbs new ideas. pgvector slots into ActiveRecord like it was always meant to be there. The complexity is in the product thinking — what should you embed, how do you chunk it, what context does the LLM actually need — not in the Rails plumbing.
Frequently Asked Questions
Do I need to use OpenAI for embeddings?
No. Any model that produces fixed-size dense vectors works. Alternatives include Mistral embeddings, Cohere Embed, Google’s text-embedding-004, or a locally-hosted model via Ollama. The tradeoff is quality vs. cost vs. latency. OpenAI’s text-embedding-3-small is a sensible default.
How much does this cost at scale?
text-embedding-3-small is priced at $0.02 per million tokens. At an average of 200 tokens per ticket, 100,000 tickets cost about $0.40 to embed. Running the search (embedding the query) costs fractions of a cent per search. Cost is not the constraint — architecture is.
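The arithmetic behind that estimate, using the figures above:

```ruby
# 100,000 tickets at ~200 tokens each, priced at $0.02 per million tokens
tickets = 100_000
tokens_per_ticket = 200
price_per_million = 0.02

total_tokens = tickets * tokens_per_ticket      # 20 million tokens
cost = total_tokens / 1_000_000.0 * price_per_million
```

Longer tickets scale the figure linearly — even at 1,000 tokens apiece you're at a couple of dollars.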
Can I use this without the pgvector gem?
Technically, yes — you can store vectors as arrays and write raw SQL for cosine distance. In practice, the pgvector gem gives you has_neighbors and type casting for free. Use it.
Is HNSW better than IVFFlat?
For most use cases, yes. HNSW has higher recall at equivalent query speed and doesn’t require you to pre-define the number of clusters (nlist in IVFFlat). IVFFlat is useful if you need extremely fast build times and can tolerate a recall tradeoff. If you don’t have a reason to choose IVFFlat, use HNSW.
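For comparison, the IVFFlat equivalent of the HNSW index from earlier looks like this — the lists value is the cluster count you must pick up front (a common rule of thumb is rows / 1000 for tables up to about a million rows):

```sql
-- IVFFlat alternative: faster to build, recall depends on lists
CREATE INDEX tickets_embedding_ivfflat_idx
ON tickets
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

Unlike HNSW, IVFFlat should be built after the table already has data, since the clusters are derived from the existing rows.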
What happens if the OpenAI API is down?
Records created while the API is unavailable will have nil embeddings. Your background job should use Solid Queue or Sidekiq retries with exponential backoff. When the API recovers, the jobs will complete. Your search will gracefully skip nil-embedding records in the meantime.
Need to add semantic search or a RAG pipeline to your Rails application? TTB Software has been building AI-powered features on Rails for years. We know where the edges are. Get in touch.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.