Streaming LLM Responses in Rails: Stop Making Users Stare at a Spinner
A client called me last October with a complaint that should have been a compliment. They’d integrated a GPT-4o-powered document analyzer into their Rails app — something I’d helped them build. Users were clicking “Analyze” and then staring at a white box for twelve seconds before the full response appeared. “Users think it’s broken,” the CTO told me. “Half of them are refreshing the page.”
The analysis was genuinely good. The user experience was genuinely terrible. The fix was two hours of work.
The problem was synchronous response handling. The OpenAI API was streaming tokens as fast as it could generate them, but the Rails controller was waiting for the entire response, saving it to the database, then rendering it. That’s the right architecture for batch processing. It’s the wrong architecture for anything a human is watching in real time.
Streaming fixes it. The first token appears on screen within a second. Users see progress. They stop refreshing.
The Transport Options
Three patterns for pushing server data to a browser in real time:
WebSockets — bidirectional, stateful, persistent connection. Great for multi-user chat. Significant overhead for “show the LLM output to one user.”
Long polling — browser makes a request, server holds it open, responds when data is ready. Works everywhere, awkward to implement cleanly, not genuinely streaming.
Server-Sent Events (SSE) — one-way, HTTP-based, browser-native. The browser opens a connection and the server pushes events as they arrive. Perfect for LLM streaming, where all the data flows in one direction.
SSE also maps cleanly to how OpenAI’s own streaming API works under the hood. The mental model is a direct translation.
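To make that wire format concrete, here is a minimal sketch of what a single SSE event looks like as plain text. The `format_sse_event` helper is hypothetical, for illustration only; in a Rails app, `ActionController::Live::SSE` does this framing for you:

```ruby
require "json"

# Hypothetical helper showing the SSE wire format: optional "event:" and
# "retry:" field lines, a "data:" line carrying the payload, and a blank
# line terminating the event.
def format_sse_event(payload, event: "message", retry_ms: 300)
  "event: #{event}\nretry: #{retry_ms}\ndata: #{JSON.generate(payload)}\n\n"
end

puts format_sse_event({ token: "Hel" })
# event: message
# retry: 300
# data: {"token":"Hel"}
```

The blank line is what separates events; the browser's EventSource parser buffers until it sees one.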
ActionController::Live
Rails has had SSE support since version 4.0 via ActionController::Live. It never became fashionable — the async web hype train moved on to JavaScript frameworks — but it’s well-maintained, production-tested, and requires zero additional infrastructure.
The basic pattern:
class AnalysisController < ApplicationController
  include ActionController::Live

  def stream
    response.headers["Content-Type"] = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    response.headers["X-Accel-Buffering"] = "no" # critical for nginx

    sse = ActionController::Live::SSE.new(response.stream, retry: 300, event: "message")

    begin
      sse.write({ status: "started" })
      # ... push your data here
    rescue ActionController::Live::ClientDisconnected
      # User navigated away — normal, not an error
    ensure
      sse.close
    end
  end
end
The X-Accel-Buffering: no header is easy to miss. Without it, nginx will buffer your entire response before forwarding it to the client. Your “streaming” feature doesn’t stream.
Hooking Up OpenAI Streaming
The ruby-openai gem supports streaming via a stream parameter that accepts a proc. Each token that arrives from the API calls your proc immediately:
class AnalysisController < ApplicationController
  include ActionController::Live

  def stream
    response.headers["Content-Type"] = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    response.headers["X-Accel-Buffering"] = "no"

    document = Document.find(params[:id])
    sse = ActionController::Live::SSE.new(response.stream, retry: 300, event: "message")

    begin
      client = OpenAI::Client.new(
        access_token: Rails.application.credentials.openai_api_key
      )

      client.chat(
        parameters: {
          model: "gpt-4o",
          messages: [
            { role: "system", content: "You are a precise document analyst." },
            { role: "user", content: "Analyze this document:\n\n#{document.content}" }
          ],
          stream: proc { |chunk, _bytesize|
            token = chunk.dig("choices", 0, "delta", "content")
            sse.write({ token: token }) if token
          }
        }
      )

      sse.write({ status: "done" })
    rescue ActionController::Live::ClientDisconnected
      # Normal — user navigated away
    rescue => e
      sse.write({ error: "Analysis failed. Please try again." })
      Rails.logger.error("Streaming analysis failed: #{e.message}")
    ensure
      sse.close
    end
  end
end
Wire it up in routes:
# config/routes.rb
resources :documents do
  member do
    get :stream
  end
end
The Frontend
The browser’s native EventSource API handles SSE without any libraries:
function startAnalysis(documentId) {
  const output = document.getElementById("analysis-output");
  output.textContent = "";

  const source = new EventSource(`/documents/${documentId}/stream`);

  source.addEventListener("message", (event) => {
    const data = JSON.parse(event.data);

    if (data.token) {
      output.textContent += data.token;
    }

    if (data.status === "done") {
      source.close();
    }

    if (data.error) {
      output.textContent = data.error;
      source.close();
    }
  });

  source.onerror = () => {
    output.textContent += "\n\n[Connection lost. Refresh to try again.]";
    source.close();
  };
}
If you’re using Turbo, you can append to a <turbo-frame> instead of setting textContent directly — but for streaming output, vanilla DOM manipulation is cleaner. Turbo morphing and partial token streaming don’t mix well without careful handling.
Persisting the Result
Streaming to the browser solves the UX problem, but you probably want to save the completed analysis somewhere. Accumulate the full response in the controller and save it after the stream ends:
full_response = +"" # mutable string — note the +

client.chat(
  parameters: {
    model: "gpt-4o",
    messages: messages,
    stream: proc { |chunk, _bytesize|
      token = chunk.dig("choices", 0, "delta", "content")
      if token
        full_response << token
        sse.write({ token: token })
      end
    }
  }
)

document.update!(analysis: full_response)
sse.write({ status: "done" })
The +"" gives you a mutable String. With # frozen_string_literal: true at the top of the file, a bare "" literal is frozen, and appending to it with the shovel operator raises FrozenError. The prefix is easy to forget and infuriating to debug at midnight.
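A quick demonstration of the failure mode. Here the string is frozen explicitly with .freeze, which is exactly what the frozen_string_literal magic comment does to every bare string literal in the file:

```ruby
frozen = "".freeze # what a bare "" becomes under frozen_string_literal: true

begin
  frozen << "token"
rescue FrozenError => e
  puts "append failed: #{e.class}"
end

mutable = +"" # unary + returns an unfrozen string
mutable << "token"
puts mutable
```

Running this prints "append failed: FrozenError" followed by "token": same literal, entirely different behavior depending on that one-character prefix.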
Production: Puma Threads
Here’s what catches people in production: ActionController::Live holds a Puma thread open for the entire duration of the stream. A ten-second OpenAI response occupies one thread for ten seconds.
Puma’s default thread pool is min: 5, max: 5. With five concurrent users triggering LLM streams, every thread is occupied. Request number six queues. Response times for your entire application degrade.
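Before tuning anything, a back-of-the-envelope capacity check helps: by Little's law, the number of threads tied up at any moment is roughly the arrival rate of streaming requests times the average stream duration. The numbers below are assumptions for illustration:

```ruby
# Little's law: concurrency = arrival rate x time in system.
streams_per_second = 0.5   # assumed: one streaming request every two seconds
stream_duration_s  = 10.0  # assumed: average time a stream holds a thread
threads_occupied = streams_per_second * stream_duration_s
puts threads_occupied      # an entire five-thread pool, from modest traffic
```

Half a request per second at ten seconds each saturates five threads. Plug in your own traffic numbers before deciding which option below you need.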
Options:
Increase the thread pool. Fine up to a point. Each extra thread adds memory overhead (its stack plus whatever objects it keeps alive) and more contention on the GVL, so watch memory and latency as you scale it up:
# config/puma.rb
threads_count = ENV.fetch("RAILS_MAX_THREADS", 20)
threads threads_count, threads_count
Dedicated streaming process. Route /documents/:id/stream to a separate Puma process or dyno with a larger thread pool. Heroku, Render, and Fly all support multiple process types in the same app. Your main app stays responsive; the streaming process absorbs the blocking threads.
Polling via background jobs. A Solid Queue or Sidekiq job calls the LLM, stores tokens in Redis, and a lightweight polling endpoint drains them into an SSE response. More infrastructure, more complexity — worth it at high volume (hundreds of concurrent streams).
For most applications — fewer than 100 concurrent LLM streams — increasing the Puma thread pool to 15–20 and deploying it on hardware with enough RAM is the pragmatic answer. Keep it simple until the numbers tell you otherwise.
nginx Configuration
Set X-Accel-Buffering: no in the controller header and in your nginx config. Proxy buffering has a habit of reasserting itself at the config level even when you’ve set the header:
location /documents {
    proxy_buffering off;
    proxy_cache off;
    proxy_pass http://rails_app;
    proxy_read_timeout 120s;
}
proxy_read_timeout is the one people always miss. nginx defaults to 60 seconds. A long document analysis or multi-step reasoning chain can take longer. Without this, nginx closes the connection mid-stream and the user sees a truncated response with no indication of failure.
Rate Limiting
Once streaming works, rate-limit it. An LLM call that streams for 15 seconds while holding a Puma thread is a far more expensive resource than a fast JSON endpoint. Rack::Attack:
# config/initializers/rack_attack.rb
Rack::Attack.throttle("llm_stream/ip", limit: 5, period: 60) do |req|
  req.ip if req.path.include?("/stream")
end
Five streaming requests per minute per IP is generous for normal use and tight enough to prevent abuse. Adjust based on your actual usage patterns.
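For intuition, a throttle like this is essentially a fixed-window counter keyed by the discriminator (here, the IP) and the current time window. A simplified in-memory sketch of the idea; Rack::Attack itself stores its counters in a cache store such as Rails.cache, and the class below is hypothetical:

```ruby
# Simplified fixed-window throttle. Returns true when the request
# should be blocked.
class FixedWindowThrottle
  def initialize(limit:, period:)
    @limit  = limit
    @period = period
    @counts = Hash.new(0) # in-memory stand-in for a cache store
  end

  def throttled?(key, now = Time.now.to_i)
    window = now / @period               # requests in the same window share a counter
    cache_key = "#{key}:#{window}"
    @counts[cache_key] += 1
    @counts[cache_key] > @limit
  end
end

throttle = FixedWindowThrottle.new(limit: 5, period: 60)
results = 6.times.map { throttle.throttled?("1.2.3.4", 1000) }
puts results.inspect  # first five allowed, sixth blocked
```

The trade-off of fixed windows is burstiness at the boundary (up to 2x the limit straddling two windows), which is acceptable for abuse prevention at this scale.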
When Not to Use This
Streaming isn’t always the right answer. If you’re generating a PDF report, there’s nothing to stream — you need the complete response before you can do anything with it. If you’re running batch enrichments in a background job, streaming to the browser makes no sense.
Streaming matters when a human is actively watching and waiting. Document analysis, AI writing assistance, code review, question answering — anything where perceived responsiveness affects whether users trust the feature. When I swapped synchronous response handling for streaming on that document analyzer, the “it feels broken” support requests stopped. The actual latency was identical. The experience was not.
After nineteen years of building Rails applications, I keep rediscovering the same lesson: users tolerate slow processes they can observe. They don’t tolerate fast processes that look frozen.
Frequently Asked Questions
Does this work with Anthropic’s Claude or other providers?
Yes. The anthropic Ruby gem supports streaming with the same proc-based pattern:
client = Anthropic::Client.new(
  access_token: Rails.application.credentials.anthropic_api_key
)

client.messages(
  parameters: {
    model: "claude-opus-4-6",
    max_tokens: 2048,
    messages: messages,
    stream: proc { |event|
      token = event.dig("delta", "text")
      sse.write({ token: token }) if token
    }
  }
)
Most LLM providers that offer a streaming API follow the same token-by-token delivery model.
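In fact, most of these provider APIs deliver their streams as SSE themselves: "data:" lines carrying JSON chunks, with OpenAI terminating the stream with a literal "data: [DONE]" sentinel. A sketch of what parsing that raw body involves; the ruby-openai gem does this internally, and extract_tokens is a hypothetical helper:

```ruby
require "json"

# Pull the text deltas out of a raw OpenAI-style SSE stream body.
def extract_tokens(raw_sse)
  raw_sse.lines.filter_map do |line|
    next unless line.start_with?("data: ")     # ignore non-data field lines
    payload = line.delete_prefix("data: ").strip
    next if payload == "[DONE]"                # end-of-stream sentinel
    JSON.parse(payload).dig("choices", 0, "delta", "content")
  end
end

raw = <<~SSE
  data: {"choices":[{"delta":{"content":"Hel"}}]}
  data: {"choices":[{"delta":{"content":"lo"}}]}
  data: [DONE]
SSE

puts extract_tokens(raw).join  # => "Hello"
```

Which is why SSE is such a natural transport on the Rails side: you are relaying one SSE stream into another.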
Can I use Hotwire Turbo Streams instead of EventSource?
You can, but it’s awkward. Turbo Streams expect complete HTML fragments, while token streaming delivers partial strings continuously. The cleaner pattern is to use EventSource for the stream, append tokens directly to a DOM element, and use Turbo only for the final “save and refresh” step after the stream completes.
What happens when the user closes the tab mid-stream?
ActionController::Live raises ActionController::Live::ClientDisconnected. Rescue it — but don’t log it as an error, because it’s completely normal. The LLM API call will continue on the provider’s end regardless. There’s no way to cancel an in-flight streaming call with the current ruby-openai gem, so you’ll pay for the full token count whether the user waits or not. This is worth knowing before you build features that trigger expensive long-form generation.
Is Server-Sent Events supported in all browsers?
All major browsers have supported SSE since 2012. The notable exception is IE11, which reached end-of-life in 2022. If you need IE11 support in 2026, you have larger problems than streaming LLM responses.
Building AI features into your Rails application? TTB Software has shipped several production LLM integrations on Rails — document analysis, RAG pipelines, AI-assisted workflows. We know where the edges are. Get in touch.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.