Streaming LLM Responses in Rails: Stop Making Users Stare at a Spinner

Roger Heykoop
AI in Rails, Ruby on Rails
Add real-time streaming to your Rails LLM features using ActionController::Live and Server-Sent Events. Practical code, Puma thread setup, and nginx config.

A client called me last October with a complaint that should have been a compliment. They’d integrated a GPT-4o powered document analyzer into their Rails app — something I’d helped them build. Users were clicking “Analyze” and then staring at a white box for twelve seconds before the full response appeared. “Users think it’s broken,” the CTO told me. “Half of them are refreshing the page.”

The analysis was genuinely good. The user experience was genuinely terrible. The fix was two hours of work.

The problem was synchronous response handling. The OpenAI API was streaming tokens as fast as it could generate them, but the Rails controller was waiting for the entire response, saving it to the database, then rendering it. That’s the right architecture for batch processing. It’s the wrong architecture for anything a human is watching in real time.

Streaming fixes it. The first token appears on screen within a second. Users see progress. They stop refreshing.

The Transport Options

Three patterns for pushing server data to a browser in real time:

WebSockets — bidirectional, stateful, persistent connection. Great for multi-user chat. Significant overhead for “show the LLM output to one user.”

Long polling — browser makes a request, server holds it open, responds when data is ready. Works everywhere, awkward to implement cleanly, not genuinely streaming.

Server-Sent Events (SSE) — one-way, HTTP-based, browser-native. The browser opens a connection and the server pushes events as they arrive. Perfect for LLM streaming, where all the data flows in one direction.

SSE also maps cleanly to how OpenAI’s own streaming API works under the hood. The mental model is a direct translation.
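
The wire format itself is trivial: each event is a few text fields followed by a blank line. A minimal sketch of what ActionController::Live's SSE helper writes to the socket (the helper's exact field order may differ):

```ruby
require "json"

# Build one Server-Sent Event frame: optional "retry:" and "event:" fields,
# a "data:" line carrying the JSON payload, and a blank-line terminator.
def sse_frame(payload, event: "message", retry_ms: nil)
  frame = +""
  frame << "retry: #{retry_ms}\n" if retry_ms
  frame << "event: #{event}\n" if event
  frame << "data: #{JSON.generate(payload)}\n\n"
  frame
end

print sse_frame({ token: "Hel" })
# event: message
# data: {"token":"Hel"}
```

The browser's EventSource parses these frames back into message events, which is why both the server and client code later in this post stay so small.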

ActionController::Live

Rails has had SSE support since version 4.0 via ActionController::Live. It never became fashionable — the async web hype train moved on to JavaScript frameworks — but it’s well-maintained, production-tested, and requires zero additional infrastructure.

The basic pattern:

class AnalysisController < ApplicationController
  include ActionController::Live

  def stream
    response.headers["Content-Type"] = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    response.headers["X-Accel-Buffering"] = "no" # critical for nginx

    sse = ActionController::Live::SSE.new(response.stream, retry: 300, event: "message") # retry: client reconnect delay in ms

    begin
      sse.write({ status: "started" })
      # ... push your data here
    rescue ActionController::Live::ClientDisconnected
      # User navigated away — normal, not an error
    ensure
      sse.close
    end
  end
end

The X-Accel-Buffering: no header is easy to miss. Without it, nginx will buffer your entire response before forwarding it to the client. Your “streaming” feature doesn’t stream.

Hooking Up OpenAI Streaming

The ruby-openai gem supports streaming via a stream parameter that accepts a proc. Each token that arrives from the API calls your proc immediately:

class AnalysisController < ApplicationController
  include ActionController::Live

  def stream
    response.headers["Content-Type"] = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    response.headers["X-Accel-Buffering"] = "no"

    document = Document.find(params[:id])
    sse = ActionController::Live::SSE.new(response.stream, retry: 300, event: "message")

    begin
      client = OpenAI::Client.new(
        access_token: Rails.application.credentials.openai_api_key
      )

      client.chat(
        parameters: {
          model: "gpt-4o",
          messages: [
            { role: "system", content: "You are a precise document analyst." },
            { role: "user", content: "Analyze this document:\n\n#{document.content}" }
          ],
          stream: proc { |chunk, _bytesize|
            token = chunk.dig("choices", 0, "delta", "content")
            sse.write({ token: token }) if token
          }
        }
      )

      sse.write({ status: "done" })
    rescue ActionController::Live::ClientDisconnected
      # Normal — user navigated away
    rescue => e
      sse.write({ error: "Analysis failed. Please try again." })
      Rails.logger.error("Streaming analysis failed: #{e.message}")
    ensure
      sse.close
    end
  end
end

Wire it up in routes:

# config/routes.rb
resources :documents do
  member do
    get :stream
  end
end

The Frontend

The browser’s native EventSource API handles SSE without any libraries:

function startAnalysis(documentId) {
  const output = document.getElementById("analysis-output");
  output.textContent = "";

  const source = new EventSource(`/documents/${documentId}/stream`);

  source.addEventListener("message", (event) => {
    const data = JSON.parse(event.data);

    if (data.token) {
      output.textContent += data.token;
    }

    if (data.status === "done") {
      source.close();
    }

    if (data.error) {
      output.textContent = data.error;
      source.close();
    }
  });

  source.onerror = () => {
    output.textContent += "\n\n[Connection lost. Refresh to try again.]";
    source.close();
  };
}

If you’re using Turbo, you can append to a <turbo-frame> instead of setting textContent directly — but for streaming output, vanilla DOM manipulation is cleaner. Turbo morphing and partial token streaming don’t mix well without careful handling.

Persisting the Result

Streaming to the browser solves the UX problem, but you probably want to save the completed analysis somewhere. Accumulate the full response in the controller and save it after the stream ends:

full_response = +""  # mutable string — note the +

client.chat(
  parameters: {
    model: "gpt-4o",
    messages: messages,
    stream: proc { |chunk, _bytesize|
      token = chunk.dig("choices", 0, "delta", "content")
      if token
        full_response << token
        sse.write({ token: token })
      end
    }
  }
)

document.update!(analysis: full_response)
sse.write({ status: "done" })

The +"" gives you a mutable String. With # frozen_string_literal: true at the top of the file, every string literal is frozen, and the shovel operator (<<) on a frozen string raises FrozenError. The prefix is easy to forget and infuriating to debug at midnight.
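
A quick demonstration of the failure mode, using an explicit .freeze (which is what the magic comment does to every string literal in the file):

```ruby
buffer = "".freeze          # frozen_string_literal: true freezes literals like this
error = begin
  buffer << "token"         # appending to a frozen string...
  nil
rescue FrozenError => e
  e.class.name              # ...raises FrozenError
end

mutable = +""               # unary + returns an unfrozen copy
mutable << "token"          # appends without complaint

puts error    # "FrozenError"
puts mutable  # "token"
```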

Production: Puma Threads

Here’s what catches people in production: ActionController::Live holds a Puma thread open for the entire duration of the stream. A ten-second OpenAI response occupies one thread for ten seconds.

A stock Rails puma.rb ships with a small fixed thread pool: five threads in older Rails versions, three since Rails 7.2. With five threads and five concurrent users triggering LLM streams, every thread is occupied. Request number six queues. Response times for your entire application degrade.

Options:

Increase the thread pool. Fine up to a point. Each thread adds stack space, more in-flight requests drive heap growth, and a busy Rails process can climb by tens to hundreds of megabytes as the pool grows:

# config/puma.rb
threads_count = ENV.fetch("RAILS_MAX_THREADS", 20).to_i
threads threads_count, threads_count

Dedicated streaming process. Route /documents/:id/stream to a separate Puma process or dyno with a larger thread pool. Heroku, Render, and Fly all support multiple process types in the same app. Your main app stays responsive; the streaming process absorbs the blocking threads.

Polling via background jobs. A Solid Queue or Sidekiq job calls the LLM, stores tokens in Redis, and a lightweight polling endpoint drains them into an SSE response. More infrastructure, more complexity — worth it at high volume (hundreds of concurrent streams).

For most applications — fewer than 100 concurrent LLM streams — increasing the Puma thread pool to 15–20 and deploying it on hardware with enough RAM is the pragmatic answer. Keep it simple until the numbers tell you otherwise.
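
When the numbers do start talking, Little's law gives a quick sizing check: average concurrency equals arrival rate times average stream duration. A back-of-envelope sketch with illustrative figures (substitute your own traffic):

```ruby
# Little's law: average concurrent streams = arrival rate * average duration.
streams_per_minute = 30.0   # illustrative: how often users start a stream
avg_stream_seconds = 10.0   # illustrative: how long each LLM response takes

concurrent_streams = streams_per_minute / 60.0 * avg_stream_seconds
# threads tied up by streaming, on average

headroom = 2.0              # cover bursts plus ordinary non-streaming requests
suggested_threads = (concurrent_streams * headroom).ceil

puts suggested_threads      # 10
```

If the suggested count exceeds what one process can comfortably hold, that is the signal to split streaming onto its own process, as described above.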

nginx Configuration

Set the X-Accel-Buffering: no header in the controller and disable buffering in your nginx config as well. Proxy buffering has a habit of reasserting itself at the config level even when you’ve set the header:

location /documents {
  proxy_buffering off;
  proxy_cache off;
  proxy_http_version 1.1;          # stream to the upstream over a keepalive connection
  proxy_set_header Connection "";  # don't forward "Connection: close" to the upstream
  proxy_pass http://rails_app;
  proxy_read_timeout 120s;
}

proxy_read_timeout is the one people always miss. nginx defaults to 60 seconds. A long document analysis or multi-step reasoning chain can take longer. Without this, nginx closes the connection mid-stream and the user sees a truncated response with no indication of failure.

Rate Limiting

Once streaming works, rate-limit it. An LLM call that streams for 15 seconds while holding a Puma thread is a far more expensive resource than a fast JSON endpoint. Rack::Attack:

# config/initializers/rack_attack.rb
Rack::Attack.throttle("llm_stream/ip", limit: 5, period: 60) do |req|
  req.ip if req.path.include?("/stream")
end

Five streaming requests per minute per IP is generous for normal use and tight enough to prevent abuse. Adjust based on your actual usage patterns.
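
If many legitimate users share one IP (offices, VPNs, university networks), keying the throttle on the authenticated user can be fairer than keying on the IP. A sketch assuming the user's id is stored in the Rack session; adjust the lookup to your auth setup:

```ruby
# config/initializers/rack_attack.rb
# Hypothetical discriminator: a :user_id stored in the Rack session.
Rack::Attack.throttle("llm_stream/user", limit: 5, period: 60) do |req|
  if req.path.end_with?("/stream")
    req.session[:user_id] || req.ip # fall back to IP for anonymous requests
  end
end
```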

When Not to Use This

Streaming isn’t always the right answer. If you’re generating a PDF report, there’s nothing to stream — you need the complete response before you can do anything with it. If you’re running batch enrichments in a background job, streaming to the browser makes no sense.

Streaming matters when a human is actively watching and waiting. Document analysis, AI writing assistance, code review, question answering — anything where perceived responsiveness affects whether users trust the feature. When I swapped synchronous response handling for streaming on that document analyzer, the “it feels broken” support requests stopped. The actual latency was identical. The experience was not.

After nineteen years of building Rails applications, I keep rediscovering the same lesson: users tolerate slow processes they can observe. They don’t tolerate fast processes that look frozen.


Frequently Asked Questions

Does this work with Anthropic’s Claude or other providers?

Yes. The anthropic Ruby gem supports streaming with a similar proc-based pattern (the exact call shape varies between gem versions):

client = Anthropic::Client.new(
  api_key: Rails.application.credentials.anthropic_api_key
)

client.messages(
  model: "claude-opus-4-6",
  max_tokens: 2048,
  messages: messages,
  stream: proc { |event|
    token = event.dig("delta", "text")
    sse.write({ token: token }) if token
  }
)

Most LLM providers that offer a streaming API follow the same token-by-token delivery model.

Can I use Hotwire Turbo Streams instead of EventSource?

You can, but it’s awkward. Turbo Streams expect complete HTML fragments, while token streaming delivers partial strings continuously. The cleaner pattern is to use EventSource for the stream, append tokens directly to a DOM element, and use Turbo only for the final “save and refresh” step after the stream completes.

What happens when the user closes the tab mid-stream?

ActionController::Live raises ActionController::Live::ClientDisconnected. Rescue it — but don’t log it as an error, because it’s completely normal. The LLM API call will continue on the provider’s end regardless. There’s no way to cancel an in-flight streaming call with the current ruby-openai gem, so you’ll pay for the full token count whether the user waits or not. This is worth knowing before you build features that trigger expensive long-form generation.

Is Server-Sent Events supported in all browsers?

All major browsers have supported SSE since 2012. The notable exception is IE11, which reached end-of-life in 2022. If you need IE11 support in 2026, you have larger problems than streaming LLM responses.

Building AI features into your Rails application? TTB Software has shipped several production LLM integrations on Rails — document analysis, RAG pipelines, AI-assisted workflows. We know where the edges are. Get in touch.

#rails #llm #streaming #action-controller-live #sse #openai #ai

About the Author

Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.
