RUBY ON RAILS · 17 MIN READ

Streaming Claude Responses in Rails: SSE, Turbo Streams, and Real-Time AI Chat

Stream Claude responses in Rails with SSE and Turbo Streams. Token-by-token AI chat UI, backpressure, reconnects, and production patterns that scale.

A SaaS founder showed me a demo of his new AI assistant last September. He hit submit, the page froze for eleven seconds, and then the entire answer appeared at once. He looked at me and said, “It’s fine, right? Eleven seconds is fast for an LLM.” It was not fine. Every competitor he was up against had a chat UI that started typing within 400ms, and his customers were closing the tab before his response rendered. We spent the next afternoon converting his single blocking controller action into a streaming endpoint: Rails pushing Claude responses through Server-Sent Events into Turbo Streams. Time to first token dropped to 380ms. Trial-to-paid conversion on the AI feature doubled the next month.

After nineteen years of Rails I have shipped a lot of long-running responses — from large CSV exports to Postgres COPY pipelines — and none of them have a user experience problem quite as acute as an LLM call. A two-second SQL query feels acceptable. A two-second wait for an AI answer feels broken. This post is the production pattern I now use for every Rails app that talks to Claude: how to stream tokens from the Anthropic API into the browser, render them through Turbo Streams, and keep the connection alive through proxies, timeouts, and reconnects.

Why Streaming Claude Responses in Rails Matters

A non-streamed Claude call in Rails looks deceptively simple. You call the API, wait for the full response, render it. The problem is that even a fast Sonnet 4.6 response that finishes in three seconds feels like a hang, because nothing changes on screen for the full three seconds. Users have been trained by ChatGPT and Claude.ai to see characters appear as the model generates them. Anything slower feels broken.

Streaming fixes three things at once: perceived latency, error tolerance, and cost transparency. Time to first token is usually under 500ms even for long responses. If the model takes a wrong turn at token 200, your user sees it and can cancel rather than waiting for the full 2000-token response. And the gradual reveal naturally communicates “this is being generated right now,” which sets the right expectations.

The Rails-specific challenge is that streaming requires holding an HTTP connection open for the duration of the response. That works against most of Rails’ default assumptions — request middleware buffers the body, Puma workers are precious, and your reverse proxy probably buffers responses too. The pattern I will walk through threads through all of those.

The Anthropic Streaming API in Ruby

Anthropic’s Messages API supports streaming via stream: true, which switches the response from a single JSON body to a stream of Server-Sent Events. The official anthropic Ruby SDK exposes this through a block-based API.

require "anthropic"

client = Anthropic::Client.new(api_key: ENV.fetch("ANTHROPIC_API_KEY"))

client.messages.stream(
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  messages: [
    { role: "user", content: "Explain Rails Server-Sent Events in three paragraphs." }
  ]
) do |event|
  case event.type
  when "content_block_delta"
    print event.delta.text
  when "message_stop"
    puts "\n--- done ---"
  end
end

The SDK yields typed events for every stage: message_start, content_block_start, content_block_delta (the actual tokens), content_block_stop, message_delta (with usage info), and message_stop. Ninety percent of the time you only care about content_block_delta for the text and message_stop to know when to close the channel.

If you are using the Anthropic prompt caching pattern with cached system prompts, streaming works identically — the cache hit shows up in the final message_delta usage event, but the streaming behavior is unchanged. Always combine streaming with caching in production; you want both the latency win and the cost win.

Hooking SSE Into Rails With ActionController::Live

Rails has shipped streaming responses since version 4 via ActionController::Live. This is the piece most tutorials get wrong: people reach for ActionCable or a separate Node service when plain Live controllers are usually enough.

class ChatStreamsController < ApplicationController
  include ActionController::Live

  def create
    response.headers["Content-Type"] = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    response.headers["X-Accel-Buffering"] = "no"
    response.headers["Connection"] = "keep-alive"

    sse = SSE.new(response.stream, retry: 3000, event: "delta")
    conversation = Current.user.conversations.find(params[:conversation_id])
    message = conversation.messages.create!(role: "user", content: params[:prompt])
    assistant_message = conversation.messages.create!(role: "assistant", content: "")

    client = Anthropic::Client.new(api_key: ENV.fetch("ANTHROPIC_API_KEY"))

    client.messages.stream(
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      system: conversation.system_prompt,
      messages: conversation.api_messages
    ) do |event|
      case event.type
      when "content_block_delta"
        assistant_message.append_text!(event.delta.text)
        sse.write({ message_id: assistant_message.id, delta: event.delta.text })
      when "message_stop"
        sse.write({ message_id: assistant_message.id, done: true }, event: "done")
      end
    end
  rescue ActionController::Live::ClientDisconnected
    Rails.logger.info("Client disconnected mid-stream for message #{assistant_message&.id}")
  rescue Anthropic::APIError => e
    sse&.write({ error: e.message }, event: "error")
  ensure
    # SSE#close also closes the underlying stream, so close exactly once.
    sse ? sse.close : response.stream.close
  end
end

A few details that matter in production:

The X-Accel-Buffering: no header is what tells nginx not to buffer the response. Without it, your tokens will sit in the proxy buffer until enough have accumulated to justify a flush, defeating the entire point. The same setting matters for Caddy (flush_interval -1) and most cloud load balancers.

The append_text! method is doing the work of persisting the assistant message incrementally. I implement it as a thin wrapper that updates the content column and broadcasts to any other listeners — that way a user with the conversation open in a second tab also sees the message grow. We will use this in the next section.
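
Here is a minimal sketch of that wrapper, assuming Postgres (for the || string concatenation) and turbo-rails; the column and target names mirror the controllers in this post:

class Message < ApplicationRecord
  belongs_to :conversation

  def append_text!(delta)
    # Append at the SQL level so two writers cannot clobber each other,
    # then keep the in-memory copy in sync for the current request.
    self.class.where(id: id).update_all(["content = content || ?", delta])
    self.content = (content || "") + delta

    # Mirror the delta to any other open tabs subscribed to this conversation.
    broadcast_append_to conversation,
      target: "message_#{id}_content",
      html: ERB::Util.html_escape(delta)
  end
end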

The rescue for ClientDisconnected is critical. If a user closes the tab mid-stream, you do not want a stack trace; you want to log it and stop. The Anthropic::APIError rescue catches things like rate limits and surfaces them to the client.

Connecting the Stream to Turbo Streams in the UI

Server-Sent Events on their own are usable — the browser’s native EventSource API can receive them — but stitching them into the DOM yourself is exactly the kind of code you do not want to maintain. Turbo Streams already knows how to apply DOM operations from the server, so I let the controller emit Turbo Stream actions directly.

There are two patterns. For app-wide chat state that any tab should see, use ActionCable broadcasts driven by Turbo::StreamsChannel. For a single user’s in-flight request, use a dedicated SSE endpoint that emits Turbo Stream frames. I prefer the second for chat because you do not pay the ActionCable overhead and the lifecycle of the stream exactly matches the lifecycle of the request.

class ChatStreamsController < ApplicationController
  include ActionController::Live
  include Turbo::Streams::ActionHelper

  def create
    response.headers["Content-Type"] = "text/event-stream"
    response.headers["Cache-Control"] = "no-cache"
    response.headers["X-Accel-Buffering"] = "no"

    conversation = Current.user.conversations.find(params[:conversation_id])
    assistant_message = conversation.messages.create!(role: "assistant", content: "")

    write_sse turbo_stream_action_tag(
      "append",
      target: "messages",
      template: render_to_string(partial: "messages/message", locals: { message: assistant_message })
    )

    Anthropic::Client.new.messages.stream(stream_params(conversation)) do |event|
      next unless event.type == "content_block_delta"

      assistant_message.append_text!(event.delta.text)
      write_sse turbo_stream_action_tag(
        "append",
        target: "message_#{assistant_message.id}_content",
        template: ERB::Util.html_escape(event.delta.text)
      )
    end

    # Tell the client we are finished so it can close instead of reconnecting.
    response.stream.write("event: done\ndata: {}\n\n")
  ensure
    response.stream.close
  end

  private

  # EventSource only parses well-formed SSE frames, so prefix every line of
  # the Turbo Stream tag with "data:" and terminate the frame with a blank line.
  def write_sse(html)
    html.to_s.each_line { |line| response.stream.write("data: #{line.chomp}\n") }
    response.stream.write("\n")
  end
end

On the client side you need a tiny Stimulus controller (see Stimulus controllers production patterns for the broader pattern) that opens the SSE connection and pipes the Turbo Stream frames into Turbo’s renderStreamMessage.

import { Controller } from "@hotwired/stimulus"
import { renderStreamMessage } from "@hotwired/turbo"

export default class extends Controller {
  static values = { url: String }

  connect() {
    this.source = new EventSource(this.urlValue)
    this.source.onmessage = (event) => renderStreamMessage(event.data)
    this.source.addEventListener("done", () => this.source.close())
    this.source.addEventListener("error", () => this.source.close())
  }

  disconnect() {
    this.source?.close()
  }
}

That is the full pipeline. The user types, the form posts a prompt, the controller opens a stream, Claude tokens flow back, each content_block_delta becomes a Turbo Stream append action, the browser appends the text into a live <div>. Time to first visible character is under 500ms in production.

Backpressure, Reconnects, and Partial Messages

The naive version breaks in three places once you ship it.

Backpressure happens when the client is slow to read but the server keeps writing. The TCP buffer fills, your Puma thread blocks on the write, and one slow user can pin a worker. Rails does not expose a write timeout on response.stream, so bound each write yourself and abort the stream when a single write stalls. The Anthropic SDK’s block-based API plays well with this because raising inside the block cleanly aborts the upstream connection too.

require "timeout"

# Use this in place of bare response.stream.write calls inside the action.
def write_with_timeout(chunk)
  Timeout.timeout(5) { response.stream.write(chunk) }
rescue Timeout::Error
  # Treat a stalled client like a disconnect: the rescue in the action logs
  # it, and raising inside the SDK block aborts the Anthropic connection.
  raise ActionController::Live::ClientDisconnected
end

Reconnects matter because mobile networks drop. The EventSource API will automatically reconnect, sending the Last-Event-ID header so you can resume. The cleanest way to support this is to make every SSE frame include the assistant message’s content position, and on reconnect resume from that offset using the persisted message content rather than re-calling Claude.

def create
  # ... same headers and SSE setup as the earlier controller ...
  message = Current.user.conversations
                   .find(params[:conversation_id])
                   .messages.find(params[:message_id])
  last_event_id = request.headers["Last-Event-ID"].to_i

  if last_event_id < message.content.length
    sse.write({ delta: message.content[last_event_id..] }, id: message.content.length)
  end

  if message.completed?
    sse.write({ done: true }, event: "done")
    return
  end

  # otherwise tail the persisted message until it is complete
  follow_in_progress_message(message, sse)
end
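
The follow_in_progress_message helper is hypothetical; one way to flesh it out is a polling tail over the persisted row:

def follow_in_progress_message(message, sse)
  cursor = message.content.length
  until message.reload.completed?
    if message.content.length > cursor
      sse.write({ delta: message.content[cursor..] }, id: message.content.length)
      cursor = message.content.length
    end
    sleep 0.25 # crude but effective; a pub/sub wakeup would also work
  end
  # Flush whatever arrived between the last poll and completion.
  if message.content.length > cursor
    sse.write({ delta: message.content[cursor..] }, id: message.content.length)
  end
  sse.write({ done: true }, event: "done")
end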

This is also why the append_text! pattern persists tokens as they arrive — it makes the assistant message recoverable. Without persistence, a reconnect means re-billing Anthropic for tokens the user already saw. With persistence, reconnects are effectively free.

Partial messages show up when Claude is cut off mid-sentence by a max_tokens limit, a network error, or a user navigating away. You need a completed? flag on the message so you can distinguish “still streaming” from “finished” from “abandoned.” I use three states — streaming, completed, failed — and a background job that cleans up streaming messages older than five minutes.
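
A sketch of that lifecycle, assuming a Rails 7-style status enum on messages; ReapStaleStreamsJob and its schedule are illustrative names:

class Message < ApplicationRecord
  enum :status, { streaming: 0, completed: 1, failed: 2 }, default: :streaming
end

class ReapStaleStreamsJob < ApplicationJob
  # Run every few minutes; anything still "streaming" after five minutes
  # was abandoned mid-stream.
  def perform
    Message.streaming
           .where(updated_at: ...5.minutes.ago)
           .update_all(status: Message.statuses[:failed], updated_at: Time.current)
  end
end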

Production Gotchas

A handful of operational details that the demo videos never mention:

Puma threads. Streaming holds a thread for the duration of the response. If your typical Claude response takes 8 seconds and you run 5 threads per worker, your effective concurrency for streaming endpoints is 5. Right-size by isolating the streaming controllers behind a dedicated Puma worker pool or, better, by running a second Puma process just for the streaming routes with a much higher thread count. We covered the underlying tuning model in Rails Puma tuning.
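
A sketch of the second-process idea, with illustrative numbers; route the streaming paths to this port at the proxy layer:

# config/puma_streaming.rb, started with: puma -C config/puma_streaming.rb
port ENV.fetch("STREAMING_PORT", 3001)
workers 2
threads 32, 32 # streaming threads spend their lives blocked on IO, so go wide
preload_app!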

Cloud load balancers. AWS ALB has a default 60-second idle timeout. If Claude takes longer than that and there is no traffic, the LB will close the connection. Either keep the connection alive by writing a heartbeat comment every 15 seconds (response.stream.write(": keepalive\n\n")) or bump the idle timeout to 300 seconds.
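
A heartbeat sketch, assuming a sibling thread for the life of the stream; SSE comment lines (leading colon) are ignored by EventSource, so the client never sees them:

heartbeat = Thread.new do
  loop do
    sleep 15
    response.stream.write(": keepalive\n\n")
  end
rescue IOError
  # The stream closed underneath us; exit quietly.
end

# ... run the Claude stream as usual ...

heartbeat.kill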

Rails reloading. In development, code reloading can deadlock long-running streaming requests: the reloader wants an exclusive lock, and your open stream is holding a share of it. The escape hatch from the Rails threading guide is ActiveSupport::Dependencies.interlock.permit_concurrent_loads around the blocking Claude call, which promises the interlock that this thread will not touch autoloaded constants while it waits.
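
A minimal sketch of that wrapper, assuming the same client and conversation as the earlier controllers:

ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
  client.messages.stream(model: "claude-sonnet-4-6", max_tokens: 2048,
                         messages: conversation.api_messages) do |event|
    # handle events exactly as before
  end
end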

Logging. Streaming responses make Rails’ default request logging useless because the request “ends” only when the stream closes. Add a before_action that records the start time and an after_action that logs the actual duration, token count, and finish reason. You want this data when you debug why one user’s stream took 47 seconds.
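
A sketch of that instrumentation as an around_action (equivalent to the before/after pair); the ivars are illustrative and would be set by the streaming action:

around_action :log_stream, only: :create

private

def log_stream
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  Rails.logger.info(
    "chat_stream message_id=#{@assistant_message&.id} " \
    "duration=#{elapsed.round(2)}s tokens=#{@output_tokens} finish=#{@finish_reason}"
  )
end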

Testing. Full-stack request tests that drive the real streaming endpoint tend to hang, waiting on a stream that never closes. Instead, write integration tests that mock the Anthropic client at the SDK level, yielding a predetermined sequence of events, and assert on the streamed body chunks.
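
An RSpec-flavored sketch of the SDK-level stub; the route helper is hypothetical, and because the stub yields instantly the request completes without hanging:

require "ostruct"

fake_events = [
  OpenStruct.new(type: "content_block_delta", delta: OpenStruct.new(text: "Hello ")),
  OpenStruct.new(type: "content_block_delta", delta: OpenStruct.new(text: "world")),
  OpenStruct.new(type: "message_stop")
]

messages_api = double("messages")
allow(messages_api).to receive(:stream) do |**_params, &block|
  fake_events.each { |event| block.call(event) }
end
allow(Anthropic::Client).to receive(:new).and_return(double(messages: messages_api))

post conversation_chat_streams_path(conversation), params: { prompt: "hi" }
expect(response.body).to include("Hello ").and(include("world"))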

FAQ

Should I use ActionCable instead of SSE for streaming Claude responses in Rails?

For single-user, single-conversation streaming I prefer SSE because it is simpler — one HTTP request, one stream, automatic browser reconnects, no separate WebSocket server, no channel subscriptions to manage. ActionCable is the right call when you have multi-user broadcasts (a shared collaborative AI chat) or when you are already using WebSockets for other features. With Solid Cable in Rails 8 the operational overhead of ActionCable is much lower, but SSE is still less code for the chat use case.

How do I handle Claude tool use and function calling while streaming in Rails?

Tool use complicates streaming because the model emits tool_use blocks that need to be executed before the response continues. The pattern is: stream until you see a content_block_start with type: "tool_use", accumulate the tool input, execute the tool when content_block_stop arrives for that block, then call messages.stream again with the tool result appended to the conversation. This is the same loop covered in Rails AI agents with Claude tool use, just with streaming wrapped around each LLM call.
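
A compressed sketch of that loop; TOOLS, run_tool, and sse are assumed from context, and the event shapes follow the raw Messages API stream:

history = conversation.api_messages

loop do
  tool_use = nil
  input_json = +""

  client.messages.stream(model: "claude-sonnet-4-6", max_tokens: 2048,
                         tools: TOOLS, messages: history) do |event|
    case event.type
    when "content_block_start"
      tool_use = event.content_block if event.content_block.type == "tool_use"
    when "content_block_delta"
      if event.delta.type == "input_json_delta"
        input_json << event.delta.partial_json # tool arguments arrive as JSON fragments
      else
        sse.write({ delta: event.delta.text }) # normal text keeps streaming to the UI
      end
    end
  end

  break unless tool_use # no tool requested, so the answer is complete

  input = JSON.parse(input_json)
  history << { role: "assistant", content: [{ type: "tool_use", id: tool_use.id,
                                              name: tool_use.name, input: input }] }
  history << { role: "user", content: [{ type: "tool_result", tool_use_id: tool_use.id,
                                         content: run_tool(tool_use.name, input).to_s }] }
end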

Can I use Anthropic streaming with prompt caching in Rails?

Yes — they compose cleanly. Set cache_control on your system prompt and any large context blocks exactly as you would for a non-streamed call. The streaming response’s final message_delta event includes usage.cache_read_input_tokens, which is how you verify the cache is hitting. In production I see streaming + caching together cutting both p95 latency and per-conversation cost by 70 to 90 percent on long-system-prompt chat apps.
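
A sketch of the combination; LONG_SYSTEM_PROMPT is illustrative, the cache_control placement follows the Messages API, and the usage check is where you verify hits:

client.messages.stream(
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: [
    { type: "text", text: LONG_SYSTEM_PROMPT, # your large static prompt
      cache_control: { type: "ephemeral" } }
  ],
  messages: conversation.api_messages
) do |event|
  case event.type
  when "content_block_delta"
    sse.write({ delta: event.delta.text })
  when "message_delta"
    # Non-zero cache_read_input_tokens confirms the cache is hitting.
    Rails.logger.info("cache_read=#{event.usage&.cache_read_input_tokens}")
  end
end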

What happens if a user closes the browser tab while a Claude response is streaming?

Rails raises ActionController::Live::ClientDisconnected on the next write to response.stream. The Anthropic SDK’s block-based API will keep yielding events until you exit the block, but you can break out cleanly and the upstream HTTP connection to Anthropic will close, stopping further token generation. If the assistant message has been persisted incrementally, the conversation history is still intact. If not, you have lost the partial response. Always persist as you stream.

Need help shipping a production-grade AI chat experience in Rails? TTB Software specializes in Rails, Claude integrations, and the operational patterns that keep streaming endpoints reliable at scale. We’ve been doing this for nineteen years.

#rails-streaming-claude #anthropic-streaming-ruby #server-sent-events-rails #turbo-streams-ai-chat #ai-chat-rails #claude-api-ruby #rails-actioncontroller-live
