Rails Puma Tuning: Workers, Threads, Memory and Concurrency for Production Performance
A client called me in February because their Rails app had started OOM-killing itself every twenty minutes. They had just doubled the number of Puma workers “to handle more traffic.” The machine had eight gigabytes of RAM. Each worker was sitting at 900MB. Twelve workers times 900MB is the kind of math that does not survive contact with Linux. I spent an hour on their Puma config and we got through the afternoon peak without a single restart.
After nineteen years of Rails I have seen more production outages caused by misconfigured Puma than by any other single piece of the stack. Sidekiq is usually fine. Postgres misbehaves loudly enough that you notice. Puma fails quietly, in the shape of 502s and slow responses that look like your application being generally bad. This post is the Rails Puma tuning guide I walk every client through when we audit their production config.
What Rails Puma Tuning Actually Controls
Puma has been the default web server for Rails since version 5, and it runs your application across two dimensions of concurrency: processes (called “workers” in Puma) and threads inside each process. Rails Puma tuning is the art of picking how many of each you run, how much memory you let them eat, and how they share resources with everything else on the box.
The knobs that matter are small in number:
- workers — how many forked processes Puma runs.
- threads — the minimum and maximum thread pool size inside each worker.
- preload_app! — whether the parent process loads your application before forking, enabling copy-on-write memory sharing.
- worker_timeout — how long a request can stall before Puma kills the worker.
- nakayoshi_fork and out-of-band GC — legacy settings that mostly do not matter in Ruby 3.3+.
Everything else is detail. Get these right and your p95 latency and memory footprint sort themselves out.
Rails Puma Tuning: Workers vs Threads Decision
The first decision every team has to make is how to split concurrency between workers and threads. The honest answer depends on the GVL — the Global VM Lock — and on what your application spends its time doing.
Ruby threads inside a single process cannot run Ruby code simultaneously because of the GVL. They can run concurrently when one of them is blocked on I/O: a database query, an HTTP call to a third party, a Redis lookup. Workers are separate OS processes with their own GVL, so they run Ruby code in parallel on different CPU cores.
The decision rule I give clients:
- If your app is I/O-bound — which most Rails apps are, because most requests are waiting on Postgres — lean on threads. They are cheap, they share memory, and they soak up I/O wait nicely.
- If your app is CPU-bound — rendering huge JSON, crunching reports, running PDF generation inline — you need workers. Threads will just queue behind the GVL.
- Most real Rails apps are a mix, which is why the standard answer is “a few workers, a few threads each.”
For a typical Rails API backed by Postgres with a mix of cached and uncached endpoints, I start at workers = number of CPU cores, threads = 3 to 5 per worker, and tune from there.
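Those starting numbers translate into concrete capacity like this (all values here are illustrative assumptions, not measurements):

```ruby
# Back-of-the-envelope concurrency for the starting config on a
# hypothetical 4-vCPU box.
cores              = 4
workers_count      = cores   # one worker per core
threads_per_worker = 5       # top of the 3-5 starting range

# Maximum requests in flight at once on this box:
in_flight_capacity = workers_count * threads_per_worker  # => 20

# Each worker holds its own ActiveRecord pool of threads_per_worker
# connections, so peak database connections from this box:
db_connections_needed = in_flight_capacity               # => 20
```

If 20 in-flight requests sounds low, remember that at a 100ms average response time it is roughly 200 requests per second per box.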
Setting Worker Count on Modern Hardware
The naive rule “one worker per CPU core” is usually right, but you have to know which CPUs you actually have.
```ruby
# config/puma.rb
require "etc" # Etc.nprocessors lives in the stdlib etc library

workers ENV.fetch("WEB_CONCURRENCY") { Etc.nprocessors }
```
On a dedicated VM with four vCPUs, four workers is correct. On a shared Kubernetes pod with a CPU request of 500m and a limit of 2, four workers is wrong — the scheduler will throttle you and you will see context-switch latency that looks like application slowness.
On containers, I use this pattern instead:
```ruby
# config/puma.rb
require "etc"

# Derive worker count from the cgroup v2 CPU quota so a container
# limited to 2 CPUs gets 2 workers, not the host's core count.
def available_cpu_count
  quota, period = File.read("/sys/fs/cgroup/cpu.max").split
  return Etc.nprocessors if quota == "max" # unlimited quota

  [(quota.to_f / period.to_f).ceil, 1].max
rescue Errno::ENOENT
  Etc.nprocessors # not in a cgroup v2 container
end

workers ENV.fetch("WEB_CONCURRENCY") { available_cpu_count }
```
That reads the cgroup v2 CPU quota and uses it as the worker count. Under Kubernetes with a CPU limit of 2, you get two workers, not the sixteen the host actually has. This change alone has saved clients more production pages than any other single thing I have done with Puma.
If you are running on Kamal or bare VMs, Etc.nprocessors is fine because the host sees the real CPU count. I covered the deploy story in the Kamal 2 guide.
Setting Thread Count Per Worker
Thread count is where teams over-tune. The default in config/puma.rb has historically been min: 5, max: 5, which is fine for most applications. I rarely set it higher than 10.
```ruby
# config/puma.rb
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }.to_i
threads threads_count, threads_count
```
Using the same value for min and max is deliberate. A Puma worker that grows and shrinks its thread pool wastes memory because threads in Ruby do not release their stack allocations cleanly. Pin it.
The ceiling on thread count is your database connection pool. With five threads per worker and four workers, that one Puma instance can open twenty database connections — plus background jobs, plus the console, plus whatever else. Because each worker process gets its own pool, the pool size must be at least equal to RAILS_MAX_THREADS:
```yaml
# config/database.yml
production:
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  checkout_timeout: 5
```
And the Postgres server has to have enough connections for every worker in every box. If you are close to your Postgres max_connections limit, put pgbouncer in front — I wrote about that pattern in the pgbouncer guide.
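The fleet-wide arithmetic is worth doing explicitly before reaching for pgbouncer. A sketch with hypothetical topology numbers (plug in your own):

```ruby
# Postgres connection budget across a hypothetical fleet.
web_boxes           = 3
workers_per_box     = 4
threads_per_worker  = 5   # RAILS_MAX_THREADS, and thus the AR pool size
sidekiq_processes   = 2
sidekiq_concurrency = 10
headroom            = 10  # consoles, migrations, cron one-offs

web_connections     = web_boxes * workers_per_box * threads_per_worker  # => 60
sidekiq_connections = sidekiq_processes * sidekiq_concurrency           # => 20
total_needed        = web_connections + sidekiq_connections + headroom  # => 90
```

Compare total_needed against your Postgres max_connections (the default is 100). If the numbers are this close, that is the pgbouncer signal.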
A related mistake: running more threads than your slowest external service can handle. If your third-party API times out at 10 seconds and you have 20 threads per worker, a brief provider outage can tie up every thread in every worker. Thread pools are a resource; treat them that way.
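One mitigation is to cap outbound timeouts well below worker_timeout so a stalled provider hands threads back quickly. A minimal sketch using Ruby's stdlib Net::HTTP (the host and timeout values are illustrative):

```ruby
require "net/http"

# Tight outbound timeouts: a slow third party releases the Puma thread
# in a few seconds instead of holding it for the provider's full timeout.
http = Net::HTTP.new("api.example.com", 443)
http.use_ssl       = true
http.open_timeout  = 2 # seconds to establish the connection
http.read_timeout  = 5 # seconds to wait for response data
http.write_timeout = 5 # seconds to send the request (Ruby 2.6+)
```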
Memory Budgeting: The Real Constraint
The limiting factor on most Rails production servers is not CPU. It is memory. Every Puma worker is an independent Ruby process with its own heap, its own compiled code cache, and its own loaded gems. A fresh-forked worker on a decent-sized Rails app is usually 250–400MB. After an hour of serving traffic, it is often 700MB to 1.2GB.
The formula I use for capacity planning:
max_memory_per_box = (worker_count * observed_worker_rss) + overhead
Where observed_worker_rss is what your workers grow to after warmup, not when they just started. I measure this with a ten-minute load test against staging. The overhead is usually 500MB–1GB for the OS, the reverse proxy, Sidekiq, and whatever else shares the box.
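If you do not have an APM handy, you can read worker RSS straight out of /proc on Linux. A small sketch; the helper name and approach are mine, not a Puma API:

```ruby
# Parse resident set size (in MB) out of a /proc/<pid>/status blob.
# Linux-only; helper is illustrative.
def rss_mb_from_status(status_text)
  line = status_text.lines.find { |l| l.start_with?("VmRSS:") }
  return nil unless line

  kb = line.split[1].to_i
  (kb / 1024.0).round(1)
end

# Usage against a live worker pid:
#   rss_mb_from_status(File.read("/proc/#{worker_pid}/status"))
sample = "VmRSS:\t  921600 kB\n"
rss_mb_from_status(sample) # => 900.0
```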
On that client from the opening story, the math was:
```
12 workers * 900MB + 500MB overhead = 11.3GB needed
Box had 8GB → OOM killer → 502s
```
We cut the worker count to four, raised threads from three to five per worker, and the box sat comfortably at 5GB. P95 latency improved because fewer workers were being killed and restarted.
The answer to “should I add more memory?” is almost always yes if you are on cloud hardware. The answer to “should I add more workers?” is almost always no unless you have measured CPU saturation. Teams flip these two questions in their heads constantly.
Copy-on-Write and preload_app
Ruby has supported copy-on-write friendly garbage collection since 2.0. Puma’s preload_app! directive leverages this: the parent process loads your Rails application once, then forks workers that share the loaded code pages with the parent. Until a page is written to, it is not duplicated in RAM.
```ruby
# config/puma.rb
preload_app!

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end

on_worker_boot do
  ActiveRecord::Base.establish_connection
end
```
The memory savings are real. On applications I have measured, preload_app! saves 150–300MB per worker after warmup. With four workers, that is a gigabyte of RAM you do not have to buy.
The two footguns are on the fork boundary. Any socket, thread, or connection opened in the parent becomes invalid in the child. The common ones:
- Database connections. Disconnect in before_fork, reconnect in on_worker_boot.
- Redis connections. If you use Sidekiq’s Redis from the Rails process — for rate limiting, for example — reconnect the same way.
- Background threads started at boot. Any gem that spawns a thread during eager loading (New Relic, Datadog, some telemetry SDKs) needs to re-initialize post-fork. The well-behaved ones handle this automatically; the badly behaved ones silently stop reporting.
If you enable preload_app! and your monitoring goes silent, check the fork hooks. I have lost days to this.
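For the Redis case, the hooks look like this. This is a sketch assuming a hand-rolled global client built with the redis gem; the $redis variable and REDIS_URL usage are illustrative, so adapt them to however your app actually holds the connection:

```ruby
# config/puma.rb -- fork-safe handling for a hypothetical global
# Redis client alongside the usual ActiveRecord hooks.
before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
  $redis&.close # do not let children inherit the parent's socket
end

on_worker_boot do
  ActiveRecord::Base.establish_connection
  $redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))
end
```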
Practical Puma Configuration for Production
Here is the config/puma.rb I actually ship. Tune the numbers but keep the structure.
```ruby
# config/puma.rb
require "etc"

max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }.to_i
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }.to_i
threads min_threads_count, max_threads_count

workers ENV.fetch("WEB_CONCURRENCY") { Etc.nprocessors }
preload_app!

port ENV.fetch("PORT") { 3000 }
environment ENV.fetch("RAILS_ENV") { "production" }
pidfile ENV.fetch("PIDFILE") { "tmp/pids/server.pid" }

worker_timeout 30
worker_shutdown_timeout 30

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end

on_worker_boot do
  ActiveRecord::Base.establish_connection
end

plugin :tmp_restart

lowlevel_error_handler do |exception, env|
  Rails.error.report(exception, context: { path: env["PATH_INFO"] }, handled: false)
  [500, { "Content-Type" => "text/plain" }, ["Internal Server Error"]]
end
```
A few details worth calling out:
- worker_timeout 30 is aggressive and intentional. If a request takes longer than thirty seconds, something is wrong and killing the worker is better than queueing behind it. Move long work to background jobs — I have a whole post on Solid Queue for that.
- worker_shutdown_timeout 30 gives in-flight requests time to finish during a deploy. Set this to match your longest acceptable request.
- lowlevel_error_handler catches errors Puma raises before your Rails middleware — mostly request parsing failures and slowloris-style misbehavior. Log them so you can tell genuine attacks apart from misbehaving clients.
Monitoring and Iterating
You cannot tune what you do not measure. The four signals I watch on every production Puma:
- Worker RSS over time — is memory flat or growing? A steady climb means a leak. I covered how to hunt those in the memory leak guide.
- Busy thread count per worker — are you using the threads you configured? If peak traffic leaves threads idle, you have room to cut workers and raise threads.
- Request queue time — time from accept to handoff. If this grows while response time is flat, you are under-provisioned at the worker level.
- Puma backlog — how many requests are waiting. The Puma control app exposes this at /stats if you enable it.
```ruby
# config/puma.rb
activate_control_app "unix:///tmp/puma_control.sock", auth_token: ENV["PUMA_CONTROL_TOKEN"]
```
You can then hit the socket with pumactl or scrape it into your metrics system. Anything that shows up as queue time is latency your users feel but your APM might attribute to “application.”
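Scraping it from Ruby is a dozen lines over the UNIX socket. A sketch; the parsing helper is mine, not part of Puma's API, and the socket path and token must match your activate_control_app line:

```ruby
require "socket"
require "json"

# Split a raw HTTP response from Puma's control app into its JSON body.
# Illustrative helper, not a Puma API.
def parse_control_body(raw_response)
  JSON.parse(raw_response.split("\r\n\r\n", 2).last)
end

# Query the control socket configured with activate_control_app.
def puma_stats(path: "/tmp/puma_control.sock", token: ENV["PUMA_CONTROL_TOKEN"])
  UNIXSocket.open(path) do |sock|
    sock.write("GET /stats?token=#{token} HTTP/1.0\r\n\r\n")
    parse_control_body(sock.read)
  end
end

# In cluster mode, backlog per worker:
#   puma_stats.fetch("worker_status").map { |w| w.dig("last_status", "backlog") }
```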
If you use OpenTelemetry, the Rack instrumentation surfaces queue time as a span attribute. I wrote about wiring that up in the OpenTelemetry guide.
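If you are not on OpenTelemetry, queue time is easy to derive yourself when the proxy stamps requests. A sketch assuming an nginx-style X-Request-Start header of the form "t=<unix microseconds>"; the middleware name and env key are mine:

```ruby
# Rack middleware that computes request queue time from a proxy-set
# X-Request-Start header ("t=<unix microseconds>"). Adjust the parsing
# to whatever format your proxy actually emits.
class QueueTimeAnnotator
  def initialize(app)
    @app = app
  end

  def call(env)
    raw = env["HTTP_X_REQUEST_START"]
    if raw
      started_at = raw.delete_prefix("t=").to_f / 1_000_000.0
      env["app.queue_time_ms"] = ((Time.now.to_f - started_at) * 1000.0).round(1)
    end
    @app.call(env)
  end
end

# In config.ru or a middleware initializer:
#   use QueueTimeAnnotator
```

Push app.queue_time_ms into your metrics system and alert on it; it is the earliest signal that the worker tier is under-provisioned.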
Common Rails Puma Tuning Mistakes
The same mistakes come up in almost every audit.
Setting WEB_CONCURRENCY without understanding the hardware. Three workers on a one-vCPU container is worse than one worker. Measure before you multiply.
Letting RAILS_MAX_THREADS drift away from the database pool. Every deploy one of them moves and the other does not, and then one busy afternoon you get ActiveRecord::ConnectionTimeoutError in production. Lock them to the same ENV var.
Enabling preload_app! without fixing the fork hooks. You get a memory improvement and a silent monitoring outage. Always test the hooks in staging before shipping.
Treating Puma as the bottleneck when it is actually Postgres. Nine times out of ten, a “slow Puma” is a Puma with all threads stuck on a slow query. Fix the query before you add workers.
Running Puma in cluster mode on a tiny box. If you have less than about 1.5GB of RAM to give your web tier, run Puma in single mode (no workers). You will lose a little isolation but spend a lot less on overhead.
FAQ
How many Puma workers and threads should a Rails app use?
Start with workers equal to your CPU count and 5 threads per worker. Measure. If your app is I/O-bound and threads sit idle at peak, reduce workers. If it is CPU-bound and response times grow under load, add workers and lower threads. Most Rails apps land at 2–4 workers with 5 threads each on a small production box.
Does Rails Puma tuning still matter with YJIT?
Yes. YJIT improves per-request CPU efficiency — I wrote about it in the YJIT guide — but it does not change how workers and threads share memory or how many requests can run in parallel. YJIT lets each thread do more work per second. Puma configuration still decides how many threads you have.
How much memory does a Puma worker use in production?
A typical Rails 7 or Rails 8 application runs at 250–400MB per worker cold and 600MB to 1.2GB warm, depending on gems and workload. Measure your own — a staging load test for ten minutes gives you a realistic number. Use that, not a blog post’s number, for capacity planning.
Should I use preload_app! in production?
Almost always yes. It saves real memory through copy-on-write. The only reason to turn it off is when you are debugging a fork-safety issue in a dependency and cannot fix it quickly. Even then, fix the dependency and re-enable preload_app! — single-digit percentage memory savings compound across a fleet.
Running Rails and tired of guessing at Puma settings? TTB Software specializes in performance and operations for Rails applications — from Puma tuning to Postgres, background jobs, and observability. We’ve been doing this for nineteen years.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.