Postgres Logical Replication for Rails: Zero-Downtime Major Version Upgrades and Database Moves
Postgres logical replication for Rails: zero-downtime major version upgrades, provider moves, and database splits, with publisher/subscriber setup, cutover, and slot monitoring.
A fintech client called me on a Wednesday: their RDS Postgres 13 cluster hit end-of-life in eleven weeks, the standard pg_upgrade path required a multi-hour write outage they could not take, and their last attempt at a blue-green database swap had ended with three minutes of dropped writes and a very long postmortem. They wanted to know whether Postgres logical replication could move them to Postgres 17 with truly zero downtime — not “five seconds of read-only mode,” actual zero. We finished the cutover four weeks later. Total write pause during the switchover: 1.6 seconds, well inside their connection retry envelope. No data loss, no support tickets.
After nineteen years of Rails I have done a lot of Postgres upgrades, and the honest answer is that pg_upgrade is fine for hobby projects and the wrong tool for production traffic above a thousand writes per second. Postgres logical replication for Rails is the boring, reliable, slightly tedious way to do major version upgrades, provider moves, region migrations, and database splits without taking your app offline. This post is the playbook I use with clients: how it works, the Rails-specific gotchas (sequences, large objects, bytea columns, replica identity), and the actual cutover sequence that holds writes for under two seconds.
How Postgres Logical Replication Differs From Streaming Replication
Streaming replication ships the write-ahead log byte-for-byte from a primary to a standby. The standby must run the exact same Postgres version, the exact same architecture, and is read-only until you promote it. It is the right tool for high-availability replicas — and the wrong tool for a major version upgrade, because you cannot replicate WAL across major versions.
Postgres logical replication ships decoded changes: “insert this row into this table, here are the column values.” A subscriber on Postgres 17 can consume changes from a publisher on Postgres 13 because both sides speak the logical protocol. The subscriber is a fully writable database that just happens to be receiving a live feed of changes from somewhere else. That is the whole trick.
This is also what makes it suitable for moves between providers (RDS to Aurora, Aurora to self-hosted, Heroku to RDS), between regions, between Postgres versions, and for surgical operations like “split the audit_logs schema onto its own database.” If you have read my earlier piece on Postgres connection pooling with PgBouncer, you already know one of the constraints we will run into during cutover.
The Pre-Flight Audit Every Rails App Needs
Before you touch a publication, spend a day on the audit. Postgres logical replication has sharp edges that bite Rails apps specifically.
Inventory tables that lack a primary key. Logical replication needs a replica identity to know which row to update or delete on the subscriber. By default that is the primary key. Rails gives every model a primary key, so this is usually fine — but join tables created with create_join_table and the legacy schema_migrations table do not. Every table without a primary key needs either one added (preferred) or REPLICA IDENTITY FULL (which works but is brutal for UPDATE/DELETE performance because it logs every column of every changed row).
SELECT n.nspname, c.relname, c.relreplident
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY n.nspname, c.relname;
relreplident is d (default: uses the primary key), n (nothing: updates and deletes on the table cannot be replicated), f (full), or i (using a specific index). Anything that is n is a blocker, and so is a d table with no primary key, because the default identity silently falls back to nothing; the audit script below catches that case.
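The fix for a keyless join table is short. A sketch, using a hypothetical projects_users table (on a large table the primary key build takes a lock and can rewrite the table, so schedule it):

-- Preferred: add a surrogate key (projects_users is a placeholder name)
ALTER TABLE projects_users ADD COLUMN id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY;
-- Fallback if you truly cannot add a key (slow for UPDATE/DELETE replication)
ALTER TABLE projects_users REPLICA IDENTITY FULL;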
Inventory sequences. Logical replication does not replicate sequence values. After cutover, every sequence on the subscriber will be at whatever value it had when you initialized it, and your next INSERT on the new database will collide with existing IDs. You have to manually advance sequences during the cutover window. I will show that script below.
Inventory large objects, unlogged tables, and materialized views. Large objects (the lo type, not bytea) are not replicated. Unlogged tables are not replicated. Materialized view contents are not replicated — only the definition. If you have these, you either accept the loss, replicate them separately, or refresh them on the subscriber after cutover.
Inventory DDL. Postgres logical replication does not replicate schema changes. Every ALTER TABLE you run during the migration window has to be run on both sides. Most teams freeze migrations during the cutover, which on a Rails app means freezing deploys with migrations — easier said than done.
Run this script for the audit summary:
# bin/audit_logical_replication
require_relative "../config/environment"

conn = ActiveRecord::Base.connection

puts "=== Tables without primary key ==="
puts conn.exec_query(<<~SQL).rows.map(&:first)
  SELECT c.relname FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r' AND n.nspname = 'public'
    AND c.relreplident = 'd'
    AND NOT EXISTS (
      SELECT 1 FROM pg_index i WHERE i.indrelid = c.oid AND i.indisprimary
    )
SQL

puts "\n=== Unlogged tables ==="
puts conn.exec_query(<<~SQL).rows.map(&:first)
  SELECT relname FROM pg_class
  WHERE relkind = 'r' AND relpersistence = 'u'
SQL

puts "\n=== Materialized views ==="
puts conn.exec_query("SELECT matviewname FROM pg_matviews").rows.map(&:first)

puts "\n=== Sequence count ==="
puts conn.exec_query("SELECT count(*) FROM pg_sequences").rows.first.first
Run this on production. Solve everything in the output before you create a single publication.
Setting Up the Publisher and Subscriber
Once the audit is clean, the actual logical replication setup is mechanical. On the publisher (the old database), enable wal_level = logical. This requires a restart, so schedule it during a low-traffic window or use your provider's tooling. On both RDS and Aurora you set rds.logical_replication = 1 in the parameter group and reboot, rather than running ALTER SYSTEM yourself.
-- On the publisher (Postgres 13)
ALTER SYSTEM SET wal_level = logical;
-- restart Postgres
CREATE PUBLICATION rails_app_pub FOR ALL TABLES;
-- Replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'redacted';
GRANT CONNECT ON DATABASE rails_app_production TO replicator;
GRANT USAGE ON SCHEMA public TO replicator;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replicator;
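Before moving on, sanity-check that the publication actually covers what you expect:

-- Quick check: list a few tables covered by the publication
SELECT * FROM pg_publication_tables
WHERE pubname = 'rails_app_pub'
LIMIT 10;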
On the subscriber (the new Postgres 17 database), first restore the schema only from the publisher. Use the Postgres 17 pg_dump binary to do it: a newer pg_dump can read an older server, but not the other way around. You want every table, index, sequence, and constraint, but no data — the data will arrive via replication.
# Schema-only dump from the old database (run with the new pg_dump binary)
pg_dump --host=old-db --schema-only --no-owner --no-privileges \
  --no-publications --no-subscriptions \
  rails_app_production > schema.sql

# Restore on the new database
psql --host=new-db rails_app_production < schema.sql
Now create the subscription on the new database:
-- On the subscriber (Postgres 17)
CREATE SUBSCRIPTION rails_app_sub
  CONNECTION 'host=old-db port=5432 dbname=rails_app_production user=replicator password=redacted'
  PUBLICATION rails_app_pub
  WITH (copy_data = true, create_slot = true, slot_name = 'rails_app_slot');
That single command does an enormous amount of work. It opens a connection back to the publisher, creates a replication slot, performs an initial COPY of every table to the subscriber, and then begins streaming changes from the slot’s start LSN. For a 200 GB database the initial copy can take several hours. During that time, every row inserted, updated, or deleted on the publisher is buffered in the replication slot, which means WAL on the publisher cannot be cleaned up until the subscriber catches up. Keep an eye on disk space.
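You can watch the initial copy progress per table from the subscriber; every table should eventually reach state r:

-- On the subscriber: per-table state of the initial sync
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel
ORDER BY srsubstate;
-- srsubstate: i = initializing, d = copying data, s = synchronized, r = ready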
Monitoring Replication Lag and Slot Health
The single most dangerous failure mode of Postgres logical replication is an undetected slot that keeps growing. The slot holds WAL on the publisher until the subscriber confirms it. If your subscriber crashes, falls hopelessly behind, or you misconfigure something, the slot grows, eats your disk, and eventually your publisher refuses writes. I have seen this take down production. Monitor it from day one.
On the publisher:
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
retained_wal is your danger metric. If it grows past a few gigabytes and is not shrinking, the subscriber is not keeping up.
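On Postgres 13 and newer you can also cap how much WAL a slot may retain. Treat it as a last-resort fuse: if the cap is hit the slot is invalidated and you have to rebuild the subscription, so set it well above your alert thresholds. The 50GB here is a placeholder, not a recommendation:

-- Publisher, Postgres 13+: hard cap on WAL retained by any slot
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';  -- size to your disk and alerting
SELECT pg_reload_conf();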
On the subscriber:
SELECT
  subname,
  pid,
  received_lsn,
  last_msg_receipt_time,
  latest_end_time
FROM pg_stat_subscription;
A last_msg_receipt_time more than ten seconds old means replication has stalled. I wire both of these into Datadog, with a warning alert at 2 GB of retained WAL and a critical at 8 GB.
A simple Rails check you can hit from a health endpoint or scheduled job:
# Raised when the replication slot on the publisher falls too far behind.
class ReplicationLagError < StandardError; end

class ReplicationHealthCheck
  THRESHOLD_BYTES = 2.gigabytes

  # Run this against the publisher -- the slot lives there.
  def self.publisher_lag
    sql = <<~SQL
      SELECT slot_name,
             pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
      FROM pg_replication_slots
      WHERE slot_name = 'rails_app_slot'
    SQL
    result = ActiveRecord::Base.connection.exec_query(sql).first
    return unless result

    if result["lag_bytes"].to_i > THRESHOLD_BYTES
      Rails.error.report(
        ReplicationLagError.new("Slot lag #{result['lag_bytes']} bytes"),
        severity: :warning
      )
    end
  end
end
Schedule this every minute. If you do not have a slot-lag alert wired up, you are not ready to cut over.
The Cutover: Holding Writes for 1.6 Seconds
The cutover is the heart of the migration. Done right, the application briefly cannot write, then continues against a new database, with no data loss. Done wrong, you split-brain your data between two databases and spend the weekend reconciling rows.
The sequence I use, in order:
Step 1: Confirm replication lag is essentially zero. pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) should be under a few kilobytes. If it is not, wait.
Step 2: Put the application in write-pause mode. The cleanest way I have found is a Rack middleware that returns 503 Service Unavailable for any non-GET request when a Redis flag is set. Reads continue, the homepage works, but no writes hit Postgres. Clients see a brief retryable error. This is the only window where you have downtime — and it should be measured in seconds, not minutes.
class WritePauseMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    request = ActionDispatch::Request.new(env)
    if !request.get? && !request.head? && write_paused?
      [503, {"Content-Type" => "application/json", "Retry-After" => "5"},
       [{error: "Maintenance in progress, retry shortly"}.to_json]]
    else
      @app.call(env)
    end
  end

  private

  # Fail open: if the cache is unreachable, let writes through rather
  # than hard-downing the app on a Redis blip.
  def write_paused?
    Rails.cache.fetch("write_pause", expires_in: 1.minute) { false }
  rescue
    false
  end
end
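Flipping the flag is a single cache write from a production console or rake task. This assumes Rails.cache is your shared Redis store, as the middleware does:

# Pause writes across all app servers
Rails.cache.write("write_pause", true)

# Resume (the middleware's fetch default takes over again)
Rails.cache.delete("write_pause")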
Step 3: Wait two seconds for any in-flight writes to complete or roll back.
Step 4: Advance every sequence on the subscriber to be ahead of the publisher. This is the single most-forgotten step in any Postgres logical replication runbook, and it causes constraint violations the moment you reopen writes.
namespace :replication do
  desc "Advance sequences on subscriber past publisher values"
  task advance_sequences: :environment do
    require "pg"

    # Use two raw PG connections: calling establish_connection twice on
    # ActiveRecord::Base would tear down the first pool before we read from it.
    publisher  = PG.connect(ENV.fetch("PUBLISHER_DATABASE_URL"))
    subscriber = PG.connect(ENV.fetch("SUBSCRIBER_DATABASE_URL"))

    sequences = publisher.exec(<<~SQL).values.flatten
      SELECT schemaname || '.' || sequencename FROM pg_sequences
      WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
    SQL

    sequences.each do |seq|
      pub_val  = publisher.exec("SELECT last_value FROM #{seq}").first["last_value"].to_i
      next_val = pub_val + 1000
      subscriber.exec("SELECT setval('#{seq}', #{next_val}, false)")
      puts "#{seq}: advanced to #{next_val}"
    end
  ensure
    publisher&.close
    subscriber&.close
  end
end
The + 1000 buffer is paranoia — it costs you a thousand IDs per sequence, which is nothing, and it gives you a safety margin against the rare race where the publisher allocates an ID during the pause window.
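A quick spot check before unpausing, run on both sides for a handful of hot sequences (users_id_seq is just an example name; the subscriber's value should be the higher one):

-- Run on publisher and subscriber and compare
SELECT last_value FROM users_id_seq;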
Step 5: Flip the application’s DATABASE_URL to the new database. On most modern hosting (Kamal, Heroku, ECS, Fly) this is a single environment variable change followed by a fast app restart. With Kamal proxy or a similar zero-downtime restart, this takes a couple of seconds and is invisible to users still seeing the maintenance response.
Step 6: Clear the write-pause flag. Writes flow to the new database.
Step 7: Verify with a known-good read-and-write probe. Insert a row, read it back, log it.
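A minimal probe sketch, assuming a throwaway cutover_probes table (hypothetical; create it on both databases ahead of time so it replicates cleanly):

# Write a row against the new database, read it back, log it
result = ActiveRecord::Base.connection.exec_query(
  "INSERT INTO cutover_probes (note, created_at) VALUES ('post-cutover', now()) RETURNING id"
)
id = result.first["id"]
row = ActiveRecord::Base.connection.exec_query(
  "SELECT note, created_at FROM cutover_probes WHERE id = #{id.to_i}"
).first
Rails.logger.info("Cutover probe OK: #{row.inspect}")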
The whole sequence — pause, sequences, swap, unpause — runs under three seconds with practice. The fintech client I mentioned earlier clocked 1.6 seconds. Their P99 latency during cutover looked like a brief spike, nothing more.
I covered the general philosophy of zero-downtime database migrations in an earlier post — this is the most extreme version of it.
After Cutover: Don’t Drop the Publisher Yet
The new database is taking writes. Resist the urge to delete the old one for at least a week. Two reasons.
First, your rollback plan needs the old database intact. If something is wrong with the new cluster — a parameter group misconfiguration, an extension that did not migrate, a query plan regression on Postgres 17 — you want the option of cutting back. Once you drop the publication and the slot, that option is gone.
Second, the publisher is a free read-only audit log of the migration window. If anyone reports missing data, you can compare row counts. If a webhook handler logged something on the wrong side, you can diff.
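For the row-count comparison, a cheap estimate on both sides is enough to flag anything grossly wrong (n_live_tup is an estimate, not an exact count):

-- Run on both databases and diff the output
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY relname;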
After a week of monitoring, tear down in this order: drop the subscription on the subscriber, drop the publication on the publisher, confirm the replication slot is gone from the publisher, snapshot the old cluster, and only then decommission it. Decommissioning should respect your backup policies and any retention requirements; talk to compliance if you are in a regulated industry.
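In SQL terms, something like this. DROP SUBSCRIPTION normally reaches back to the publisher and drops the remote slot for you; the manual drop is only needed if the subscription was detached from its slot first:

-- On the subscriber
DROP SUBSCRIPTION rails_app_sub;

-- On the publisher
DROP PUBLICATION rails_app_pub;

-- On the publisher, only if the slot survived (e.g. the subscription
-- was disabled and detached with slot_name = NONE before dropping)
SELECT pg_drop_replication_slot('rails_app_slot');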
When Postgres Logical Replication Is the Wrong Tool
For balance: do not reach for this if you are upgrading from Postgres 15 to 16 on a low-traffic app. pg_upgrade with link mode finishes in under a minute and a sixty-second maintenance window is cheaper than a four-week project. The complexity of Postgres logical replication for Rails earns its keep when downtime is genuinely unacceptable, when you are changing providers, when you are jumping multiple major versions, or when you are splitting a database into two.
The complexity is also not free if you have heavy bytea columns, frequent schema changes you cannot freeze, or tables without primary keys you cannot fix. Solve those first or pick a different migration strategy.
FAQ
Can Postgres logical replication upgrade across multiple major versions in one step?
Yes. Logical replication works across major versions from Postgres 10 onward (with the subscriber typically on the same or a newer version than the publisher), so you can go from 13 directly to 17 in a single replication setup with no intermediate stops. To be fair to pg_upgrade, it can also jump several major versions in one run; the advantage here is making the jump without the write outage.
How long does the initial COPY phase take for a large Rails database?
For a properly tuned subscriber with fast SSD and the publisher under moderate load, expect roughly 30 to 60 GB per hour for the initial table copy phase. A 500 GB Rails database typically takes 10 to 15 hours of initial sync, after which streaming catches up in minutes. Run this phase well before your cutover window — the slot will accumulate WAL on the publisher the whole time, so size the publisher disk accordingly.
What is the difference between logical replication and pg_dump / pg_restore?
pg_dump is a point-in-time snapshot — any writes that happen after the dump are missed. Logical replication is continuous, so writes that arrive during the initial copy and after are captured. You use pg_dump --schema-only to seed the subscriber, but the actual data sync is via the replication stream, which is what makes the zero-downtime cutover possible.
Do I need to change my Rails application code for logical replication to work?
No. Logical replication is invisible to the application — both publisher and subscriber are normal Postgres databases that accept normal queries. The only Rails-side change is the DATABASE_URL swap at cutover time, and optionally the write-pause middleware. If you are using Rails multi-database support you may want to keep both databases configured temporarily, but no model code changes.
Planning a Postgres major version upgrade or a database move and want a second pair of eyes on the runbook? TTB Software helps Rails teams design and execute zero-downtime database migrations with logical replication. We have been doing this for nineteen years.
Related Articles
Sidekiq to Solid Queue Migration: A Zero-Downtime Guide for Production Rails Apps
Rails Action Mailbox: Processing Inbound Emails in Production with Postmark, Mailgun and SendGrid
Rails Active Job Retries: Exponential Backoff, Circuit Breakers and Dead Letter Queues