Database Migrations That Won’t Wake You at 3 AM
The phone buzzes. 3:17 AM. Your deployment went live at midnight, everything looked fine, and now your database is locked up tighter than a bicycle at Amsterdam Centraal.
I’ve been there. After nineteen years of Rails deployments, I’ve learned that the migrations which look the simplest are often the ones that bite hardest.
The Migration That Looks Innocent
Here’s a classic:
add_index :orders, :customer_id
One line. Clean. What could go wrong?
On a table with 50 million rows, Postgres will lock the entire table while it builds that index. Depending on your hardware, that’s anywhere from 30 seconds to several minutes where nobody can insert, update, or delete orders. Your checkout flow? Dead. Customer service queue? Growing.
Building Indexes Without the Drama
Postgres has a CONCURRENTLY option for index creation. It takes longer, but it won’t block writes:
class AddCustomerIdIndexToOrders < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :orders, :customer_id, algorithm: :concurrently
  end
end
The disable_ddl_transaction! is crucial. Concurrent index builds can’t run inside a transaction, so you need to tell Rails to skip wrapping this migration in one.
One catch: if the migration fails halfway through, you’ll have an invalid index sitting there. Always check for stragglers:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
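If that query returns anything, drop the invalid index and re-run the migration. A minimal cleanup sketch, assuming the index was created on orders.customer_id as above:

class DropInvalidOrdersCustomerIdIndex < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    # Remove only the half-built index; CONCURRENTLY keeps writes flowing while it goes.
    remove_index :orders, :customer_id, algorithm: :concurrently, if_exists: true
  end
end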
The Expand and Contract Pattern
Renaming a column seems straightforward enough:
rename_column :users, :email, :email_address
The problem? Your running application still expects email. The moment that migration runs, every request trying to read users.email explodes.
The safer path takes three deployments:
Deploy 1: Expand
class AddEmailAddressToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :email_address, :string
  end
end
Add a callback in your model to keep both columns in sync:
before_save :sync_email_columns

def sync_email_columns
  self.email_address = email if email_changed?
  self.email = email_address if email_address_changed?
end
Deploy 2: Migrate
Update your application code to use email_address everywhere. Run a backfill script to copy existing data. Take your time—nothing is broken.
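A minimal backfill sketch, assuming a one-off rake task (the task name is illustrative). Because the model callback keeps new rows in sync, it only has to touch rows created before Deploy 1:

# lib/tasks/backfill_email_address.rake
namespace :backfill do
  desc "Copy users.email into users.email_address in batches"
  task email_address: :environment do
    # Only rows that haven't been synced yet.
    User.where(email_address: nil).in_batches(of: 1_000) do |batch|
      # One UPDATE per batch instead of one query per row.
      batch.update_all("email_address = email")
      sleep(0.1) # give normal traffic room between batches
    end
  end
end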
Deploy 3: Contract
class RemoveEmailFromUsers < ActiveRecord::Migration[7.1]
  def change
    remove_column :users, :email, :string
  end
end
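One extra safeguard before the contract deploy: tell ActiveRecord to ignore the old column first, so any running process still holding a cached column list doesn’t trip over the drop. A minimal sketch:

# app/models/user.rb (ship this before, or alongside, the drop)
class User < ApplicationRecord
  self.ignored_columns += ["email"]
end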
Yes, it’s three deployments for a column rename. But you know what’s faster than three careful deployments? Not having to restore from backup at 4 AM while your CTO breathes down your neck.
Lock Timeouts: Your Safety Net
Even with careful planning, things slip through. A migration that works fine in staging with 10,000 rows can choke on production with 10 million.
Set a lock timeout in your migrations:
class SafeMigration < ActiveRecord::Migration[7.1]
  def change
    execute "SET lock_timeout = '5s'"
    # your migration here
  end
end
If the migration can’t acquire a lock within 5 seconds, it fails instead of waiting indefinitely (while stacking up a queue of blocked queries behind it).
Better yet, set these as session variables in your database.yml so every connection, migrations included, gets them:

production:
  variables:
    lock_timeout: 5000        # milliseconds
    statement_timeout: 60000  # milliseconds
The Strong Migrations Gem
Andrew Kane’s strong_migrations gem catches dangerous migrations before they hit production. It’ll yell at you for:
- Adding indexes without CONCURRENTLY
- Adding columns with default values (before Postgres 11, this rewrote the entire table)
- Changing column types in ways that lock tables
- Removing columns that your code might still reference
Install it. Configure it. Listen to it.
# Gemfile
gem "strong_migrations"
The gem won’t stop you from doing risky things—sometimes you genuinely need to. But it forces you to acknowledge the risk explicitly, which means you’re more likely to schedule that migration for 3 AM yourself, with a cup of coffee, rather than getting woken up by PagerDuty.
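When you genuinely do need to run something it flags, the gem’s safety_assured block is how you acknowledge that in the migration itself (the table and column names here are made up for illustration):

class RemoveLegacyStatusFromOrders < ActiveRecord::Migration[7.1]
  def change
    # Explicit acknowledgement that we've verified this is safe for our data size.
    safety_assured { remove_column :orders, :legacy_status, :string }
  end
end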
Backfilling Large Tables
Got a new column that needs to be populated from existing data? The tempting approach:
User.update_all("full_name = first_name || ' ' || last_name")
On a large table, this is a single massive transaction that locks rows and bloats your transaction log.
Batch it:
User.in_batches(of: 1000) do |batch|
  batch.update_all("full_name = first_name || ' ' || last_name")
  sleep(0.1) # breathe
end
The sleep might seem odd, but it gives your database room to handle normal traffic between batches. Without it, you’re essentially running a denial-of-service attack against yourself.
For really large tables, consider running backfills as background jobs rather than migrations. Migrations should be fast and reversible. A backfill that takes 6 hours is neither.
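One way to structure that, as a sketch: an ActiveJob that processes a single batch and re-enqueues itself (class and column names are illustrative):

class BackfillFullNameJob < ApplicationJob
  queue_as :low_priority

  BATCH_SIZE = 1_000

  def perform(start_id = 0)
    ids = User.where("id > ?", start_id).order(:id).limit(BATCH_SIZE).pluck(:id)
    return if ids.empty? # done

    User.where(id: ids).update_all("full_name = first_name || ' ' || last_name")

    # Hand the next slice to a fresh job so no single job runs for hours,
    # and a deploy or crash only loses one batch of progress.
    self.class.perform_later(ids.last)
  end
end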
Testing Migrations Against Production Data
Your staging database has 500 users. Production has 2 million. That migration that runs in 0.3 seconds on staging? It might take 45 minutes on production.
If your team uses feature flags to control rollouts, you can combine them with migrations: deploy the migration first, then gradually enable the code that depends on the new schema. Either way, you want a realistic runtime estimate before anything touches production.
Options:
- Anonymized production dumps: Copy production data structure and volume to staging, with sensitive fields scrubbed. Expensive in storage and maintenance, but accurate.
- Query plans: Before running, check the migration’s queries with EXPLAIN on production (plain EXPLAIN only shows the plan; EXPLAIN ANALYZE actually executes the statement). It will show you if Postgres plans a full table scan.
- Multiplication estimates: If staging has 500 rows and the migration takes 0.3 seconds, production with 2 million rows might take roughly 1,200 seconds (20 minutes). This math is wildly imprecise, but it’s better than no estimate.
When Things Go Wrong Anyway
Sometimes despite all precautions, a migration still causes problems. Have a rollback plan ready:
- Can the migration be reversed? Rails makes this easy for most DDL changes.
- If you’re adding a column, can you drop it quickly?
- If you’re removing a column, do you have the data backed up somewhere?
- Who has production database access at 3 AM? (Write down their phone numbers before you need them.)
Document your runbook. The moment you’re panicking is the worst time to figure out rollback procedures.
The Culture Shift
Zero-downtime deployments require more than technical tricks. They require patience. That column rename takes three deploys instead of one. That index addition needs a separate migration from the feature code.
This feels slower at first. It is slower at first. But you make up that time—and then some—by not having production incidents. By not debugging at 3 AM. By not having angry customers.
Sleep well. Your database will be fine.
Frequently Asked Questions
Can I run zero-downtime migrations on MySQL, or is this PostgreSQL-only?
Most techniques described here work on both PostgreSQL and MySQL. Concurrent index creation (CONCURRENTLY) is PostgreSQL-specific, but MySQL 8.0+ supports online DDL for most index operations with ALGORITHM=INPLACE. The expand-and-contract pattern and lock timeouts apply to both databases.
How do I know if a migration is safe to run without downtime?
Use the strong_migrations gem as your first line of defense — it flags known-dangerous patterns automatically. Beyond that, test the migration against a dataset that matches production volume, and check the query plan with EXPLAIN ANALYZE. If the migration requires an exclusive lock on a large table, it’s not safe without modification.
Should I run migrations during deployment or separately?
For zero-downtime deploys, run migrations before deploying the new application code. This way, the old code keeps running against the updated schema. Your migration must be backward-compatible with the currently running code — the expand-and-contract pattern guarantees this.
What’s the safest way to remove a column from a production table?
First, stop all application code from reading or writing the column (deploy that change). Then wait for active queries to finish. Finally, run the migration to drop the column. Rails’ ignored_columns setting lets you tell ActiveRecord to stop using a column before it’s physically removed.
Need help modernizing your deployment pipeline? TTB Software specializes in Rails DevOps and zero-downtime deployments. We’ve been doing this for nineteen years—we’ve made all the mistakes so you don’t have to.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.