Database Migrations That Won’t Wake You at 3 AM
The phone buzzes. 3:17 AM. Your deployment went live at midnight, everything looked fine, and now your database is locked up tighter than a bicycle at Amsterdam Centraal.
I’ve been there. After nineteen years of Rails deployments, I’ve learned that the migrations which look the simplest are often the ones that bite hardest.
The Migration That Looks Innocent
Here’s a classic:
add_index :orders, :customer_id
One line. Clean. What could go wrong?
On a table with 50 million rows, Postgres will lock the entire table while it builds that index. Depending on your hardware, that’s anywhere from 30 seconds to several minutes where nobody can insert, update, or delete orders. Your checkout flow? Dead. Customer service queue? Growing.
Building Indexes Without the Drama
Postgres has a CONCURRENTLY option for index creation. It takes longer, but it won’t block writes:
class AddCustomerIdIndexToOrders < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :orders, :customer_id, algorithm: :concurrently
  end
end
The disable_ddl_transaction! is crucial. Concurrent index builds can’t run inside a transaction, so you need to tell Rails to skip wrapping this migration in one.
One catch: if the migration fails halfway through, you’ll have an invalid index sitting there. Always check for stragglers:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
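If that query returns anything, drop the invalid index and re-run the migration. A minimal cleanup sketch, assuming the index was created on orders.customer_id as above:

class DropInvalidOrdersCustomerIdIndex < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    # Remove only the half-built index; CONCURRENTLY keeps writes flowing while it goes.
    remove_index :orders, :customer_id, algorithm: :concurrently, if_exists: true
  end
end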
The Expand and Contract Pattern
Renaming a column seems straightforward enough:
rename_column :users, :email, :email_address
The problem? Your running application still expects email. The moment that migration runs, every request trying to read users.email explodes.
The safer path takes three deployments:
Deploy 1: Expand
class AddEmailAddressToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :email_address, :string
  end
end
Add a callback in your model to keep both columns in sync:
before_save :sync_email_columns

def sync_email_columns
  self.email_address = email if email_changed?
  self.email = email_address if email_address_changed?
end
Deploy 2: Migrate
Update your application code to use email_address everywhere. Run a backfill script to copy existing data. Take your time—nothing is broken.
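A minimal backfill sketch, assuming a one-off rake task (the task name is illustrative). Because the model callback keeps new rows in sync, it only has to touch rows created before Deploy 1:

# lib/tasks/backfill_email_address.rake
namespace :backfill do
  desc "Copy users.email into users.email_address in batches"
  task email_address: :environment do
    # Only rows that haven't been synced yet.
    User.where(email_address: nil).in_batches(of: 1_000) do |batch|
      # One UPDATE per batch instead of one query per row.
      batch.update_all("email_address = email")
      sleep(0.1) # give normal traffic room between batches
    end
  end
end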
Deploy 3: Contract
class RemoveEmailFromUsers < ActiveRecord::Migration[7.1]
  def change
    remove_column :users, :email, :string
  end
end
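One extra safeguard before the contract deploy: tell ActiveRecord to ignore the old column first, so any running process still holding a cached column list doesn’t trip over the drop. A minimal sketch:

# app/models/user.rb (ship this before, or alongside, the drop)
class User < ApplicationRecord
  self.ignored_columns += ["email"]
end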
Yes, it’s three deployments for a column rename. But you know what’s faster than three careful deployments? Not having to restore from backup at 4 AM while your CTO breathes down your neck.
Lock Timeouts: Your Safety Net
Even with careful planning, things slip through. A migration that works fine in staging with 10,000 rows can choke on production with 10 million.
Set a lock timeout in your migrations:
class SafeMigration < ActiveRecord::Migration[7.1]
  def change
    execute "SET lock_timeout = '5s'"
    # your migration here
  end
end
If the migration can’t acquire a lock within 5 seconds, it fails instead of waiting indefinitely (while stacking up a queue of blocked queries behind it).
Better yet, set these as session variables in your database.yml so every connection, migrations included, gets them:

production:
  variables:
    lock_timeout: 5000        # milliseconds
    statement_timeout: 60000  # milliseconds
The Strong Migrations Gem
Andrew Kane’s strong_migrations gem catches dangerous migrations before they hit production. It’ll yell at you for:
- Adding indexes without CONCURRENTLY
- Adding columns with default values (before Postgres 11, this rewrote the entire table)
- Changing column types in ways that lock tables
- Removing columns that your code might still reference
Install it. Configure it. Listen to it.
# Gemfile
gem "strong_migrations"
The gem won’t stop you from doing risky things—sometimes you genuinely need to. But it forces you to acknowledge the risk explicitly, which means you’re more likely to schedule that migration for 3 AM yourself, with a cup of coffee, rather than getting woken up by PagerDuty.
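When you genuinely do need to run something it flags, the gem’s safety_assured block is how you acknowledge that in the migration itself (the table and column names here are made up for illustration):

class RemoveLegacyStatusFromOrders < ActiveRecord::Migration[7.1]
  def change
    # Explicit acknowledgement that we've verified this is safe for our data size.
    safety_assured { remove_column :orders, :legacy_status, :string }
  end
end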
Backfilling Large Tables
Got a new column that needs to be populated from existing data? The tempting approach:
User.update_all("full_name = first_name || ' ' || last_name")
On a large table, this is a single massive transaction that locks rows and bloats your transaction log.
Batch it:
User.in_batches(of: 1000) do |batch|
  batch.update_all("full_name = first_name || ' ' || last_name")
  sleep(0.1) # breathe
end
The sleep might seem odd, but it gives your database room to handle normal traffic between batches. Without it, you’re essentially running a denial-of-service attack against yourself.
For really large tables, consider running backfills as background jobs rather than migrations. Migrations should be fast and reversible. A backfill that takes 6 hours is neither.
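One way to structure that, as a sketch: an ActiveJob that processes a single batch and re-enqueues itself (class and column names are illustrative):

class BackfillFullNameJob < ApplicationJob
  queue_as :low_priority

  BATCH_SIZE = 1_000

  def perform(start_id = 0)
    ids = User.where("id > ?", start_id).order(:id).limit(BATCH_SIZE).pluck(:id)
    return if ids.empty? # done

    User.where(id: ids).update_all("full_name = first_name || ' ' || last_name")

    # Hand the next slice to a fresh job so no single job runs for hours,
    # and a deploy or crash only loses one batch of progress.
    self.class.perform_later(ids.last)
  end
end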
Testing Migrations Against Production Data
Your staging database has 500 users. Production has 2 million. That migration that runs in 0.3 seconds on staging? It might take 45 minutes on production.
If your team uses feature flags to control rollouts, you can combine them with migrations: deploy the migration first, then gradually enable the code that depends on the new schema. Either way, you want a realistic runtime estimate before anything touches production.
Options:
- Anonymized production dumps: Copy production data structure and volume to staging, with sensitive fields scrubbed. Expensive in storage and maintenance, but accurate.
- Query plans: Before running, check the migration’s queries with EXPLAIN on production (plain EXPLAIN only shows the plan; EXPLAIN ANALYZE actually executes the statement). It will show you if Postgres plans a full table scan.
- Multiplication estimates: If staging has 500 rows and the migration takes 0.3 seconds, production with 2 million rows might take roughly 1,200 seconds (20 minutes). This math is wildly imprecise, but it’s better than no estimate.
When Things Go Wrong Anyway
Sometimes despite all precautions, a migration still causes problems. Have a rollback plan ready:
- Can the migration be reversed? Rails makes this easy for most DDL changes.
- If you’re adding a column, can you drop it quickly?
- If you’re removing a column, do you have the data backed up somewhere?
- Who has production database access at 3 AM? (Write down their phone numbers before you need them.)
Document your runbook. The moment you’re panicking is the worst time to figure out rollback procedures.
The Culture Shift
Zero-downtime deployments require more than technical tricks. They require patience. That column rename takes three deploys instead of one. That index addition needs a separate migration from the feature code.
This feels slower at first. It is slower at first. But you make up that time—and then some—by not having production incidents. By not debugging at 3 AM. By not having angry customers.
Sleep well. Your database will be fine.
Frequently Asked Questions
Can I run zero-downtime migrations on MySQL, or is this PostgreSQL-only?
Most techniques described here work on both PostgreSQL and MySQL. Concurrent index creation (CONCURRENTLY) is PostgreSQL-specific, but MySQL 8.0+ supports online DDL for most index operations with ALGORITHM=INPLACE. The expand-and-contract pattern and lock timeouts apply to both databases.
How do I know if a migration is safe to run without downtime?
Use the strong_migrations gem as your first line of defense — it flags known-dangerous patterns automatically. Beyond that, test the migration against a dataset that matches production volume, and check the query plan with EXPLAIN ANALYZE. If the migration requires an exclusive lock on a large table, it’s not safe without modification.
Should I run migrations during deployment or separately?
For zero-downtime deploys, run migrations before deploying the new application code. This way, the old code keeps running against the updated schema. Your migration must be backward-compatible with the currently running code — the expand-and-contract pattern guarantees this.
What’s the safest way to remove a column from a production table?
First, stop all application code from reading or writing the column (deploy that change). Then wait for active queries to finish. Finally, run the migration to drop the column. Rails’ ignored_columns setting lets you tell ActiveRecord to stop using a column before it’s physically removed.
Need help modernizing your deployment pipeline? TTB Software specializes in Rails DevOps and zero-downtime deployments. We’ve been doing this for nineteen years—we’ve made all the mistakes so you don’t have to.
About the Author
Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.