Rails Active Job Retries: Exponential Backoff, Circuit Breakers and Dead Letter Queues

Roger Heykoop
Ruby on Rails, DevOps
Rails Active Job retries: exponential backoff, circuit breakers, dead letter queues, idempotency and production patterns for resilient background work.

A founder called me at 2am on a Tuesday last September because his SaaS had been quietly losing money for sixteen days. The story was simple: his Stripe webhook handler enqueued an Active Job to provision the customer’s account; the provisioning job called a third-party API; the third-party API had been throwing 503s for ten seconds every few minutes; the job’s default retries had run out; and three hundred and forty paying customers had been silently dropped into a “payment received, account never created” black hole. The job had retry_on StandardError, attempts: 3. It looked correct. It was wrong in the specific way that production loves: it failed gracefully into nothing.

After nineteen years of Rails I have watched teams ship background jobs with the cheerful assumption that the job will run, and run successfully, and run exactly once. None of those three are true at scale. Rails Active Job retries are the difference between an app that survives the bad afternoon at AWS and an app that loses customer money. This post is the playbook I give every team I work with: how to design Rails Active Job retries that actually recover, when to use exponential backoff, how to wire in circuit breakers, and where to put a dead letter queue so the next 2am call wakes up a dashboard instead of a CTO.

Why Default Active Job Retries Are Not Enough

The default behavior for Rails Active Job retries is that any unhandled exception bubbles up to the queue adapter, and what happens next depends on that adapter: the job is silently dropped (some older configurations), retried with adapter-specific backoff (Sidekiq’s built-in retry set), or retried only according to whatever retry_on policy you wrote in the job (Solid Queue, GoodJob, Que).

Three things go wrong with the defaults:

  1. Retry counts are too low. attempts: 3 is the example everyone copies. A 5-minute outage at a downstream provider with a 15-second backoff retries three times in 45 seconds and then gives up. You needed to retry for an hour.
  2. Backoff is uniform. A flat “wait 30 seconds and try again” hammers the recovering downstream service the second it comes back up — every queued job in your fleet hits it at the same moment, and you re-trigger the outage you were waiting out.
  3. Failed jobs vanish. Without an explicit dead letter strategy, a permanently-failed job becomes a row in a database table or a line in a log file. Nobody looks at it. The customer never gets the email. You find out in October.

The right model: every job is a unit of work that may fail for one of three reasons, and each reason needs its own response. Transient failures (network blip, downstream 503, DB deadlock) want retries with exponential backoff. Persistent failures (bad input data, expired token) want one retry at most, then escalation. Code bugs (NoMethodError, ArgumentError) want zero retries — they want a deploy.

Exponential Backoff with Jitter

The single most useful pattern in Rails Active Job retries is exponential backoff with jitter. Active Job supports it natively; since Rails 7.1 the wait option is spelled :polynomially_longer (previously :exponentially_longer):

class ProvisionAccountJob < ApplicationJob
  queue_as :default

  retry_on Stripe::APIConnectionError,
           Net::OpenTimeout,
           Net::ReadTimeout,
           wait: :polynomially_longer,
           attempts: 10

  discard_on ActiveJob::DeserializationError

  def perform(customer_id)
    customer = Customer.find(customer_id)
    Provisioning::Pipeline.new(customer).call
  end
end

wait: :polynomially_longer is Rails’ built-in helper: the delay before the next attempt is roughly executions ** 4 + 2 seconds, plus a random jitter proportional to that delay (15% of it by default). Across ten attempts the cumulative wait adds up to several hours, long enough to ride out almost every real outage I have seen. The jitter is the small random offset that prevents every retrying job in your fleet from hitting the recovering API at the same millisecond.
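
If you want to eyeball the schedule before shipping it, the base curve (jitter ignored) is easy to print in a console:

delays = (1..9).map { |executions| executions ** 4 + 2 }
# => [3, 18, 83, 258, 627, 1298, 2403, 4098, 6563]
# With attempts: 10 there are nine waits before the job gives up;
# their sum is roughly 15,350 seconds, a little over four hours.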

If you need finer control, write the function yourself:

class ChargeCardJob < ApplicationJob
  # wait: also accepts a callable that receives the execution count,
  # so the backoff curve is yours to shape (the result is in seconds).
  retry_on Stripe::RateLimitError,
           attempts: 8,
           wait: ->(executions) { (2 ** executions) + rand(0..30) }
end

Two things to notice. First, the lambda receives only the execution count, so if you need the error itself, say to read a Retry-After header, rescue it inside perform and re-enqueue with retry_job (sketch below); the block form of retry_on also hands you the job and the error, but it fires only after the configured attempts are exhausted. Second, when a downstream API tells you Retry-After: 60, you respect it. The single fastest way to get permanently rate-limited by Stripe or Twilio is to ignore their backoff hints.
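
A minimal sketch of that rescue-and-reschedule variant, using a hypothetical CrmClient whose RateLimited error exposes the provider's retry_after hint:

class SyncContactsJob < ApplicationJob
  MAX_ATTEMPTS = 8

  def perform(account_id)
    CrmClient.new.sync(account_id)
  rescue CrmClient::RateLimited => error
    # executions is incremented on every run, including manual re-enqueues.
    raise if executions >= MAX_ATTEMPTS

    # Respect the provider's hint; fall back to exponential backoff with jitter.
    delay = error.retry_after || (2 ** executions) + rand(0..30)
    retry_job(wait: delay)
  end
end

retry_job re-enqueues the same job instance, so executions keeps counting across runs and the cap above still holds.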

The post on Rails webhook processing with idempotency makes the related point about handling the inbound side: at any network boundary, requests get retried, duplicated, and reordered. The job system is the outbound side of the same problem.

Idempotency Is Not Optional

If your job is going to retry, it is going to run twice. Sometimes ten times. Rails Active Job retries are only safe when the job is idempotent — running it twice produces the same end state as running it once.

The pattern I push every team toward:

class ProvisionAccountJob < ApplicationJob
  retry_on Net::OpenTimeout, attempts: 8, wait: :polynomially_longer

  def perform(customer_id)
    customer = Customer.find(customer_id)

    # Cheap guard: a previous run already finished the work.
    return if customer.provisioned_at.present?

    # with_lock opens a transaction, takes a row lock, and reloads the record.
    customer.with_lock do
      # Re-check under the lock: another worker may have provisioned
      # this customer while we were waiting for it.
      return if customer.provisioned_at.present?

      Provisioning::Pipeline.new(customer).call
      customer.update!(provisioned_at: Time.current)
    end
  end
end

Three guards: an early return on the plain, unlocked read (cheap), a row-level lock to serialize concurrent retries, and a re-check under the lock for the case where another worker provisioned the customer while we were waiting. This is belt and suspenders. In production, every one of those guards eventually saves you.

For jobs that call external APIs, idempotency keys are the right tool. Stripe, Twilio, SendGrid, and most modern providers accept an Idempotency-Key header. Generate one deterministically from the job arguments and pass it on every retry:

def perform(charge_id)
  charge = Charge.find(charge_id)
  Stripe::Charge.create(
    {
      amount: charge.amount_cents,
      currency: "usd",
      customer: charge.stripe_customer_id,
    },
    # Deterministic per charge, identical on every retry.
    idempotency_key: "charge-#{charge.id}"
  )
end

The key is the charge, not the attempt. Every retry sends the same key. Stripe deduplicates server-side. The customer gets billed once even if your job runs eleven times. For deeper guarantees, Postgres advisory locks give you mutual exclusion across workers without a row lock.
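
A minimal sketch of the advisory-lock approach, assuming Postgres; the helper name is mine, not a Rails API:

require "zlib"

# Serialize a critical section across every worker with a transaction-scoped
# Postgres advisory lock; it is released automatically on commit or rollback.
def with_pg_advisory_lock(name)
  lock_id = Zlib.crc32(name) # pg_advisory_xact_lock wants an integer key
  ApplicationRecord.transaction do
    ApplicationRecord.connection.execute("SELECT pg_advisory_xact_lock(#{lock_id})")
    yield
  end
end

# Usage inside a job:
#   with_pg_advisory_lock("charge-#{charge.id}") { charge_card!(charge) }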

Discriminating Errors: Retry, Discard, Escalate

The default rescue_from StandardError mindset is poison for Rails Active Job retries. Different errors deserve different responses:

class SendInvoiceJob < ApplicationJob
  # Retry transient infrastructure errors generously.
  retry_on Net::OpenTimeout, Net::ReadTimeout,
           Errno::ECONNRESET, Errno::ECONNREFUSED,
           wait: :polynomially_longer, attempts: 10

  # Rate limits: retry, and when those attempts run out, re-enqueue with the
  # provider's hint instead of giving up (the block only fires on exhaustion).
  retry_on Stripe::RateLimitError, attempts: 12 do |job, error|
    # Assumes the error exposes the Retry-After value; adjust for your client.
    job.class.set(wait: error.retry_after || 30).perform_later(*job.arguments)
  end

  # Bad input data: no retry, escalate.
  discard_on ActiveRecord::RecordNotFound do |job, error|
    Sentry.capture_exception(error, extra: { job: job.class.name, args: job.arguments })
  end

  # Programming errors: fail loudly.
  # (Anything not handled above bubbles up, the adapter records the failure,
  # and the dead-letter layer below picks it up.)

  def perform(invoice_id)
    Invoice.find(invoice_id).deliver!
  end
end

The shape of that file is the lesson. There is no single retry policy. There are policies per error class, organized from “definitely retry” to “definitely give up.” If you cannot articulate which exceptions should retry and which should not, you do not yet know what your job does.

Circuit Breakers for Downstream Providers

Exponential backoff is a per-job pattern. When all of your jobs are calling the same downstream — say, every webhook processor calls SendGrid — you also need a fleet-level response when the downstream goes down. That is what a circuit breaker is for.

The pattern, using the stoplight gem, which packages the state machine nicely:

class SendGridClient
  Failure = Class.new(StandardError)

  def deliver(message)
    # Builder syntax varies slightly between stoplight versions; check yours.
    Stoplight("sendgrid")
      .with_threshold(5)      # open after five consecutive failures
      .with_cool_off_time(60) # stay open for sixty seconds before probing
      .with_error_handler { |error, handle| handle.call(error) unless error.is_a?(Failure) }
      .run { post_to_sendgrid(message) }
  end
end

Five consecutive failures and the circuit opens: subsequent calls fail immediately for sixty seconds without touching SendGrid. Jobs that hit an open circuit raise a known exception that retry_on reschedules with backoff. The downstream gets a chance to recover instead of being hammered by ten thousand queued jobs at the moment it comes back up. The breaker closes after a successful probe. Latency drops back to normal.
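
Wiring the job side to the breaker is one retry_on line. A sketch, assuming stoplight's red-light error class:

class DeliverEmailJob < ApplicationJob
  # An open breaker fails fast with a red-light error; treat it like any other
  # transient failure and come back later with backoff and jitter.
  retry_on Stoplight::Error::RedLight, wait: :polynomially_longer, attempts: 10

  def perform(message_id)
    SendGridClient.new.deliver(Message.find(message_id))
  end
end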

A subtle point: the breaker state must live somewhere shared across workers. Stoplight defaults to in-process; in production, point it at Redis so all your workers see the same circuit state. Without shared state, every worker discovers the outage independently and you get no fleet-level protection.
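
The wiring is a one-time initializer. A sketch; the exact configuration call differs between stoplight versions, so treat this as the shape rather than the API:

# config/initializers/stoplight.rb
require "stoplight"
require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL"))

# Older stoplight releases configure a global default data store like this;
# newer ones use a Stoplight.configure block. Check your version's README.
Stoplight::Light.default_data_store = Stoplight::DataStore::Redis.new(redis)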

Dead Letter Queues: Where Failed Jobs Go to Be Found

After all the retries are exhausted, the job has to go somewhere a human can find it. The queue adapter matters here. Sidekiq has a built-in dead-set. Solid Queue stores failed jobs in solid_queue_failed_executions with the full error, backtrace, and arguments. GoodJob keeps them in good_jobs with error and finished_at.

The mistake teams make is trusting the adapter’s default UI to be the operational endpoint. It is not. The pattern that works:

  1. After all retries are exhausted, write a domain-meaningful row to a failed_jobs table you own. Include the job class, arguments, error class, error message, and a resolved_at column (a sketch of the migration and the ApplicationJob hook follows this list).
  2. Page on-call when the table grows past a threshold for a critical job class.
  3. Build a tiny admin UI that lists failed jobs, lets a human inspect them, and offers a one-click retry and discard.
  4. Make a weekly review of the table part of the on-call handoff. Patterns of failure are a free roadmap.
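
The table itself is nothing exotic. A minimal migration sketch, assuming Postgres and a jsonb column for the serialized arguments:

class CreateFailedJobs < ActiveRecord::Migration[7.1]
  def change
    create_table :failed_jobs do |t|
      t.string   :job_class, null: false
      t.jsonb    :arguments, null: false, default: []
      t.string   :error_class
      t.text     :error_message
      t.text     :backtrace
      t.datetime :failed_at, null: false
      t.datetime :resolved_at
      t.timestamps
    end

    add_index :failed_jobs, [:job_class, :resolved_at]
  end
end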

The hook that feeds the table lives in ApplicationJob. A sketch using Rails 7.1’s after_discard callback, which runs when Active Job is about to give up on a job:

class ApplicationJob < ActiveJob::Base
  # Runs when the job will not be retried again: discard_on matched
  # or retry_on attempts ran out (Rails 7.1+).
  after_discard do |job, error|
    FailedJob.create!(
      job_class: job.class.name,
      arguments: job.arguments,
      error_class: error.class.name,
      error_message: error.message,
      backtrace: error.backtrace&.first(20)&.join("\n"),
      failed_at: Time.current
    )
    Sentry.capture_exception(error)
  end
end

after_discard does not swallow the exception: when retries are exhausted the original error is still re-raised, so the queue adapter marks the job as failed in its own tracking. The FailedJob row is your operational layer on top of that, not a replacement for it. The post on Rails Solid Queue covers the underlying mechanics if you want to skip Redis entirely.

Time-Boxing Long-Running Jobs

Exponential backoff is dangerous when combined with jobs that already take a long time. A 12-minute job that retries ten times polynomially can sit in the system for the better part of a day. Some queue backends offer per-job expiry, but Active Job itself has no first-class deadline, so set one explicitly:

class GenerateReportJob < ApplicationJob
  DEADLINE = 30.minutes

  retry_on Net::OpenTimeout, attempts: 5, wait: :polynomially_longer

  # Pass the timestamp at enqueue time:
  #   GenerateReportJob.perform_later(report.id, enqueued_at: Time.current.iso8601)
  # The keyword default below is evaluated when perform runs, so relying on it
  # would defeat the deadline; it only covers direct perform_now calls.
  def perform(report_id, enqueued_at: Time.current.iso8601)
    if Time.current - Time.parse(enqueued_at) > DEADLINE
      Rails.logger.warn("dropping #{self.class} #{report_id}: deadline exceeded")
      ReportStatusMailer.expired(report_id).deliver_later
      return
    end

    Reports::Generator.new(report_id).call
  end
end

The job carries its own enqueue time and bows out cleanly once the deadline has passed. The customer gets a “we couldn’t generate this in time, please try again” email instead of a report that arrives the next morning.

Observability: You Cannot Fix What You Cannot See

Every team I audit that has retry problems also has visibility problems. Wire up two things at minimum:

ActiveSupport::Notifications.subscribe("retry_stopped.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.payload[:job]
  Statsd.increment("active_job.retry_stopped", tags: ["job:#{job.class.name}"])
  Sentry.capture_exception(event.payload[:error], extra: { job: job.class.name, args: job.arguments })
end

ActiveSupport::Notifications.subscribe("enqueue_retry.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  Statsd.increment("active_job.retry", tags: ["job:#{event.payload[:job].class.name}"])
end

Now you have a dashboard line per job class for “retried” and “gave up.” A retry rate spike is the leading indicator of a downstream outage; a “gave up” spike is the leading indicator of an incident. Page on the second one. Watch the first one. The team I rebuilt this for went from “we found out via Twitter” to “we paged ourselves before the customer noticed” inside two weeks.

Production Numbers

Three apps where wiring Rails Active Job retries correctly changed the outcome:

  • A telehealth scheduling app: a 4-hour AWS networking degradation that previously dropped 18,000 appointment-confirmation emails dropped exactly zero after we wired exponential backoff with attempts: 12 and a circuit breaker on the SMS provider. Total cost: 90 minutes of pair programming.
  • A fintech onboarding flow: introducing a failed_jobs table and a 5-line admin page reduced “ghost users stuck in pending state” from 120/month to under 5/month. Customer support tickets about “where is my account?” dropped 71%.
  • A B2B SaaS: replacing fixed wait: 30.seconds, attempts: 3 with :polynomially_longer, attempts: 10 and idempotency keys on every external call eliminated a recurring weekly incident where a flaky vendor took down their billing pipeline.

The wins were small in code and large in operations. Every one of these teams told me the same thing afterward: “I cannot believe how stable the queue is now.”

Frequently Asked Questions

What is the right number of attempts for Rails Active Job retries?

It depends on what the job calls and how long the worst plausible outage lasts. For pure database work, 3 attempts is plenty. For external API calls subject to provider outages, 8 to 12 attempts with wait: :polynomially_longer covers multi-hour incidents while staying under a day of total elapsed time. The number is a function of “how long do I want to wait before I give up and tell a human?” not a magic constant.

Are Rails Active Job retries safe without idempotency?

No. Any job that retries can run twice; designing the job so that running it twice is harmless is the prerequisite for using retries at all. If your job sends an email, charges a card, or writes to an external system, you need either an idempotency key on the external call, a database guard (“did we already do this?”), or both. Without idempotency, retries make a bad day worse by duplicating side effects.

Should I use Sidekiq retry options or Active Job retry_on?

Use Active Job’s retry_on and discard_on. They work across queue adapters (Sidekiq, Solid Queue, GoodJob, Que) and let you swap backends without rewriting jobs. Sidekiq’s adapter-specific retry is fine for Sidekiq-only apps, but you lose portability and you cannot express the same policy on a Solid Queue worker. Active Job retries also expose ActiveSupport::Notifications events you can subscribe to for observability.

How do I retry a failed Active Job after the retries are exhausted?

If you logged the failure to a failed_jobs table, retrying is failed_job.job_class.constantize.perform_later(*failed_job.arguments) from your admin UI. If you are relying on the queue adapter, Sidekiq has a “Retry” button in its dashboard, Solid Queue exposes failed_executions.retry_all, and GoodJob has a similar API. The important part is making the action one click; the more steps it takes, the longer customers wait while a human reads logs.
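
On the admin side, the one-click retry can be a single controller action. A sketch, assuming a hypothetical Admin::FailedJobsController and jobs with simple, JSON-friendly arguments:

class Admin::FailedJobsController < ApplicationController
  # Re-enqueue the original job and mark the row as handled.
  def retry
    failed_job = FailedJob.find(params[:id])
    failed_job.job_class.constantize.perform_later(*failed_job.arguments)
    failed_job.update!(resolved_at: Time.current)
    redirect_to admin_failed_jobs_path, notice: "Re-enqueued #{failed_job.job_class}"
  end
end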


Need help making a Rails background job pipeline survive its bad afternoons? TTB Software specializes in Rails reliability, queue architecture, and incident response for product teams. We have been doing this for nineteen years.

#rails-active-job #active-job-retry #rails-background-jobs #exponential-backoff #circuit-breaker #dead-letter-queue #ruby-on-rails

About the Author

Roger Heykoop is a senior Ruby on Rails developer with 19+ years of Rails experience and 35+ years in software development. He specializes in Rails modernization, performance optimization, and AI-assisted development.
