I spent a week building sed patches, retry loops, and skip-on-failure logic to replay 590 database migrations on a fresh Postgres. A colleague said: “just run them in the right order.” He was right. Plain sqitch deploy worked for all 897 migrations across five databases. Zero patches needed.
The Background
The project I was working on had five databases — index_data, pool_core, contracts, platform, payments — managed by Sqitch. Combined, they have 897 migrations. In production, they’re applied incrementally: one migration per PR merge. Nobody ever replays them all from scratch.
Except when you build a new test environment from nothing.
The First Attempt
I tried running all 590 pool_core migrations on a fresh database:
sqitch deploy --plan-file sqitch.pool-core.plan
It failed at migration 87: an ALTER TYPE ... ADD VALUE '5 minute' with no IF NOT EXISTS. The enum value already existed from an earlier migration that created the type. I patched it with sed:
find deploy/ -name "*.sql" -exec sed -i -E \
"s/([Aa][Dd][Dd] [Vv][Aa][Ll][Uu][Ee]) '/\1 IF NOT EXISTS '/g" {} +
Then migration 94 failed: CREATE SCHEMA alt_coin, again with no IF NOT EXISTS. More sed patches. Then revert cascades broke, so I added --no-verify --mode change. Then came a retry loop with --log-only to skip permanently broken migrations.
The final script was 80 lines of bash, got 587/588 passing, and took 35 minutes.
What Changed
A senior engineer set up staging’s fresh database using the exact same Sqitch images. His manifests had three lines:
sqitch status --plan-file ${PATH_TO_PLAN}
sqitch deploy --plan-file ${PATH_TO_PLAN}
sqitch status --plan-file ${PATH_TO_PLAN}
No patches. No retries. No --no-verify. It just worked.
The Difference: Cross-Database Ordering
The five databases have hidden dependencies:
index_data (8 migrations)
↓ dblink foreign server
contracts (134 migrations)
↑ linked_users
pool_core (587 migrations)
↓ user_id reference
platform (128 migrations)
payments (43 migrations) — independent
My first attempt ran pool_core first, then contracts. Contracts creates a foreign server pointing to index_data via dblink. If index_data tables don’t exist yet, the migration silently creates broken references. Later migrations that query those references fail.
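The "silently" part is inherent to how foreign servers work: CREATE SERVER only records connection options, without validating them, so a server pointing at a database that doesn't exist yet is created without error. A sketch of the kind of statement involved (the server and option names are assumptions, not the actual migration):

```sql
-- CREATE SERVER stores these options without connecting, so this
-- succeeds even when the index_data tables don't exist yet.
-- (Server name and options here are illustrative assumptions.)
CREATE SERVER index_data_srv
  FOREIGN DATA WRAPPER dblink_fdw
  OPTIONS (host '127.0.0.1', dbname 'index_data');

-- Only a later query that actually opens the connection, e.g.
-- dblink('index_data_srv', 'SELECT ...'), can fail.
```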
The correct order:
1. index_data — creates the tables contracts needs to link to
2. pool_core — creates the tables platform references
3. contracts — creates the foreign server to index_data (both exist now)
4. platform — references pool_core tables
5. payments — independent
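With the order known, the whole from-scratch bootstrap reduces to a loop over plan files in dependency order. This is a minimal sketch: the plan-file names below are assumptions extrapolated from the sqitch.pool-core.plan naming seen earlier (only that one appears in the repo for certain), and DRY_RUN defaults to printing the commands so the ordering can be reviewed before touching a database.

```shell
#!/usr/bin/env bash
# Deploy all five databases in dependency order.
# Plan-file names are assumptions based on sqitch.pool-core.plan;
# adjust to your repository's actual layout.
set -euo pipefail

# index_data and pool_core first, then the databases that
# reference them; payments last (independent of the rest).
PLANS=(
  sqitch.index-data.plan
  sqitch.pool-core.plan
  sqitch.contracts.plan
  sqitch.platform.plan
  sqitch.payments.plan
)

# DRY_RUN=1 (the default here) prints each command instead of
# running it, so the ordering can be checked first.
for plan in "${PLANS[@]}"; do
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "sqitch deploy --plan-file $plan"
  else
    sqitch deploy --plan-file "$plan"
  fi
done
```

Running it with DRY_RUN=0 performs the actual deploys; any failure stops the loop immediately because of set -e, which is what you want when later databases depend on earlier ones.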
When I followed this order, plain sqitch deploy worked for all 897 migrations. No sed patches, no retries, no skips.
The Two Fixes That Were Actually Needed
The order alone wasn’t enough. Two infrastructure issues had to be resolved:
1. pg_hba.conf blocking localhost dblink. Contracts connects to index_data via dblink on localhost. The Postgres pg_hba rules rejected non-SSL connections from 127.0.0.1. I added an explicit rule:
host index_data data_reader 127.0.0.1/32 scram-sha-256
2. Hardcoded password mismatch. A migration creates the data_reader role with password 'data_reader'. The Postgres operator had already created the role with a random 64-character password. The dblink connection used the hardcoded password, which didn’t match.
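Assuming a disposable test environment where resetting the password is acceptable, one way to resolve the mismatch is to align the role's password with the hardcoded value and then verify the same localhost dblink path the contracts migrations use. A sketch (never do the password reset in production):

```sql
-- Test-environment-only fix: make the role's password match the
-- value hardcoded in the migration's dblink connection string.
ALTER ROLE data_reader WITH PASSWORD 'data_reader';

-- Sanity check from the contracts database back to index_data
-- over localhost; requires the dblink extension to be installed.
SELECT dblink_connect(
  'host=127.0.0.1 dbname=index_data user=data_reader password=data_reader'
);
```

If dblink_connect returns OK, both fixes (the pg_hba rule and the password) are in place.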
Takeaway
Before writing workarounds for broken migrations, check if the problem is ordering. Cross-database dependencies are invisible in individual migration files — they only surface when you replay everything from scratch. The fix wasn’t in the SQL; it was in the sequence.