588 Database Migrations on a Fresh Postgres — Why They All Failed

We had 588 database migrations that worked perfectly in staging. They’d been applied one at a time over months, each PR adding a migration or two. Then we tried to replay all 588 on a fresh database. Every single migration strategy failed.

This is the story of why migrations designed for incremental apply break on full replay, and how we made them work.

The Context

The project uses Sqitch for database migrations. Each migration has three files: deploy (apply), revert (rollback), and verify (check). Staging had 588 successful migrations applied incrementally — one at a time as features shipped.

We needed to bootstrap a new environment from scratch. Same code, fresh Postgres, zero state. Should be simple: sqitch deploy replays everything from migration 1 to 588.

The First Failure: ADD VALUE

ALTER TYPE app.subaccount_status ADD VALUE 'archived';

Postgres ADD VALUE for enum types is not idempotent. If the value already exists (from a previous migration that partially applied before failing), the migration crashes. There’s no IF NOT EXISTS for ADD VALUE in older Postgres versions.

On incremental apply, this never fails — each migration runs once. On full replay, if a migration fails partway and retries, the enum value might already exist.

The Second Failure: CREATE SCHEMA

CREATE SCHEMA app_private;

Same problem. CREATE SCHEMA without IF NOT EXISTS. On a retry, the schema exists, the migration fails.

The Third Failure: Revert Cascades

Sqitch’s default behavior on failure: revert the failed migration, then retry. But the revert scripts had their own problems — DROP SCHEMA CASCADE would nuke tables created by earlier migrations. One failure cascaded into destroying half the database state.

The Fix: sed Patches + Retry Loop

Since we couldn’t modify the migration files (they’re in a shared schema repo used by staging and production), we patched them at runtime:

# Copy migrations to writable location
cp -r /sqitch /tmp/sqitch-work
cd /tmp/sqitch-work

# Patch 1: ADD VALUE → ADD VALUE IF NOT EXISTS
find deploy/ -name "*.sql" -exec \
  sed -i -E "s/([Aa][Dd][Dd] [Vv][Aa][Ll][Uu][Ee]) '/\1 IF NOT EXISTS '/g" {} +

# Patch 2: CREATE SCHEMA → CREATE SCHEMA IF NOT EXISTS
find deploy/ -name "*.sql" -exec \
  sed -i -E "s/([Cc][Rr][Ee][Aa][Tt][Ee] [Ss][Cc][Hh][Ee][Mm][Aa]) ([a-z_]+);/\1 IF NOT EXISTS \2;/g" {} +

For the revert cascade problem, we used --no-verify --mode change:

--no-verify: skip verify scripts (they reference state that might not exist on fresh DB)
--mode change: on failure, only revert the single failed migration, not everything

And a retry loop that log-skips permanently failing migrations:

MAX_RETRIES=30
for attempt in $(seq 1 $MAX_RETRIES); do
  if sqitch deploy --plan-file "$PLAN" --no-verify --mode change 2>&1; then
    echo "All migrations deployed"
    break
  fi
  # Get the stuck migration
  NEXT=$(sqitch status --plan-file "$PLAN" 2>&1 | grep "^  \*" | head -1 | awk '{print $2}')
  if [ -z "$NEXT" ]; then
    echo "No more undeployed changes"
    break
  fi
  # Log-only skip it (mark as deployed without running)
  echo "Skipping $NEXT"
  sqitch deploy --plan-file "$PLAN" --log-only --to "$NEXT" 2>&1 || true
done

The Result

588 out of 588 migrations applied. Some were sed-patched, some were log-skipped (truly incompatible with fresh state), but the final schema matched staging exactly.

Why This Is a Common Problem

Most migration tools (Sqitch, Flyway, Alembic, Rails) are designed for incremental apply — each migration runs once, in order, on an existing database. The assumption is that the previous state exists.

Full replay from zero breaks this assumption:

Pattern	Works incrementally	Fails on replay
`ADD VALUE 'x'`	Yes (value doesn’t exist)	Fails on retry (value exists)
`CREATE SCHEMA x`	Yes (schema doesn’t exist)	Fails on retry (schema exists)
`INSERT INTO seeds`	Yes (table empty)	Fails (duplicate key)
`DROP ... CASCADE` in revert	Rarely runs	Nukes dependent objects

The Better Solution

For new projects: make every migration idempotent from day one. Use IF NOT EXISTS, ON CONFLICT DO NOTHING, and never write reverts with CASCADE.

For existing projects with hundreds of non-idempotent migrations: maintain a “golden snapshot” — a pg_dump of the current schema that can bootstrap new environments in seconds. Run migrations incrementally only for changes after the snapshot. We’re moving toward this approach.

Takeaway

If your database has more than ~50 migrations, test a full replay on an empty database before you need it. The earlier you find non-idempotent migrations, the cheaper they are to fix. By migration 588, the only option was runtime sed patches and retry loops.