Your Database Schema Is Not Zero-Downtime
You deploy your application with zero downtime, Blue-green, canary, rolling updates. You've done the work but unfortunately there's one part of most deployment pipelines that none of those strategies protect, the database schema migration.
Run ALTER TABLE orders ADD COLUMN status VARCHAR(50) NOT NULL DEFAULT 'pending' on a Postgres table with 50 million rows and you've just acquired an ACCESS EXCLUSIVE lock.
Nothing reads, nothing writes and that lock is held for the entire duration of the migration.
On a well-tuned instance with warm buffers, adding a column with a volatile default takes approximately 8-12 minutes on a 50M row table. On a shared RDS instance under load? You can bet longer. While that lock is held, every query that touches that table queues up behind it.
Connection pools fill and timeouts cascade until you have an outage. The worst part is that your deployment pipeline said the rollout was a "success".
What Postgres Actually Does During ALTER TABLE
Postgres has eight lock modes. ALTER TABLE operations typically require ACCESS EXCLUSIVE which is the most restrictive as it blocks all concurrent reads and writes.
Here's a breakdown of the sequence of events that take place behind the scenes:
ALTER TABLErequests anACCESS EXCLUSIVElock- Postgres waits for all current queries on that table to finish
- Every new query that arrives queues behind the lock request
- The migration runs (seconds to minutes depending on operation)
- Lock releases, queued queries execute
Step 3 is the killer. If a long-running report or a slow OLAP query is already in flight when your migration fires, the migration waits and so does every new query.
Let's take a look at two versions of an ALTER statement and discuss:
-- Triggers a full table rewrite in Postgres < 11 with non-constant defaults
ALTER TABLE orders ADD COLUMN status VARCHAR(50) NOT NULL DEFAULT NOW();
-- Safer: add nullable first, backfill separately, then constrain
ALTER TABLE orders ADD COLUMN status VARCHAR(50);
Adding a column with a constant default, as above, has been safe since Postgres 11. The value is stored in catalog metadata without rewriting the table. A non-constant default, or a NOT NULL constraint without a default, still triggers a full rewrite on older versions. It's very important to know which you're running.
The Expand/Contract Pattern
The fix is a deployment pattern, not a database trick. Expand/Contract breaks schema migrations into three independent phases:
Phase 1: Expand. Add only backward-compatible changes. The rules are:
- New columns are nullable.
- New tables have no foreign key constraints yet.
The old schema still works and you deploy your application alongside this change.
Phase 2: Migrate. Backfill data in batches, not in a single transaction. A 50M row backfill in one transaction holds a lock for the duration and risks a rollback that takes just as long. Instead batch in 1,000-10,000 row chunks with a short sleep between each batch.
Something like:
DO $$
DECLARE
batch_size INT := 5000;
last_id BIGINT := 0;
max_id BIGINT;
BEGIN
SELECT MAX(id) INTO max_id FROM orders;
WHILE last_id < max_id LOOP
UPDATE orders
SET status = 'pending'
WHERE id > last_id
AND id <= last_id + batch_size
AND status IS NULL;
last_id := last_id + batch_size;
PERFORM pg_sleep(0.05);
END LOOP;
END $$;
Phase 3: Contract. Once no application code reads or writes to the old column, drop it. This is another ACCESS EXCLUSIVE operation, but on a now unused column it completes in milliseconds.
Three separate deployments. Three independent release windows. No outage.
Tooling That Helps, and What It Misses
Tooling that version controls your migrations and enforce ordering are great for migration management but unfortunately they do nothing to prevent locking. They will happily run a table-rewriting ALTER TABLE and report success while your connection pool gets exhausted.
The tool pg_repack is great for structural changes. It rebuilds tables and indexes online, without holding ACCESS EXCLUSIVE for the duration. There's a brief lock at the start and finish, with the heavy work in the background.
If you need to change a column type, add a constraint to an existing column, or reclaim bloat, try pg_repack.
For MySQL and Aurora MySQL, pt-online-schema-change and gh-ost use trigger-based shadow table copying to achieve the same sort of result.
Specific Postgres Operations And Their Lock Profiles
Here's a quick rundown of different operations and their lock profiles.
- Adding a nullable column: safe, fast (Postgres 9.6+)
- Adding a column with a constant default: safe and fast (Postgres 11+)
- Adding a column with a volatile default: table rewrite, use Expand/Contract
- Adding
NOT NULLto an existing column: full table scan, useNOT VALIDthenVALIDATE CONSTRAINTseparately - Building a standard index: use
CREATE INDEX CONCURRENTLY, always - Adding a foreign key: share-row-exclusive lock, this is generally fast
Schema Migrations Are Deployment Decisions
The moment you separate "what the code does" from "how the data is shaped" you start making better migration decisions.
Schema changes need release windows and they also need rollback plans. They belong in your deployment runbook, not appended as an afterthought after the application containers are live.
If you're running zero-downtime application deployments but your schema migrations are still fire-and-forget, you've only solved for half the problem.
If you're working through a migration backlog, a schema freeze, or a deployment strategy that has grown too fragile to change safely, book a strategy call and we'll map out a path forward.