Keeping a database running isn’t just about reacting when things break. It’s about designing with failure in mind. Every system will go down eventually, so the real question is how well prepared you are when it does.
That’s where high availability (HA) architectures and disaster recovery (DR) strategies come in. They determine whether your downtime is measured in seconds or hours, and whether you lose no data or a whole day’s worth of transactions.
In this post, we’ll walk through the most common database patterns, starting from basic backups all the way to complex, global active-active deployments.

Cold Standby
This is the simplest and cheapest disaster recovery option. An environment is provisioned but left turned off, and the database is not running. In the event of a failure, the standby must be powered on and the database restored from backups or brought up to date with shipped logs. Recovery is slow, often taking hours before the system is fully available.
Cold standby works best for non-critical systems where downtime is acceptable. The trade-off is high recovery time and high data loss risk: you’ll lose everything since the last backup or log transfer.
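To make that concrete, here is a minimal sketch of what a cold-standby recovery script might look like, assuming a PostgreSQL database restored with pg_restore from a custom-format dump. The host name, backup path, and the start_standby_host helper are hypothetical placeholders for whatever provisioning mechanism you actually use.

```python
import subprocess

# Hypothetical identifiers for illustration only.
STANDBY_HOST = "db-standby.internal"     # assumed standby host name
LATEST_BACKUP = "/backups/latest.dump"   # assumed pg_dump custom-format backup
TARGET_DB = "appdb"

def start_standby_host():
    """Placeholder: power on / provision the standby (cloud API, IPMI, etc.)."""
    print(f"Starting {STANDBY_HOST} ...")

def restore_from_backup():
    """Restore the latest logical backup onto the standby with pg_restore."""
    subprocess.run(
        ["pg_restore", "--host", STANDBY_HOST, "--dbname", TARGET_DB,
         "--clean", "--if-exists", LATEST_BACKUP],
        check=True,
    )

if __name__ == "__main__":
    start_standby_host()
    restore_from_backup()    # almost all of the hours of RTO live in this step
    print("Repoint the application at the standby and resume traffic.")
```

Notice that nearly all of the recovery time sits in the restore step, which is exactly why the RTO of this pattern is measured in hours.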
Warm Standby
This option reduces downtime by keeping the standby environment powered on and continuously updated through replication (log shipping, streaming replication, or similar). The database is online but not serving traffic until it’s promoted.
In a failure, the switch to the standby can be manual or automated, depending on your setup. This approach shortens recovery time compared to cold standby, but you still face some downtime while the system fails over. It’s a common middle ground when you need reasonable recovery but can live with a short interruption.
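If the failover is automated, the automation is usually a small supervisor that watches the primary and promotes the standby once it stops responding. The sketch below assumes PostgreSQL streaming replication, the psycopg2 driver, and a monitoring user allowed to call pg_promote() (PostgreSQL 12+); the host names are hypothetical, and a real deployment would normally delegate this to a cluster manager such as Patroni or repmgr rather than a hand-rolled loop.

```python
import time
import psycopg2

PRIMARY_DSN = "host=db-primary.internal dbname=appdb connect_timeout=3"  # assumed
STANDBY_DSN = "host=db-standby.internal dbname=appdb connect_timeout=3"  # assumed
FAILURES_BEFORE_FAILOVER = 3

def primary_is_healthy() -> bool:
    """Return True if the primary accepts connections and answers a trivial query."""
    try:
        conn = psycopg2.connect(PRIMARY_DSN)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
        finally:
            conn.close()
    except psycopg2.OperationalError:
        return False

def promote_standby():
    """Ask the warm standby to leave recovery and become the new primary."""
    conn = psycopg2.connect(STANDBY_DSN)
    try:
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("SELECT pg_promote()")   # PostgreSQL 12+
    finally:
        conn.close()
    print("Standby promoted; repoint DNS / connection strings at it.")

if __name__ == "__main__":
    failures = 0
    while failures < FAILURES_BEFORE_FAILOVER:
        failures = 0 if primary_is_healthy() else failures + 1
        time.sleep(5)
    promote_standby()
```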
Pilot Light
This approach is like keeping the smallest possible flame alive so it can be fanned into a full fire. Here, a minimal version of the environment is always running, just enough to be ready. In the event of a disaster, the system is scaled up into a full production environment by adding resources and promoting the standby database.
Pilot light is popular in the cloud because it saves money: you only pay to keep a small baseline alive, and you scale up only when needed. The trade-off is that recovery isn’t instant; while faster than cold standby, it still takes time to scale to full capacity.
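As one possible shape of this, suppose the pilot light is a small managed read replica kept in the DR Region on AWS RDS. Scaling up then means promoting it and resizing it, roughly as sketched below with boto3; the instance identifier, Region, and instance class are made-up examples.

```python
import boto3

# Hypothetical names for illustration; the DR Region and sizes are assumptions.
rds = boto3.client("rds", region_name="eu-west-1")
PILOT_LIGHT_INSTANCE = "appdb-dr-replica"   # small replica kept running
FULL_SIZE_CLASS = "db.r6g.2xlarge"          # production-sized instance class

def activate_pilot_light():
    waiter = rds.get_waiter("db_instance_available")

    # 1. Promote the small replica so it can accept writes, and wait for it.
    rds.promote_read_replica(DBInstanceIdentifier=PILOT_LIGHT_INSTANCE)
    waiter.wait(DBInstanceIdentifier=PILOT_LIGHT_INSTANCE)

    # 2. Scale it up to production capacity; this is where most of the RTO goes.
    rds.modify_db_instance(
        DBInstanceIdentifier=PILOT_LIGHT_INSTANCE,
        DBInstanceClass=FULL_SIZE_CLASS,
        ApplyImmediately=True,
    )
    waiter.wait(DBInstanceIdentifier=PILOT_LIGHT_INSTANCE)

if __name__ == "__main__":
    activate_pilot_light()
```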
Multi-Site Architectures
Multi-site architectures extend high availability across regions or datacenters. Instead of keeping everything in a single location, these setups allow multiple sites to share responsibility for serving traffic. The exact behavior depends on how reads and writes are distributed, and each model comes with different trade-offs in terms of latency, consistency, and complexity.
Read Local – Write Local
In this pattern, all requests are served by the local Region: both reads and writes happen where the user is. This minimizes latency and reduces exposure to cross-Region network failures, because everything stays close to the user.
The challenge is that if two Regions update the same record at the same time, you get a write conflict. Most systems resolve this by using a “last writer wins” rule, which means the earlier update is silently overwritten by the later one. That may be fine for workloads like shopping carts or user preferences, but it’s unacceptable for financial transactions or systems that demand strong consistency.
Use this when ultra-low latency matters and your application can tolerate eventual consistency or occasional overwrites.
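To see what “last writer wins” means in code, here is a small, self-contained sketch of a replica resolving two concurrent updates by timestamp. The Record type and field names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    value: str
    updated_at: float   # wall-clock timestamp attached by the writing Region
    region: str

def resolve_conflict(local: Record, incoming: Record) -> Record:
    """Last-writer-wins: keep whichever version carries the later timestamp.

    The losing write is silently discarded, which is why this is fine for
    shopping carts but dangerous for financial data.
    """
    return incoming if incoming.updated_at > local.updated_at else local

# Two Regions update the same cart at nearly the same time.
us = Record("cart:42", "3 items", updated_at=1700000000.120, region="us-east-1")
eu = Record("cart:42", "2 items", updated_at=1700000000.450, region="eu-west-1")
print(resolve_conflict(us, eu).value)   # "2 items" wins; the US update is lost
```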
Read Local – Write Global
Here, only one Region is allowed to handle writes, while all other Regions serve as read replicas. Users still get low-latency reads locally, but every write has to be routed back to the global write Region.
This avoids conflicts entirely, since there’s only one source of truth for writes, but it comes at the cost of higher latency for global users who are far from the primary write Region. Some systems add write forwarding to make it look like any Region can accept writes, but in reality those writes are still sent to the global writer behind the scenes.
Use this when consistency is more important than write latency, and when your workload can handle the delay of routing all writes through one Region.
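The routing rule behind this pattern is simple enough to sketch: reads go to the nearest replica, and anything that mutates data is forwarded to the single global writer. The endpoint names below are hypothetical, and the crude keyword check stands in for whatever your driver or proxy uses to classify statements.

```python
# Hypothetical endpoints; in practice these would come from service discovery.
GLOBAL_WRITER = "db-writer.us-east-1.internal"
LOCAL_REPLICAS = {
    "us-east-1": "db-replica.us-east-1.internal",
    "eu-west-1": "db-replica.eu-west-1.internal",
    "ap-south-1": "db-replica.ap-south-1.internal",
}

def route(statement: str, caller_region: str) -> str:
    """Return the endpoint a statement should be sent to.

    Reads stay local; anything that mutates state is forwarded to the one
    global write Region, which is where the extra latency comes from.
    """
    is_write = statement.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    return GLOBAL_WRITER if is_write else LOCAL_REPLICAS[caller_region]

print(route("SELECT * FROM orders WHERE id = 7", "eu-west-1"))               # local replica
print(route("UPDATE orders SET status = 'paid' WHERE id = 7", "eu-west-1"))  # global writer
```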
Read Local – Write Partitioned
This pattern assigns each record a “home Region,” where both reads and writes for that record happen. The assignment can be done dynamically (based on where the record was first created) or statically (based on a partition key like user ID or customer geography).
By mapping each record to a home Region, you avoid global write conflicts while still keeping writes close to where they’re most likely to occur. The trade-off is complexity: the application needs to know which Region owns which records and route requests accordingly.
Use this for global, write-heavy workloads where you want low write latency but can tolerate the added complexity of partitioning.
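One simple static scheme is to derive the home Region from the partition key itself, for example by hashing it. The sketch below is illustrative; a real system would usually keep an explicit ownership table as well, so records can be re-homed without remapping everything.

```python
import hashlib

# Illustrative Region list; the names are assumptions, not a recommendation.
REGIONS = ["us-east-1", "eu-west-1", "ap-south-1"]

def home_region(partition_key: str) -> str:
    """Statically assign a record's home Region from its partition key."""
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return REGIONS[int(digest, 16) % len(REGIONS)]

# Both reads and writes for a given customer are routed to its home Region.
print(home_region("customer:1001"))
print(home_region("customer:1002"))
```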
Sharding and Partitioning
Sharding, or partitioning, isn’t really a high availability strategy but rather a scalability pattern. In this approach, the data is divided into shards, with each shard managed by a different node. To improve resilience, sharding is often combined with replication so that each shard is protected against failures. A common example is assigning user IDs 1 to 1 million to one shard and the next million to another.
Sharding is most useful when you need to scale out beyond the limits of a single database. The downside is that working across shards can be painful. Queries and joins that span multiple partitions quickly become complicated and expensive.
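Using the user-ID example above, the routing layer only needs to turn an ID into the shard that owns its range; the connection strings below are placeholders.

```python
# Shard boundaries matching the example above: IDs 1 to 1,000,000 on shard 0,
# the next million on shard 1, and so on. Connection strings are hypothetical.
SHARD_SIZE = 1_000_000
SHARDS = [
    "postgresql://shard0.internal/appdb",
    "postgresql://shard1.internal/appdb",
]

def shard_for_user(user_id: int) -> str:
    """Map a user ID to the shard (node) that owns its range."""
    index = (user_id - 1) // SHARD_SIZE
    return SHARDS[index]

print(shard_for_user(42))         # shard 0
print(shard_for_user(1_500_000))  # shard 1
# A query that spans both ranges has to fan out to every shard and merge the
# results in the application, which is where cross-shard joins get painful.
```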
RPO and RTO
Before choosing an architecture, you need to ask two questions:
RPO (Recovery Point Objective): How much data can we afford to lose?
- Cold standby – Hours (last backup).
- Warm standby – Seconds to minutes (replication lag).
- Multi-primary – Seconds or less.
RTO (Recovery Time Objective): How long can we afford to be down?
- Cold standby – Hours.
- Warm standby – Minutes.
- Multi-primary – Seconds.
These two numbers, RPO and RTO, are what really decide which architecture fits your system. If the business says “we can’t lose a single transaction and downtime must be under 1 minute,” then you’re not picking cold standby. You’re looking at multi-primary or another architecture with synchronous replication.
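As a toy way to frame that decision, the sketch below filters the patterns in this post against a required RPO and RTO. The per-pattern numbers are rough, assumed figures for comparison, not guarantees.

```python
def candidate_architectures(max_data_loss_s: int, max_downtime_s: int) -> list:
    """Filter the patterns above by rough, indicative RPO/RTO figures (assumed)."""
    patterns = [
        # (name, typical RPO in seconds, typical RTO in seconds)
        ("cold standby",  24 * 3600, 4 * 3600),
        ("pilot light",   15 * 60,   60 * 60),
        ("warm standby",  5 * 60,    10 * 60),
        ("multi-primary", 0,         30),
    ]
    return [name for name, rpo, rto in patterns
            if rpo <= max_data_loss_s and rto <= max_downtime_s]

# "We can't lose a single transaction and downtime must be under 1 minute."
print(candidate_architectures(max_data_loss_s=0, max_downtime_s=60))
# A reporting system that can lose a day and sit idle overnight has more options.
print(candidate_architectures(max_data_loss_s=24 * 3600, max_downtime_s=8 * 3600))
```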
Summary: Database Architectures at a Glance
- Cold Standby: A standby environment kept off until needed.
- Warm Standby: Replica online, promoted on failover.
- Pilot Light: Minimal system always running, scaled up on failure.
- Multi-Site: Multiple Regions participating in a deployment, with three common read/write patterns:
  - Read Local – Write Local: Fastest, but conflicts possible.
  - Read Local – Write Global: Safest for consistency, but write latency can be high.
  - Read Local – Write Partitioned: Balances latency and consistency by assigning ownership of data to Regions.
- Sharding and Partitioning: Data split across nodes for scale.
Final Thoughts
There’s no single “best” database architecture. Every option is a trade-off between cost, complexity, and the balance you strike between RPO and RTO. If the budget is tight, a cold standby or pilot light setup may be enough. For mission-critical systems where downtime is unacceptable, a multi-primary deployment is the stronger choice. And if the priority is fast global reads with a single source of truth for writes, then a read local-write global approach is often the right fit.
The best design is the one that matches the needs of the business, not just technical preferences. At three in the morning, when production is down, the only thing that really matters is whether users can get back online quickly and whether you can rest easy knowing the data is safe.
