MySQL replication gives you redundancy — replicas hold copies of the primary's data and can serve reads. But replication alone doesn't give you high availability. When the primary fails, something needs to detect the failure, pick the best replica to promote, re-point the other replicas at the new primary, and update whatever is routing application traffic. Done manually, that process takes 10–30 minutes. Done wrong, it causes data loss.

Orchestrator is the de facto standard tool for automating this. Created by Shlomi Noach at Outbrain and later developed at GitHub, it is open source and continuously maps your replication topology, detects failures, and executes failover safely.

What Orchestrator Does

Orchestrator connects to your MySQL instances via a dedicated account and continuously polls them to build a live map of the replication topology. It knows which server is the primary, which are replicas, how far behind each replica is, and the GTID state of every node.
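
How often Orchestrator polls is configurable. A minimal fragment showing the relevant key, with its shipped default of 5 seconds:

```json
{
  "InstancePollSeconds": 5
}
```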

When a primary disappears, Orchestrator:

  1. Confirms the failure holistically — it checks whether the replicas also lost contact with the primary, which avoids false positives from brief network hiccups between Orchestrator and the primary.
  2. Picks the most up-to-date replica (lowest replication lag, most complete executed GTID set).
  3. Re-points the remaining replicas to replicate from the promoted node.
  4. Runs your hook scripts — where you update ProxySQL, DNS, or Consul to redirect application traffic.
  5. Marks the old primary as crashed and removes it from the topology.

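As a toy illustration of step 2, here is a minimal shell sketch that picks the lowest-lag candidate from a made-up host:lag list. Real Orchestrator also compares executed GTID sets and applies promotion rules, so treat this purely as an illustration of the idea:

```shell
#!/bin/bash
# Hypothetical host:lag_seconds pairs — not real Orchestrator output
replicas="replica-1:2 replica-2:0 replica-3:5"

# Sort candidates by lag and take the lowest
# (Orchestrator additionally verifies GTID completeness before promoting)
best=$(for r in $replicas; do echo "${r##*:} ${r%%:*}"; done | sort -n | head -1 | cut -d' ' -f2)
echo "promote: $best"
```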
Installation

# Download the latest release (replace version as needed)
wget https://github.com/openark/orchestrator/releases/download/v3.2.6/orchestrator-3.2.6-linux-amd64.tar.gz
tar xzf orchestrator-3.2.6-linux-amd64.tar.gz -C /usr/local/

# Or via package on Debian/Ubuntu
wget https://github.com/openark/orchestrator/releases/download/v3.2.6/orchestrator_3.2.6_amd64.deb
dpkg -i orchestrator_3.2.6_amd64.deb

systemctl enable --now orchestrator

Orchestrator stores its topology data in its own backend database. That can be MySQL, or SQLite for simple single-node setups.
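
For a single-node setup, the SQLite backend avoids running a separate MySQL instance just for Orchestrator. A sketch of the relevant keys (the file path is an example — put it wherever suits your layout):

```json
{
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db"
}
```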

Minimal Configuration

Edit /etc/orchestrator/orchestrator.conf.json:

{
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "strong_password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator",
  "MySQLOrchestratorPassword": "strong_password",

  "RecoveryPeriodBlockSeconds": 3600,
  "RecoverMasterClusterFilters": ["*"],
  "FailureDetectionPeriodBlockMinutes": 60,

  "OnFailureDetectionProcesses": [
    "echo 'Failure detected: {failureType} on {failedHost}' >> /var/log/orchestrator-events.log"
  ],
  "PostMasterFailoverProcesses": [
    "/usr/local/bin/failover-hook.sh {successorHost} {successorPort}"
  ]
}

Create the topology user on every MySQL instance in your cluster:

CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'strong_password';
GRANT SUPER, PROCESS, REPLICATION SLAVE, RELOAD ON *.* TO 'orchestrator'@'%';
GRANT SELECT ON mysql.slave_master_info TO 'orchestrator'@'%';
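
Discovery works best when each replica announces itself to the primary. If replicas fail to show up in the topology, check that report_host is set in each instance's my.cnf (hostnames below are placeholders):

```ini
# my.cnf on each replica — makes the replica visible via SHOW SLAVE HOSTS
report_host = replica-1.example.com
report_port = 3306
```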

Discovering Your Topology

Point Orchestrator at your primary and it will crawl the rest of the topology automatically:

# Via CLI
orchestrator-client -c discover -i primary-host:3306

# Check what it found
orchestrator-client -c topology -i primary-host:3306

The topology output shows the full replication tree with lag and GTID status for each node. The web UI (port 3000 by default) renders this as an interactive graph.
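
The CLI renders the tree as compact ASCII; the output looks roughly like this (hostnames, versions, and lag values are made up):

```
primary-host:3306   [0s,ok,8.0.32,rw,ROW,>>,GTID]
+ replica-1:3306    [0s,ok,8.0.32,ro,ROW,>>,GTID]
+ replica-2:3306    [1s,ok,8.0.32,ro,ROW,>>,GTID]
```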

Automated vs. Manual Failover

Orchestrator supports both modes. Automated failover is controlled by RecoverMasterClusterFilters in the config. Setting it to ["*"] enables auto-recovery for all clusters. You can limit it to specific cluster names if you want manual control over some topologies.
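
For example, to auto-recover only some clusters and leave the rest to manual intervention (the cluster names here are placeholders):

```json
{
  "RecoverMasterClusterFilters": ["production", "payments"]
}
```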

For a manual failover — planned maintenance, for example — use:

# Graceful primary switch (primary is healthy, you're choosing to move)
orchestrator-client -c graceful-master-takeover-auto -i primary-host:3306

# Force an immediate failover even when no failure has been detected
orchestrator-client -c force-master-failover -i primary-host:3306
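
To hand the primary role to a specific replica rather than letting Orchestrator choose, graceful takeover also accepts a designated successor via -d (hostnames are placeholders):

```shell
orchestrator-client -c graceful-master-takeover -i primary-host:3306 -d replica-2:3306
```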

The Failover Hook

The most important part of your Orchestrator setup is the hook script that runs after a failover. This is where you update the rest of your stack to point at the new primary. A typical hook for a ProxySQL + Orchestrator setup:

#!/bin/bash
# /usr/local/bin/failover-hook.sh
# Orchestrator passes variables as positional args: {successorHost} {successorPort}
NEW_MASTER=$1
NEW_PORT=$2

# Update ProxySQL to point writes at the new primary
mysql -u admin -padmin -h 127.0.0.1 -P 6032 <<EOF
-- hostgroup 10 is assumed to be the writer hostgroup; adjust for your setup
UPDATE mysql_servers SET hostname='${NEW_MASTER}', port=${NEW_PORT} WHERE hostgroup_id=10;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
EOF

echo "Failover complete: new primary is ${NEW_MASTER}:${NEW_PORT}" >> /var/log/orchestrator-events.log
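
Exercise the hook by hand before trusting it in an emergency. A sketch, assuming ProxySQL's admin interface on port 6032 with the default admin credentials:

```shell
# Simulate a failover by invoking the hook manually (hostname is a placeholder)
/usr/local/bin/failover-hook.sh replica-2.example.com 3306

# Confirm ProxySQL's runtime config now routes writes to the new primary
mysql -u admin -padmin -h 127.0.0.1 -P 6032 \
  -e "SELECT hostgroup_id, hostname, port FROM runtime_mysql_servers;"
```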

Monitoring the Topology

Check overall cluster health at any time:

# List all known clusters
orchestrator-client -c clusters

# List replicas of a given instance
orchestrator-client -c which-replicas -i primary-host:3306

# Check if a specific instance is replicating OK
orchestrator-client -c instance -i replica-host:3306 | jq '.ReplicationLagSeconds, .Slave_SQL_Running'

Orchestrator also exposes a REST API and a metrics endpoint that integrates with Prometheus, so you can alert on replication lag or unhealthy nodes alongside your other infrastructure metrics.
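
A couple of useful API calls, assuming the default HTTP port of 3000:

```shell
# Liveness of the orchestrator service itself
curl -s http://localhost:3000/api/health

# All instances Orchestrator knows about in a given cluster
curl -s http://localhost:3000/api/cluster/primary-host:3306 | jq '.[].Key.Hostname'
```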

Conclusion

Orchestrator turns MySQL failover from a stressful manual procedure into a well-defined automated process. The key to making it work reliably is the hook script — that's where your specific topology's routing logic lives, whether you're using ProxySQL, HAProxy, Route 53, or Consul. Get the hook right, test it against a planned failover, and you'll have a system that can recover from a primary failure in under two minutes without waking anyone up.


Questions about Orchestrator configuration or integrating it with your routing layer? Reach out.