Commit 9eefd5dd authored by Matteo Melli's avatar Matteo Melli

Update failover.md

parent 62273024
@@ -338,37 +338,9 @@ state of the secondary to converge.
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Staging: `postgres-01.db.gstg.gitlab.com`
* Production: `postgres-01-db-gprd.c.gitlab-production.internal`
* Create tombstone database and table
* Insert a tombstone record and check that replication lag is under 10s. If synchronous replication (SR) is enabled, the 10s threshold can be lowered; double-check after turning SR on.
* Create tombstone database and table
```shell
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql postgres \
-c "drop database if exists tombstone; create database tombstone"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql tombstone \
-c "create table if not exists tombstone (created_at timestamptz default now() primary key, note text)"
```
* Insert a tombstone record and check that replication lag is under 10s. If synchronous replication (SR) is enabled, the 10s threshold can be lowered; double-check after turning SR on.
```shell
tombstone_msg=$(date +'%Y%m%d_%H%M%S')"_${ENVIRONMENT}"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql tombstone -c "insert into tombstone(note) values('${tombstone_msg}') returning *"
# wait until the change is propagated
while true
do
find_new_msg="$(ssh_remote "$GCP_MASTER_CANDIDATE" sudo gitlab-psql -Atd tombstone -c "select created_at from tombstone where note = '$tombstone_msg'")"
if [[ -z "${find_new_msg+x}" ]] || [[ "$find_new_msg" == "" ]]
then
gcp_cur_rep_delay="$(ssh_remote "$GCP_MASTER_CANDIDATE"
sudo gitlab-psql -Atd postgres -c "select round(extract(epoch from (now() - pg_last_xact_replay_timestamp())))")"
echo "New tombstone message is not seen on $GCP_MASTER_CANDIDATE (GCP MASTER CANDIDATE). The replication delay: ${gcp_cur_rep_delay}s. Wait 3 seconds..."
sleep 3
else
echo "New tombstone message arrived to $GCP_MASTER_CANDIDATE."
break
fi
done
```
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
@@ -419,199 +391,17 @@ of errors while it is being promoted.
- [ ] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
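Once the record is updated, a quick spot check with `dig` can confirm it (a sketch; the hostname queried is only an illustrative name matched by the wildcard):
```shell
# Expect 35.185.44.232 once the new wildcard record has propagated (mind the TTL).
# "example-group.gitlab.io" is an illustrative hostname covered by *.gitlab.io.
dig +short example-group.gitlab.io A
```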
1. [ ] 🐘 {+ Database-Wrangler +}: Disable chef on all nodes and shut down consul agents and repmgrd
```shell
#Chef
for host in "${AZURE_HOSTS[@]}" "${GCP_HOSTS[@]}"; do
echo "Stopping chef on $host"
ssh_remote "$host" sudo service chef-client stop
ssh_remote "$host" sudo mv /etc/chef /etc/chef.migration
done
#Consul
for host in "${AZURE_PGBOUNCERS[@]}" "${GCP_PGBOUNCERS[@]}" "${AZURE_HOSTS[@]}" "${GCP_HOSTS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
#Repmgr
for host in "${GCP_HOSTS[@]}"
do
echo "Stopping repmgrd on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/repmgrd
done
for host in "${AZURE_HOSTS[@]}"
do
if [ "$AZURE_MASTER" == "$host" ]
then
continue
fi
echo "Stopping repmgrd on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/repmgrd
done
echo "Stopping consul on $AZURE_MASTER"
ssh_remote "$AZURE_MASTER" sudo sv stop /opt/gitlab/sv/repmgrd
# Reset repmgr failover state
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql -d gitlab_repmgr -c \
"TRUNCATE repmgr_gitlab_cluster.repl_nodes"
```
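As an optional spot check (a sketch, not part of the original procedure), confirm the stopped agents actually report `down` before moving on:
```shell
# Sketch: every database node should now report consul and repmgrd as "down".
for host in "${AZURE_HOSTS[@]}" "${GCP_HOSTS[@]}"
do
echo "--- $host ---"
ssh_remote "$host" sudo sv status /opt/gitlab/sv/consul /opt/gitlab/sv/repmgrd
done
```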
1. [ ] 🐘 {+ Database-Wrangler +}: Convert the current master (Azure) to a standby.
* Convert the current master (Azure) to a headless standby (empty `primary_conninfo` and `restore_command`).
```shell
echo "standby_mode = 'on'
recovery_target_timeline = 'latest'" | \
ssh_remote "$AZURE_MASTER" sudo tee /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chown postgres:postgres /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chmod 600 /var/lib/opt/gitlab/postgresql/data/recovery.conf
# Runit (sv) sends SIGTERM (smart shutdown that waits for connections to close)
# instead of SIGINT (fast shutdown that closes connections). This workaround
# lets us stop PostgreSQL faster and more cleanly: we issue sv stop with a small
# timeout to prevent runit from trying to restart the service after SIGINT is sent.
ssh_remote "$AZURE_MASTER" sudo sv -W 1 stop /opt/gitlab/sv/postgres \
|| (ssh_remote "$AZURE_MASTER" sudo sv int /opt/gitlab/sv/postgres \
&& ssh_remote "$AZURE_MASTER" sudo sv -W 60 stop /opt/gitlab/sv/postgres)
```
* Convert the current master (Azure) to a standby pointing to the candidate master on GCP.
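A minimal sketch of what this step could look like (not from the original runbook: the replication user and port below are placeholders and must match the environment's real replication settings):
```shell
# Sketch: point the Azure master's recovery.conf at the GCP candidate and start it as a standby.
# "gitlab_repmgr" and port 5432 are illustrative placeholders, not values taken from this runbook.
echo "standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_conninfo = 'host=${GCP_MASTER_CANDIDATE} port=5432 user=gitlab_repmgr'" | \
ssh_remote "$AZURE_MASTER" sudo tee /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chown postgres:postgres /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chmod 600 /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo sv start /opt/gitlab/sv/postgres
```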
* Check the database is now read-only
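For example, mirroring the read-write check used later in this runbook (the result should be `t` on the Azure node):
```shell
# On the old Azure master (now a standby) pg_is_in_recovery() should return "t".
ssh_remote "$AZURE_MASTER" sudo gitlab-psql -d gitlabhq_production -c "select pg_is_in_recovery();"
```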
* Wait for the GCP master candidate and previous Azure master (now standby) to have same LSN
```shell
# Then we wait for the GCP master candidate and previous Azure master (now standby) to have same LSN
while true
do
azure_master_lsn="$(ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql postgres \
-t -A -c "select case when pg_is_in_recovery()
then pg_last_xlog_replay_location()
else pg_current_xlog_location() end;")"
gcp_master_candidate_lsn="$(ssh_remote "$GCP_MASTER_CANDIDATE" sudo -u gitlab-psql gitlab-psql postgres \
-t -A -c "select case when pg_is_in_recovery()
then pg_last_xlog_replay_location()
else pg_current_xlog_location() end;")"
if [ "$azure_master_lsn" == "$gcp_master_candidate_lsn" ]
then
echo "GCP and Azure have same LSN: $azure_master_lsn"
return 0
fi
echo "Wait GCP and Azure have same LSN. Current LSNs are: Azure/$azure_master_lsn GCP/$gcp_master_candidate_lsn"
sleep 3
done
```
1. [ ] 🐘 {+ Database-Wrangler +}: Perform regular switchover to the main replica on GCP
```shell
# WARNING WARNING WARNING here switchover happens!
ssh_remote "$GCP_MASTER_CANDIDATE" sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_ctl \
promote -D /var/lib/opt/gitlab/postgresql/data
```
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
* Connect to the newly promoted primary in GCP (The result should be `f`)
```shell
sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"
```
1. [ ] 🐘 {+ Database-Wrangler +}: Start repmgrd on GCP
```shell
#Repmgrd
echo "Register $GCP_MASTER_CANDIDATE as master with repmgr"
ssh_remote "$GCP_MASTER_CANDIDATE" sudo gitlab-ctl repmgr register master
for host in "${GCP_HOSTS[@]}"
do
if [ "$GCP_MASTER_CANDIDATE" == "$host" ]
then
continue;
fi
echo "Register $host as standby with repmgr"
ssh_remote "$host" sudo gitlab-ctl repmgr register standby
done
echo "Starting repmgrd on $GCP_MASTER_CANDIDATE"
ssh_remote "$GCP_MASTER_CANDIDATE" sudo sv start /opt/gitlab/sv/repmgrd
for host in "${GCP_HOSTS[@]}"
do
if [ "$GCP_MASTER_CANDIDATE" == "$host" ]
then
continue;
fi
echo "Starting repmgrd on $host"
ssh_remote "$host" sudo sv start /opt/gitlab/sv/repmgrd
done
```
1. [ ] 🐘 {+ Database-Wrangler +}: Check repmgr master on GCP
```shell
for host in "${GCP_PGBOUNCERS[@]}"
do
echo "Check pgbouncer on $host"
ssh_remote "$host" gitlab-ctl pgb-console -c "SHOW DATABASES"
ssh_remote "$host" gitlab-ctl pgb-console -c "SHOW SERVERS"
read -s -N 1 -p "Press [y] to continue, any other key to abort." key
if [ "$key" != "y" ]
then
return 1
fi
done
```
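In addition to the pgbouncer consoles, repmgr's own view of the cluster can be checked directly (a sketch; assumes the omnibus `gitlab-ctl repmgr` wrapper used elsewhere in this runbook is available):
```shell
# Sketch: the cluster view should list the GCP candidate with the "master" role.
ssh_remote "$GCP_MASTER_CANDIDATE" sudo gitlab-ctl repmgr cluster show
```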
1. [ ] 🐘 {+ Database-Wrangler +}: Start consul on GCP
```shell
#Consul
for host in "${GCP_HOSTS[@]}"
do
echo "Starting consul agent on $host"
ssh_remote "$host" sudo sv start /opt/gitlab/sv/consul
done
for host in "${GCP_PGBOUNCERS[@]}"
do
echo "Starting consul agent on $host"
ssh_remote "$host" sudo sv start /opt/gitlab/sv/consul
done
```
1. [ ] 🐘 {+ Database-Wrangler +}: Check pgbouncer is connecting on GCP
```shell
for host in "${GCP_PGBOUNCERS[@]}"
do
echo "Check pgbouncer on $host"
ssh_remote "$host" gitlab-ctl pgb-console -c "SHOW DATABASES"
ssh_remote "$host" gitlab-ctl pgb-console -c "SHOW SERVERS"
read -s -N 1 -p "Press [y] to continue, any other key to abort." key
if [ "$key" != "y" ]
then
return 1
fi
done
```
* Check the database is now read-write
1. [ ] 🐘 {+ Database-Wrangler +}: Start repmgrd and consul agents on GCP
* Start repmgrd on GCP
* Check repmgr master on GCP
* Start consul on GCP
* Check pgbouncer is connecting on GCP
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration according to:
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
@@ -723,8 +513,6 @@ unexpected ways.
#### Phase 8: Reconfiguration, Part 2
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
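For the running database, the update might look like the following sketch (the node name and priority value are illustrative placeholders, not values from this runbook; repmgr stores per-node failover priority in its `repl_nodes` table):
```shell
# Sketch: adjust repmgr failover priority for one node in the running cluster.
# 'WAL_E_NODE_NAME' and the value 100 are placeholders only.
sudo -u gitlab-psql gitlab-psql -d gitlab_repmgr -c \
"update repmgr_gitlab_cluster.repl_nodes set priority = 100 where name = 'WAL_E_NODE_NAME'"
```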