Commit eb56736d authored by Matteo Melli

Adapted migration script to new scripts structure

Enable migration to use script files and script functions.
Created scripts to execute steps inside scripts folder tree.
Merged production and staging steps and added support to specify a role
(associated to steps_$role).
Sourced source_vars into migration script.
parent 4d374cdf
Pipeline #88514 failed with stage in 47 seconds
@@ -331,10 +331,8 @@ state of the secondary to converge.
* On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Staging: `postgres-01.db.gstg.gitlab.com`
* Production: `postgres-01-db-gprd.c.gitlab-production.internal`
* Create tombstone database and table
* Insert tombstone record and check that lag is under 10s. If SR is enabled, the 10s threshold can be lowered. Double-check after turning SR on.
* Create tombstone database and table `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p03/050-create-tombstone-table.sh`
* Insert tombstone record in Azure master and check that it arrives on the GCP master candidate. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p03/051-check-gcp-replication-delay.sh`
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
@@ -386,17 +384,22 @@ of errors while it is being promoted.
- [ ] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
1. [ ] 🐘 {+ Database-Wrangler +}: Disable chef on all nodes and shut down consul agents and repmgrd
* Disable chef on all nodes. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/040-disable-chef.sh`
* Disable consul agents on all nodes. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/041-disable-consul.sh`
* Disable automatic failover on all nodes. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/042-disable-automatic-failover.sh`
* Reset automatic failover state. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/043-reset-automatic-failover-state.sh`
1. [ ] 🐘 {+ Database-Wrangler +}: Convert the current master (Azure) to a standby.
* Convert the current master (Azure) to a standby pointing to candidate master on GCP.
* Check the database is now read-only
* Wait for the GCP master candidate and previous Azure master (now standby) to have the same LSN
* Convert the current master (Azure) to a standby pointing to candidate master on GCP. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/050-convert-azure-master-to-standby.sh`
* Check the database is now read-only `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/051-check-azure-master-is-standby.sh`
* Wait for the GCP master candidate and previous Azure master (now standby) to have the same LSN `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/052-check-gcp-nodes-has-same-azure-lsn.sh`
1. [ ] 🐘 {+ Database-Wrangler +}: Perform regular switchover to the main replica on GCP
* Check the database is now read-write
* Perform GCP candidate promote. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/060-perform-gcp-candidate-master-promote.sh`
* Check the database is now read-write. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/061-check-gcp-candidate-master-is-master.sh`
1. [ ] 🐘 {+ Database-Wrangler +}: Start repmgrd and consul agents on GCP
* Start repmgrd on GCP
* Check repmgr master on GCP
* Start consul on GCP
* Check pgbouncer is connecting on GCP
* Enable automatic failover on GCP. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/070-enable-automatic-failover-on-gcp-only.sh`
* Check repmgr master on GCP. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/071-check-repmgr-master.sh`
* Start consul on GCP. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/072-enable-consul-on-gcp-only.sh`
* Check pgbouncer is connecting on GCP. `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p04/073-check-pgbouncer-node-in-gcp.sh`
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
......
@@ -124,8 +124,8 @@
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure repmgr has desired states on all database nodes
* Ensure repmgr has desired states on all database nodes. `/opt/gitlab-migration/bin/scripts/01_preflight/050_configuration_checks/110-check-repmgr-state.sh`
## Ensure Geo replication is up to date
......
#!/bin/bash
shopt -s expand_aliases
alias ssh_remote="ssh "
source env_${1}
ssh_remote "${AZURE_MASTER}" $(cat << EOF
cd /tmp;
sudo -u gitlab-psql gitlab-ctl repmgr cluster show
EOF
)
\ No newline at end of file
#!/bin/bash
shopt -s expand_aliases
alias ssh_remote="ssh "
source env_${1}
ssh_remote "${AZURE_MASTER}" $(cat << EOF
cd /tmp;
sudo -u gitlab-psql gitlab-ctl repmgr cluster show
EOF
)
export steps=(
000_get-rid-of-could-not-change-directory-to-message
001_create-tombstone-table
002_check-gcp-replication-delay
003_disable-chef
004_disable-consul
005_disable-automatic-failover
006_convert-azure-master-to-standby
007_check-gcp-nodes-has-same-azure-lsn
008_perform-gcp-candidate-master-promote
009_check-gcp-candidate-master-is-master
010_enable-automatic-failover-on-gcp-only
011_check-repmgr-master
012_enable-consul-on-gcp-only
013_enable-chef-on-gcp-only
014_restore-could-not-change-directory-to-message
)
function get-rid-of-could-not-change-directory-to-message(){
for host in "${AZURE_HOSTS[@]}"
do
ssh_remote "$host" bash -c 'chmod o+x "$HOME"'
done
}
function create-tombstone-table(){
echo "Create tombstone database and table if not already existing"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql postgres \
-c "drop database if exists tombstone; create database tombstone"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql tombstone \
-c "create table if not exists tombstone (created_at timestamptz default now() primary key, note text)"
}
function check-gcp-replication-delay(){
tombstone_msg=$(date +'%Y%m%d_%H%M%S')"_${ENVIRONMENT}"
echo "Insert '$tombstone_msg' into tombstone"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql tombstone -c "insert into tombstone(note) values('${tombstone_msg}') returning *"
# wait until the change is propagated
while true
do
find_new_msg="$(ssh_remote "$GCP_MASTER_CANDIDATE" sudo gitlab-psql -Atd tombstone -c "select created_at from tombstone where note = '$tombstone_msg'")"
if [[ -z ${find_new_msg+x} ]] || [[ "$find_new_msg" == "" ]]
then
gcp_cur_rep_delay="$(ssh_remote "$GCP_MASTER_CANDIDATE" \
sudo gitlab-psql -Atd postgres -c "select round(extract(epoch from (now() - pg_last_xact_replay_timestamp())))")"
echo "New tombstone message is not seen on $GCP_MASTER_CANDIDATE (GCP MASTER CANDIDATE). The replication delay: ${gcp_cur_rep_delay}s. Wait 3 seconds..."
sleep 3
else
echo "New tombstone message arrived to $GCP_MASTER_CANDIDATE."
break
fi
done
}
function disable-chef(){
for host in "${AZURE_HOSTS[@]}" "${GCP_HOSTS[@]}"; do
echo "Stopping chef on $host"
ssh_remote "$host" sudo service chef-client stop
ssh_remote "$host" sudo mv /etc/chef /etc/chef.migration
done
}
function disable-consul(){
for host in "${AZURE_PGBOUNCERS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
for host in "${GCP_PGBOUNCERS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
for host in "${AZURE_HOSTS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
for host in "${GCP_HOSTS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
}
function disable-automatic-failover(){
for host in "${GCP_HOSTS[@]}"
do
echo "Stopping repmgrd on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/repmgrd
done
for host in "${AZURE_HOSTS[@]}"
do
if [ "$AZURE_MASTER" == "$host" ]
then
continue
fi
echo "Stopping repmgrd on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/repmgrd
done
echo "Stopping repmgrd on $AZURE_MASTER"
ssh_remote "$AZURE_MASTER" sudo sv stop /opt/gitlab/sv/repmgrd
}
function convert-azure-master-to-standby(){
echo "standby_mode = 'on'
recovery_target_timeline = 'latest'" | \
ssh_remote "$AZURE_MASTER" sudo tee /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chown postgres:postgres /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chmod 600 /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo sv -W 1 stop /opt/gitlab/sv/postgres \
|| (ssh_remote "$AZURE_MASTER" sudo sv int /opt/gitlab/sv/postgres \
&& ssh_remote "$AZURE_MASTER" sudo sv -W 60 stop /opt/gitlab/sv/postgres)
}
function check-gcp-nodes-has-same-azure-lsn(){
while true
do
azure_master_lsn="$(ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql postgres \
-t -A -c "select case when pg_is_in_recovery()
then pg_last_xlog_replay_location()
else pg_current_xlog_location() end;")"
gcp_master_candidate_lsn="$(ssh_remote "$GCP_MASTER_CANDIDATE" sudo -u gitlab-psql gitlab-psql postgres \
-t -A -c "select case when pg_is_in_recovery()
then pg_last_xlog_replay_location()
else pg_current_xlog_location() end;")"
if [ "$azure_master_lsn" == "$gcp_master_candidate_lsn" ]
then
echo "GCP and Azure have same LSN: $azure_master_lsn"
return 0
fi
echo "GCP and Azure have different LSN: Azure/$azure_master_lsn GCP/$gcp_master_candidate_lsn"
sleep 3
done
}
function perform-gcp-candidate-master-promote(){
ssh_remote "$GCP_MASTER_CANDIDATE" sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_ctl \
promote -D /var/lib/opt/gitlab/postgresql/data
}
function check-gcp-candidate-master-is-master(){
if ssh_remote "$GCP_MASTER_CANDIDATE" sudo -u gitlab-psql gitlab-psql postgres \
-t -A -c "select pg_is_in_recovery()" | grep -q 'f'
then
echo "$GCP_MASTER_CANDIDATE is master"
return 0
else
>&2 echo "$GCP_MASTER_CANDIDATE is not master"
return 1
fi
}
function enable-automatic-failover-on-gcp-only(){
echo "Register $GCP_MASTER_CANDIDATE as master with repmgr"
ssh_remote "$GCP_MASTER_CANDIDATE" sudo gitlab-ctl repmgr register master
for host in "${GCP_HOSTS[@]}"
do
if [ "$GCP_MASTER_CANDIDATE" == "$host" ]
then
continue;
fi
echo "Register $host as standby with repmgr"
ssh_remote "$host" sudo gitlab-ctl repmgr register standby
done
echo "Starting repmgrd on $GCP_MASTER_CANDIDATE"
ssh_remote "$GCP_MASTER_CANDIDATE" sudo sv start /opt/gitlab/sv/repmgrd
for host in "${GCP_HOSTS[@]}"
do
if [ "$GCP_MASTER_CANDIDATE" == "$host" ]
then
continue;
fi
echo "Starting repmgrd on $host"
ssh_remote "$host" sudo sv start /opt/gitlab/sv/repmgrd
done
}
function enable-consul-on-gcp-only(){
for host in "${GCP_HOSTS[@]}"
do
echo "Starting consul agent on $host"
ssh_remote "$host" sudo sv start /opt/gitlab/sv/consul
done
for host in "${GCP_PGBOUNCERS[@]}"
do
echo "Starting consul agent on $host"
ssh_remote "$host" sudo sv start /opt/gitlab/sv/consul
done
}
function enable-chef-on-gcp-only(){
# chef
for host in "${GCP_HOSTS[@]}"; do
echo "Starting chef-client on $host"
ssh_remote "$host" sudo mv /etc/chef.migration /etc/chef
ssh_remote "$host" sudo service chef-client start
done
}
function check-repmgr-master(){
echo "Checking state of $GCP_MASTER_CANDIDATE"
if ssh_remote "$GCP_MASTER_CANDIDATE" sudo -u gitlab-consul gitlab-ctl repmgr-check-master 2> /dev/null
then
echo "$GCP_MASTER_CANDIDATE is repmgr master"
return 0
else
>&2 echo "$GCP_MASTER_CANDIDATE is not repmgr master"
return 1
fi
}
function restore-could-not-change-directory-to-message(){
for host in "${AZURE_HOSTS[@]}"
do
ssh_remote "$host" bash -c 'chmod o-x "$HOME"'
done
}
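The `check-gcp-nodes-has-same-azure-lsn` function above compares the two LSNs as strings, which tells you whether they match but not how far apart they are. The following is a minimal sketch, not part of this commit, that converts an LSN to a byte position so the gap can be reported. It relies on PostgreSQL printing an LSN as two hexadecimal numbers separated by a slash, with the first number being the high 32 bits; the function names `lsn_to_int` and `lsn_lag_bytes` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the commit.

# Convert a PostgreSQL LSN such as "16/B374D848" to a decimal byte position.
lsn_to_int() {
  local hi="${1%%/*}" lo="${1##*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Print how many bytes the second LSN is behind the first (0 when equal).
lsn_lag_bytes() {
  echo $(( $(lsn_to_int "$1") - $(lsn_to_int "$2") ))
}
```

Inside the polling loop, something like `lsn_lag_bytes "$azure_master_lsn" "$gcp_master_candidate_lsn"` could replace the plain "different LSN" message with a remaining byte count.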
#!/bin/bash
export steps=(000_step1 001_step2)
function step1(){
echo "Running things inside"
}
function step2(){
echo "Running things inside"
}
#!/bin/bash
set -eu
[[ $# -lt 1 ]] && { echo "Specify the environment"; exit 1 ; }
[[ $# -lt 2 ]] && { echo "Specify the environment and role"; exit 1 ; }
# Some scripts read ENVIRONMENT and others read GITLAB_ENV, so export both
# for now; these should eventually be unified.
export ENVIRONMENT=$1
export ROLE=$2
export GITLAB_ENV=$ENVIRONMENT
source source_vars
source env_${ENVIRONMENT} # That is, .env_staging or .env_production (test also supported)
source utilities
source steps_${ENVIRONMENT}
source steps_${ROLE}
#Check all steps have a script
for step in "${steps[@]}"
do
if ! type "$(step_script "$step")" > /dev/null 2>&1
if ! step_check "$step"
then
>&2 echo "Function $(step_script "$step") does not exist for step $(step_3digit_number "$step")"
exit 1
fi
done
echo "menu"
do_menu
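The check loop above depends on `step_script`, `step_3digit_number` and `step_check`, which come from the sourced `utilities` file and are not shown in this diff. The sketch below is only a guess at how such helpers could behave for step names of the form `NNN_name`; the `SCRIPTS_DIR` variable and the `<name>.sh` layout are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical stand-ins for helpers defined in `utilities` (not shown here).

# "001_create-tombstone-table" -> "001"
step_3digit_number() {
  printf '%s\n' "${1%%_*}"
}

# "001_create-tombstone-table" -> "create-tombstone-table"
step_script() {
  printf '%s\n' "${1#*_}"
}

# Succeed if the step resolves to something callable (a shell function, or
# any command on PATH) or to an executable script under SCRIPTS_DIR.
# SCRIPTS_DIR is an assumed variable; it defaults to ./scripts.
step_check() {
  local name
  name="$(step_script "$1")"
  if command -v "$name" > /dev/null 2>&1; then
    return 0
  fi
  [ -x "${SCRIPTS_DIR:-./scripts}/${name}.sh" ]
}
```

Checking for a function first and a script file second would let the same `steps` array mix both styles, which matches the commit's stated goal of supporting script files and script functions.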
#!/bin/bash
set -eu
for host in "${AZURE_HOSTS[@]}" "${GCP_HOSTS[@]}"
do
echo "Checking repmgr state for host $host"
echo
ssh_remote "$host" sudo -u gitlab-psql gitlab-ctl repmgr cluster show
echo
done
#!/bin/bash
set -eu
echo "Create tombstone database and table if not already existing"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql postgres \
-c "drop database if exists tombstone; create database tombstone"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql tombstone \
-c "create table if not exists tombstone (created_at timestamptz default now() primary key, note text)"
#!/bin/bash
set -eu
tombstone_msg=$(date +'%Y%m%d_%H%M%S')"_${ENVIRONMENT}"
echo "Insert '$tombstone_msg' into tombstone"
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql tombstone -c "insert into tombstone(note) values('${tombstone_msg}') returning *"
# wait until the change is propagated
while true
do
find_new_msg="$(ssh_remote "$GCP_MASTER_CANDIDATE" sudo gitlab-psql -Atd tombstone -c "select created_at from tombstone where note = '$tombstone_msg'")"
if [[ -z "${find_new_msg+x}" ]] || [[ "$find_new_msg" == "" ]]
then
gcp_cur_rep_delay="$(ssh_remote "$GCP_MASTER_CANDIDATE" \
sudo gitlab-psql -Atd postgres -c "select round(extract(epoch from (now() - pg_last_xact_replay_timestamp())))")"
echo "New tombstone message is not seen on $GCP_MASTER_CANDIDATE (GCP MASTER CANDIDATE). The replication delay: ${gcp_cur_rep_delay}s. Wait 3 seconds..."
sleep 3
else
echo "New tombstone message arrived to $GCP_MASTER_CANDIDATE."
break
fi
done
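The polling loop above runs until the tombstone record shows up, so it never returns if replication is broken. One possible way to bound the wait is sketched below; `wait_until` is a hypothetical helper that is not part of this commit, and the timeout and interval values in the usage note are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper, not part of the commit.

# wait_until TIMEOUT INTERVAL CMD [ARGS...]
# Re-run CMD every INTERVAL seconds until it succeeds. Give up with a
# non-zero status once at least TIMEOUT seconds have been spent sleeping.
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${waited}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

The query against `$GCP_MASTER_CANDIDATE` could be wrapped in a small function that succeeds only when the row is found, and then called as `wait_until 60 3 that_function`. With `set -eu` in effect, a timeout would abort the step instead of hanging the migration menu.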
#!/bin/bash
set -eu
for host in "${AZURE_HOSTS[@]}" "${GCP_HOSTS[@]}"; do
echo "Stopping chef on $host"
ssh_remote "$host" sudo service chef-client stop
ssh_remote "$host" sudo mv /etc/chef /etc/chef.migration
done
\ No newline at end of file
#!/bin/bash
set -eu
for host in "${AZURE_PGBOUNCERS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
for host in "${GCP_PGBOUNCERS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
for host in "${AZURE_HOSTS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
for host in "${GCP_HOSTS[@]}"
do
echo "Stopping consul on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/consul
done
#!/bin/bash
set -eu
for host in "${GCP_HOSTS[@]}"
do
echo "Stopping repmgrd on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/repmgrd
done
for host in "${AZURE_HOSTS[@]}"
do
if [ "$AZURE_MASTER" == "$host" ]
then
continue
fi
echo "Stopping repmgrd on $host"
ssh_remote "$host" sudo sv stop /opt/gitlab/sv/repmgrd
done
echo "Stopping repmgrd on $AZURE_MASTER"
ssh_remote "$AZURE_MASTER" sudo sv stop /opt/gitlab/sv/repmgrd
#!/bin/bash
set -eu
ssh_remote "$AZURE_MASTER" sudo -u gitlab-psql gitlab-psql -d gitlab_repmgr -c \
"TRUNCATE repmgr_gitlab_cluster.repl_nodes"
#!/bin/bash
set -eu
echo "standby_mode = 'on'
primary_conninfo = 'user=gitlab_repmgr host=''$GCP_MASTER_CANDIDATE'' password=$GITLAB_REPMGR_PASSWORD port=5432 fallback_application_name=repmgr sslmode=prefer sslcompression=1 application_name=''$AZURE_MASTER'''
primary_slot_name = secondary_azureprd
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 32 \"%f\" \"%p\"'
recovery_target_timeline = 'latest'" | \
ssh_remote "$AZURE_MASTER" sudo tee /var/lib/opt/gitlab/postgresql/data/recovery.conf > /dev/null
ssh_remote "$AZURE_MASTER" sudo chown postgres:postgres /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo chmod 600 /var/lib/opt/gitlab/postgresql/data/recovery.conf
ssh_remote "$AZURE_MASTER" sudo sv -W 1 stop /opt/gitlab/sv/postgres \
|| (ssh_remote "$AZURE_MASTER" sudo sv int /opt/gitlab/sv/postgres \
&& ssh_remote "$AZURE_MASTER" sudo sv -W 60 stop /opt/gitlab/sv/postgres)
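The `recovery.conf` body above is built inside a double-quoted `echo`, where any literal double quote has to be escaped to survive. As a sketch of an alternative, a here-document keeps both quote characters literal while still expanding variables. The function name `render_recovery_conf`, its two parameters, and the `fetch-wal` command are placeholders for illustration, not names from this commit.

```shell
#!/bin/bash
# Hypothetical helper, not part of the commit.

# render_recovery_conf PRIMARY_HOST STANDBY_NAME
# Print a recovery.conf body on stdout. The here-document delimiter is
# unquoted, so ${variables} expand, while ' and " are kept as literal text.
render_recovery_conf() {
  local primary_host="$1" standby_name="$2"
  cat << EOF
standby_mode = 'on'
primary_conninfo = 'host=''${primary_host}'' application_name=''${standby_name}'''
restore_command = 'fetch-wal "%f" "%p"'
recovery_target_timeline = 'latest'
EOF
}
```

The output could then be piped to `ssh_remote "$AZURE_MASTER" sudo tee ... > /dev/null` in the same way as the `echo` in the script.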