# 2018-08-02 STAGING failover attempt: failback

Failback, discarding changes made to GCP.

Since staging is multi-use and we want to run the failover multiple times, we need these steps anyway. If we discover a problem while doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.

The priority is to get the Azure site working again as quickly as possible. As the GCP side will be inaccessible, returning it to operation is of secondary importance.

This issue should not be closed until both the Azure and GCP sites are in full working order, including database replication between the two sites.
## Fail back to the Azure site
- [ ] ↩️ Fail-back Handler: Make the GCP environment inaccessible again, if necessary
  - Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
  - Production: ???
- [ ] ↩️ Fail-back Handler: Update the DNS entries to refer to the Azure load balancers
  - Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
  - Staging:
    - `staging.gitlab.com A 40.84.60.110`
    - `altssh.staging.gitlab.com A 104.46.121.194`
    - `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
  - Production:
    - `gitlab.com A 52.167.219.168`
    - `altssh.gitlab.com A 52.167.133.162`
    - `*.gitlab.io A 52.167.214.135`
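If console access is awkward mid-failback, the staging records above could also be applied with the AWS CLI. This is only a sketch: the zone ID and record values come from this runbook, but the `TTL` of 300 and the temp-file path are illustrative assumptions, and the final command is echoed rather than executed.

```shell
# Dry-run sketch: apply the staging DNS records via the AWS CLI instead of the
# console. Zone ID and values are from the runbook; TTL 300 is an assumption.
# "\052" is Route 53's escaped form of the "*" wildcard label.
cat > /tmp/failback-dns.json <<'EOF'
{"Changes": [
  {"Action": "UPSERT", "ResourceRecordSet": {"Name": "staging.gitlab.com.",
    "Type": "A", "TTL": 300, "ResourceRecords": [{"Value": "40.84.60.110"}]}},
  {"Action": "UPSERT", "ResourceRecordSet": {"Name": "altssh.staging.gitlab.com.",
    "Type": "A", "TTL": 300, "ResourceRecords": [{"Value": "104.46.121.194"}]}},
  {"Action": "UPSERT", "ResourceRecordSet": {"Name": "\\052.staging.gitlab.io.",
    "Type": "CNAME", "TTL": 300, "ResourceRecords": [{"Value": "pages01.stg.gitlab.com"}]}}
]}
EOF
# Echoed as a dry run; drop the `echo` to apply for real.
echo aws route53 change-resource-record-sets \
  --hosted-zone-id Z31LJ6JZ6X5VSQ --change-batch file:///tmp/failback-dns.json
```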
- [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
  - Start the primary Azure node:

    ```
    azure_primary# gitlab-ctl start postgresql
    ```
  - Remove nodes from the Azure repmgr cluster:

    ```
    azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
    ```
  - In a tmux or screen session on the Azure standby node, resync the database:

    ```
    azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
    ```

    Note: This can run for several hours. Do not wait for completion.
  - Remove the Azure nodes from the GCP cluster by running this on the GCP primary:

    ```
    gstg_primary# for nid in 895563110 912536887; do gitlab-ctl repmgr standby unregister --node=${nid}; done
    ```
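The repmgr node IDs hard-coded in the loops above are specific to this cluster. A hedged sketch of confirming them first, assuming the same `gitlab_repmgr` database and `repmgr_gitlab_cluster.repl_nodes` table used by the priority-update step later in this runbook; `gitlab_psql` here is a stub that only echoes, so the snippet runs anywhere.

```shell
# Confirm cluster-specific repmgr node IDs before unregistering them.
# gitlab_psql is a stub standing in for the real gitlab-psql binary;
# remove the stub when running on an actual database node.
gitlab_psql() { echo "would run: gitlab-psql $*"; }

gitlab_psql -d gitlab_repmgr \
  -c "SELECT id, type, name, priority FROM repmgr_gitlab_cluster.repl_nodes;"
```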
- [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it. Note: Skip this if you are introducing a postgresql split-brain (i.e. you performed the optional split above).
  - Ensure that the repmgr priorities for GCP are -1. Run the following on the current primary:

    ```
    # gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
    ```
  - Stop postgresql on the GSTG nodes:

    ```
    postgres-0{1,3}-db-gstg: gitlab-ctl stop postgresql
    ```
  - Start postgresql on the Azure staging primary node:

    ```
    gitlab-ctl start postgresql
    ```
  - Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:

    ```
    gitlab-ctl repmgr cluster show
    Role     | Name                                            | Upstream                     | Connection String
    ---------+-------------------------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------
    * master | postgres02.db.stg.gitlab.com                    |                              | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    FAILED   | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    FAILED   | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    standby  | postgres01.db.stg.gitlab.com                    | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    ```
- [ ] Start the Azure secondaries, one after the other:
  - Start postgresql on the Azure staging secondary node:

    ```
    gitlab-ctl start postgresql
    ```
  - Verify it replicates from the primary. On the primary, take a look at:

    ```
    SELECT * FROM pg_stat_replication;
    ```

    which should include the newly started secondary.
  - Production: Repeat the above for the other Azure secondaries, one after the other.
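The replication check above can be scripted rather than eyeballed. A minimal sketch, with `gitlab_psql` stubbed to print a sample hostname so the snippet is self-contained; on a real primary, delete the stub and keep the query.

```shell
# Check from the primary that an expected standby appears in pg_stat_replication.
# gitlab_psql is a stub for the real gitlab-psql binary, returning sample
# output; remove the stub when running on an actual database node.
gitlab_psql() { printf 'postgres01.db.stg.gitlab.com\n'; }  # stub sample output

standby=postgres01.db.stg.gitlab.com
if gitlab_psql -t -c "SELECT client_hostname FROM pg_stat_replication;" | grep -qx "$standby"; then
  echo "replication OK: ${standby} is attached"
else
  echo "replication MISSING: ${standby} not in pg_stat_replication"
fi
```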
- [ ] ↩️ Fail-back Handler: Verify that the DNS update has propagated and the Azure site is back online (e.g. `dig +short staging.gitlab.com` should return 40.84.60.110 again)
- [ ] ↩️ Fail-back Handler: Start sidekiq in Azure
  - Staging:

    ```
    knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"
    ```
  - Production:

    ```
    knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"
    ```
- [ ] ↩️ Fail-back Handler: Restore the Azure Pages load-balancer configuration
  - Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
  - Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
- [ ] ↩️ Fail-back Handler: Set the GitLab shared runner timeout back to 3 hours
- [ ] ↩️ Fail-back Handler: Restart automatic incremental GitLab Pages sync
  - Enable the cronjob on the Azure pages NFS server: run `sudo crontab -e` to get an editor window, then uncomment the line involving rsync
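If an editor session is undesirable, the rsync line could be uncommented non-interactively with sed. A sketch only: the sample crontab line below (schedule, paths, destination host) is invented for illustration, since the real crontab contents are not reproduced in this runbook.

```shell
# Hypothetical non-interactive alternative to `sudo crontab -e`.
# The sample line is made up; on the real pages NFS server this would be:
#   sudo crontab -l | sed 's/^#[[:space:]]*\(.*rsync.*\)$/\1/' | sudo crontab -
sample_cron='#*/5 * * * * rsync -a /var/opt/gitlab/pages/ gcp-pages:/var/opt/gitlab/pages/'
echo "$sample_cron" | sed 's/^#[[:space:]]*\(.*rsync.*\)$/\1/'
# prints the same line with the leading '#' removed
```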
- [ ] ↩️ Fail-back Handler: Update GitLab shared runners to expire jobs after 3 hours
  - In a Rails console, run:

    ```
    Ci::Runner.instance_type.update_all(maximum_timeout: 10800)
    ```

    (10800 seconds = 3 hours.)
- [ ] ↩️ Fail-back Handler: Enable access to the Azure environment from the outside world
  - Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
  - Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
- [ ] ↩️ Fail-back Handler: Turn the GCP site back into a secondary
  - Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch back to master.
  - Staging:
    - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989

      ```
      bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json
      ```
  - Production:
    - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218

      ```
      bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json
      ```
- [ ] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster
  - Remove the old data:

    ```
    rm -rf /var/opt/gitlab/postgresql/data
    ```
  - Re-initialize the database. Note: This step can take over an hour. Consider running it in a screen/tmux session.

    ```
    # su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
    ```
  - Start the database:

    ```
    gitlab-ctl start postgresql
    ```
  - Register the database with the cluster:

    ```
    gitlab-ctl repmgr standby register
    ```
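The four re-initialisation steps above can be collected into a single dry-run script. `run` only echoes each command instead of executing it; remove the stub to run for real, and only on a standby whose data directory you intend to destroy.

```shell
# Dry-run of the standby re-initialisation sequence above; `run` only echoes.
# Remove the stub (and run as root on the gstg standby) to execute for real.
run() { echo "would run: $*"; }

run rm -rf /var/opt/gitlab/postgresql/data
run su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr \
  -f /var/opt/gitlab/postgresql/repmgr.conf standby clone \
  --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' \
  -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
run gitlab-ctl start postgresql
run gitlab-ctl repmgr standby register
```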
- [ ] ↩️ Fail-back Handler: Reconfigure every changed gstg node:

  ```
  bundle exec knife ssh roles:gstg-base "sudo chef-client"
  ```
- [ ] ↩️ Fail-back Handler: Clear the cache on the gstg web nodes to correct the broadcast-message cache:

  ```
  sudo gitlab-rake cache:clear:redis
  ```
- [ ] ↩️ Fail-back Handler: Restart Unicorn and Sidekiq:

  ```
  bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'
  bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'
  ```
- [ ] ↩️ Fail-back Handler: Verify database replication is working
  - Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
- [ ] ↩️ Fail-back Handler: Verify that https://gstg.gitlab.com reports it is a secondary in the blue banner at the top of the page
- [ ] ↩️ Fail-back Handler: Confirm pgbouncer is talking to the correct hosts:

  ```
  sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432
  ```
  - SQL:

    ```
    SHOW DATABASES;
    ```
- [ ] ↩️ Fail-back Handler: It is now safe to delete the database server snapshots