2018-06-21 staging failover attempt: failback
Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As the GCP side will be inaccessible, returning it to operation is of secondary importance.
This issue should not be closed until both Azure and GCP sites are in full working order, including database replication between the two sites.
Fail back to the Azure site
- [ ] ↩️ Fail-back Handler: Make the GCP environment inaccessible again, if necessary
  - Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
  - Production: ???
- [ ] ↩️ Fail-back Handler: Update the DNS entries to refer to the Azure load balancer
  - Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
  - Staging: Change `staging.gitlab.com` and `registry.staging.gitlab.com`. They should be A records pointing to `40.84.60.110` (`fe01.stg.gitlab.com`).
  - Production: Change `gitlab.com` and `registry.gitlab.com`. They should be A records pointing to `52.167.219.168` (???).
- [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
  - [ ] Start the primary Azure node: `azure_primary# gitlab-ctl start postgresql`
  - [ ] Remove nodes from the Azure repmgr cluster: `azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done`
  - [ ] In a tmux or screen session on the Azure standby node, resync the database: `azure_standby# gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w`
    - Note: This can run for several hours. Do not wait for completion.
  - [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary:

    ```shell
    gstg_primary# for nid in 895563110 912536887; do gitlab-ctl repmgr standby unregister --node=${nid}; done
    ```
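The `unregister` loops above take numeric repmgr node IDs. As a hedged sketch of where those IDs come from (the `gitlab_repmgr` database and the `repmgr_gitlab_cluster` schema are the omnibus defaults, not confirmed by this issue; `PSQL` is overridable so the helper can be exercised without a live cluster):

```shell
# Sketch: list repmgr node IDs and names, to find values for the
# `--node=` arguments above. Assumes the omnibus defaults: cluster name
# gitlab_cluster (=> schema repmgr_gitlab_cluster) and database
# gitlab_repmgr. PSQL is overridable for offline testing.
: "${PSQL:=gitlab-psql -d gitlab_repmgr -t -c}"

list_node_ids() {
  $PSQL 'SELECT id, name FROM repmgr_gitlab_cluster.repl_nodes;'
}

# Usage on a database node:
#   list_node_ids
```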
- [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it. Note: Skip this if introducing a postgresql split-brain.
  - [ ] Stop postgresql on the GSTG nodes: `postgres-0{1,3}-db-gstg: gitlab-ctl stop postgresql`
  - [ ] Start postgresql on the Azure staging primary node: `gitlab-ctl start postgresql`
  - [ ] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:

    ```
       Role    | Name                                            | Upstream                     | Connection String
    -----------+-------------------------------------------------+------------------------------+---------------------------------------------------------------------------------------------------------
     * master  | postgres02.db.stg.gitlab.com                    |                              | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
       FAILED  | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
       FAILED  | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
       standby | postgres01.db.stg.gitlab.com                    | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    ```

  - [ ] Reinitialize the Azure standby node. Run this in screen/tmux; it can take over an hour. No need to wait for it to complete before continuing: `gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w`
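When rechecking the cluster later, the `FAILED` rows in the `gitlab-ctl repmgr cluster show` table are the nodes still needing attention. A small sketch that pulls those node names out of the output, assuming only the pipe-separated table layout shown above:

```shell
# Sketch: read `gitlab-ctl repmgr cluster show` output on stdin and print
# the names of nodes reported as FAILED. Only assumes the pipe-separated
# table layout shown in this runbook.
failed_nodes() {
  awk -F'|' '$1 ~ /FAILED/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}

# Usage: gitlab-ctl repmgr cluster show | failed_nodes
```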
- [ ] ↩️ Fail-back Handler: Verify that the DNS update has propagated and the Azure site is back online
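One way to check propagation, sketched with an injectable resolver (`RESOLVE` defaults to `dig`, which is an assumption about available tooling; the IPs are the ones listed in the DNS step earlier):

```shell
# Sketch: check whether a hostname already resolves to the expected
# Azure address. RESOLVE is overridable so the logic can be tested
# without network access.
: "${RESOLVE:=dig +short A}"

dns_matches() {
  # $1 = hostname, $2 = expected IP
  [ "$($RESOLVE "$1" | head -n 1)" = "$2" ]
}

# Usage (staging):
#   dns_matches staging.gitlab.com 40.84.60.110 && echo propagated
```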
- [ ] ↩️ Fail-back Handler: Start sidekiq in Azure
  - Staging: `knife ssh roles:stg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
  - Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
- [ ] ↩️ Fail-back Handler: Enable access to the Azure environment from the outside world
Restore the GCP site to being a working secondary
- [ ] ↩️ Fail-back Handler: Undo the chef-repo changes from https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
  - If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
  - Then run `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
- [ ] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster
  - [ ] Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
  - [ ] Re-initialize the database by running the following. Note: This step can take over an hour. Consider running it in a screen/tmux session.

    ```shell
    # su gitlab-psql -c "/opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
    ```

  - [ ] Start the database with `gitlab-ctl start postgresql`
  - [ ] Register the database with the cluster by running `gitlab-ctl repmgr standby register`
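After registering, it is worth confirming the node really came up as a standby. A sketch using PostgreSQL's `pg_is_in_recovery()` (assuming `gitlab-psql` is available on the node; `PSQL` is overridable for offline testing):

```shell
# Sketch: a node that cloned and registered correctly should be in
# recovery. PSQL defaults to the omnibus client but can be overridden.
: "${PSQL:=gitlab-psql -t -c}"

is_standby() {
  [ "$($PSQL 'SELECT pg_is_in_recovery();' | tr -d '[:space:]')" = "t" ]
}

# Usage on the freshly cloned node:
#   is_standby && echo "replicating as standby"
```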
- [ ] ↩️ Fail-back Handler: Reconfigure every changed gstg node
  - `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
- [ ] ↩️ Fail-back Handler: Clear the cache on gstg web nodes to correct broadcast message caches: `sudo gitlab-rake cache:clear:redis`
- [ ] ↩️ Fail-back Handler: Restart Unicorn and Sidekiq
  - `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
  - `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
- [ ] ↩️ Fail-back Handler: Verify database replication is working
  - Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
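Beyond the issue-creation check, replay delay on the GCP side can be sampled directly. A hedged sketch using PostgreSQL's `pg_last_xact_replay_timestamp()` (running it on the secondary and the `gitlab-psql` default are assumptions; `PSQL` is overridable for offline testing):

```shell
# Sketch: seconds of replay lag on a standby, from PostgreSQL's own
# replay timestamp. PSQL is overridable for offline testing.
: "${PSQL:=gitlab-psql -t -c}"

replication_lag_seconds() {
  $PSQL "SELECT COALESCE(ROUND(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())), 0);" \
    | tr -d '[:space:]'
}

# Usage on a GCP standby: replication_lag_seconds
```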
- [ ] ↩️ Fail-back Handler: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
- [ ] ↩️ Fail-back Handler: It is now safe to delete the database server snapshots