2018-06-21 staging failover attempt: failback
Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As the GCP side will be inaccessible, returning it to operation is of secondary importance.
This issue should not be closed until both Azure and GCP sites are in full working order, including database replication between the two sites.
Fail back to the Azure site
- [ ] ↩️ Fail-back Handler: Make the GCP environment inaccessible again, if necessary
  - Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
  - Production: ???
- [ ] ↩️ Fail-back Handler: Update the DNS entries to refer to the Azure load balancer
  - Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
  - Staging: Change `staging.gitlab.com` and `registry.staging.gitlab.com`. They should be A records pointing to `40.84.60.110` (`fe01.stg.gitlab.com`).
  - Production: Change `gitlab.com` and `registry.gitlab.com`. They should be A records pointing to `52.167.219.168` (???).
- [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
  - [ ] Start the primary Azure node: `azure_primary# gitlab-ctl start postgresql`
  - [ ] Remove nodes from the Azure repmgr cluster: `azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done`
  - [ ] In a tmux or screen session on the Azure standby node, resync the database: `azure_standby# gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w`
    - Note: This can run for several hours. Do not wait for completion.
  - [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary:

    ```shell
    gstg_primary# for nid in 895563110 912536887; do gitlab-ctl repmgr standby unregister --node=${nid}; done
    ```
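The `unregister` loops above take numeric repmgr node IDs. As a hedged sketch of where those IDs come from (the `gitlab_repmgr` database and the `repmgr_gitlab_cluster` schema are the omnibus defaults, not confirmed by this issue; `PSQL` is overridable so the helper can be exercised without a live cluster):

```shell
# Sketch: list repmgr node IDs and names, to find values for the
# `--node=` arguments above. Assumes the omnibus defaults: cluster name
# gitlab_cluster (=> schema repmgr_gitlab_cluster) and database
# gitlab_repmgr. PSQL is overridable for offline testing.
: "${PSQL:=gitlab-psql -d gitlab_repmgr -t -c}"

list_node_ids() {
  $PSQL 'SELECT id, name FROM repmgr_gitlab_cluster.repl_nodes;'
}

# Usage on a database node:
#   list_node_ids
```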
- [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it. Note: Skip this if introducing a postgresql split-brain.
  - [ ] Stop postgresql on the GSTG nodes: `postgres-0{1,3}-db-gstg: gitlab-ctl stop postgresql`
  - [ ] Start postgresql on the Azure staging primary node: `gitlab-ctl start postgresql`
  - [ ] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:

    ```
       Role    | Name                                            | Upstream                     | Connection String
    -----------+-------------------------------------------------+------------------------------+---------------------------------------------------------------------------------------------------------
     * master  | postgres02.db.stg.gitlab.com                    |                              | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
       FAILED  | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
       FAILED  | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
       standby | postgres01.db.stg.gitlab.com                    | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    ```

  - [ ] Reinitialize the Azure standby node. Run this in screen/tmux; it can take over an hour. No need to wait for it to complete before continuing: `gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w`
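When rechecking the cluster later, the `FAILED` rows in the `gitlab-ctl repmgr cluster show` table are the nodes still needing attention. A small sketch that pulls those node names out of the output, assuming only the pipe-separated table layout shown above:

```shell
# Sketch: read `gitlab-ctl repmgr cluster show` output on stdin and print
# the names of nodes reported as FAILED. Only assumes the pipe-separated
# table layout shown in this runbook.
failed_nodes() {
  awk -F'|' '$1 ~ /FAILED/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}

# Usage: gitlab-ctl repmgr cluster show | failed_nodes
```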
- [ ] ↩️ Fail-back Handler: Verify that the DNS update has propagated and the Azure site is back online
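One way to check propagation, sketched with an injectable resolver (`RESOLVE` defaults to `dig`, which is an assumption about available tooling; the IPs are the ones listed in the DNS step earlier):

```shell
# Sketch: check whether a hostname already resolves to the expected
# Azure address. RESOLVE is overridable so the logic can be tested
# without network access.
: "${RESOLVE:=dig +short A}"

dns_matches() {
  # $1 = hostname, $2 = expected IP
  [ "$($RESOLVE "$1" | head -n 1)" = "$2" ]
}

# Usage (staging):
#   dns_matches staging.gitlab.com 40.84.60.110 && echo propagated
```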
- [ ] ↩️ Fail-back Handler: Start sidekiq in Azure
  - Staging: `knife ssh roles:stg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
  - Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
- [ ] ↩️ Fail-back Handler: Enable access to the Azure environment from the outside world
Restore the GCP site to being a working secondary
- [ ] ↩️ Fail-back Handler: Undo the chef-repo changes from https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
  - If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
  - Then run `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
- [ ] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster
  - [ ] Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
  - [ ] Re-initialize the database by running the following. Note: This step can take over an hour. Consider running it in a screen/tmux session.

    ```shell
    # su gitlab-psql -c "/opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
    ```

  - [ ] Start the database with `gitlab-ctl start postgresql`
  - [ ] Register the database with the cluster by running `gitlab-ctl repmgr standby register`
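After registering, it is worth confirming the node really came up as a standby. A sketch using PostgreSQL's `pg_is_in_recovery()` (assuming `gitlab-psql` is available on the node; `PSQL` is overridable for offline testing):

```shell
# Sketch: a node that cloned and registered correctly should be in
# recovery. PSQL defaults to the omnibus client but can be overridden.
: "${PSQL:=gitlab-psql -t -c}"

is_standby() {
  [ "$($PSQL 'SELECT pg_is_in_recovery();' | tr -d '[:space:]')" = "t" ]
}

# Usage on the freshly cloned node:
#   is_standby && echo "replicating as standby"
```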
- [ ] ↩️ Fail-back Handler: Reconfigure every changed gstg node
  - `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
- [ ] ↩️ Fail-back Handler: Clear the cache on gstg web nodes to correct broadcast message caches: `sudo gitlab-rake cache:clear:redis`
- [ ] ↩️ Fail-back Handler: Restart Unicorn and Sidekiq
  - `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
  - `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
- [ ] ↩️ Fail-back Handler: Verify database replication is working
  - Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
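Beyond the issue-creation check, replay delay on the GCP side can be sampled directly. A hedged sketch using PostgreSQL's `pg_last_xact_replay_timestamp()` (running it on the secondary and the `gitlab-psql` default are assumptions; `PSQL` is overridable for offline testing):

```shell
# Sketch: seconds of replay lag on a standby, from PostgreSQL's own
# replay timestamp. PSQL is overridable for offline testing.
: "${PSQL:=gitlab-psql -t -c}"

replication_lag_seconds() {
  $PSQL "SELECT COALESCE(ROUND(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())), 0);" \
    | tr -d '[:space:]'
}

# Usage on a GCP standby: replication_lag_seconds
```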
- [ ] ↩️ Fail-back Handler: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
- [ ] ↩️ Fail-back Handler: It is now safe to delete the database server snapshots