
2018-08-09 STAGING switchover attempt: failback

Failback, discarding changes made to GCP

Since staging is multi-use and we want to run the failover multiple times, we need these steps anyway.

In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.

The priority is to get the Azure site working again as quickly as possible. As the GCP side will be inaccessible, returning it to operation is of secondary importance.

This issue should not be closed until both Azure and GCP sites are in full working order, including database replication between the two sites.

Fail back to the Azure site

  1. Fail-back Handler : Make the GCP environment inaccessible again, if necessary
    1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
    2. Production: ???
  2. Fail-back Handler : Update the DNS entries to refer to the Azure load balancer
    1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
    2. Staging
      • staging.gitlab.com A 40.84.60.110
      • altssh.staging.gitlab.com A 104.46.121.194
      • *.staging.gitlab.io CNAME pages01.stg.gitlab.com
    3. Production
      • gitlab.com A 52.167.219.168
      • altssh.gitlab.com A 52.167.133.162
      • *.gitlab.io A 52.167.214.135
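    4. Alternatively, the same records can be changed from the AWS CLI instead of the console. A minimal sketch for the staging A record (assumes awscli credentials with access to the hosted zone above; the TTL is a placeholder, keep whatever the existing record uses):

      aws route53 change-resource-record-sets \
        --hosted-zone-id Z31LJ6JZ6X5VSQ \
        --change-batch '{"Changes": [{"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "staging.gitlab.com.", "Type": "A", "TTL": 300,
          "ResourceRecords": [{"Value": "40.84.60.110"}]}}]}'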
  3. OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
    • Start the primary Azure node

      azure_primary# gitlab-ctl start postgresql
    • Remove the GCP nodes from the Azure repmgr cluster.

      azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
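
      Note: The node IDs above are staging-specific. If in doubt which IDs map to which hosts, they can be listed from the repmgr metadata first (read-only; uses the same repmgr_gitlab_cluster schema referenced later in this issue):

      azure_primary# gitlab-psql -d gitlab_repmgr -c "select id, name, type, active from repmgr_gitlab_cluster.repl_nodes"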
    • In a tmux or screen session on the Azure standby node, resync the database

      azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w

      Note: This can run for several hours. Do not wait for completion.

    • Remove Azure nodes from the GCP cluster by running this on the GCP primary

      gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
  4. Revert the postgresql failover, so that the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it. Note: Skip this step if deliberately introducing a postgresql split-brain (i.e. if you performed the optional split in step 3).
    1. Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:

      # gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
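
      To confirm the update took effect, the priorities can be read back (quick sanity check):

      # gitlab-psql -d gitlab_repmgr -c "select name, priority from repmgr_gitlab_cluster.repl_nodes order by name"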
    2. Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: gitlab-ctl stop postgresql

    3. Start postgresql on the Azure staging primary node: gitlab-ctl start postgresql

    4. Ensure that gitlab-ctl repmgr cluster show reports an Azure node as the primary:

      gitlab-ctl repmgr cluster show
      Role      | Name                                            | Upstream                     | Connection String
      ----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
      * master  | postgres02.db.stg.gitlab.com                    |                              | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
        FAILED  | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
        FAILED  | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
        standby | postgres01.db.stg.gitlab.com                    | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    5. Start Azure secondaries

      • Start postgresql on the Azure staging secondary node: gitlab-ctl start postgresql
      • Verify that it replicates from the primary: on the primary, SELECT * FROM pg_stat_replication should include the newly started secondary.
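
        A concrete check on the primary (sketch; the newly started secondary should show up by application_name/client_addr):

        azure_primary# gitlab-psql -c "SELECT application_name, client_addr, state FROM pg_stat_replication;"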
      • Production: Repeat the above for other Azure secondaries. Start one after the other.
  5. Fail-back Handler : Verify that the DNS update has propagated and that the Azure site is reachable again
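    • One way to check from a workstation (staging values shown; the production records are listed in step 2 above):

      dig +short staging.gitlab.com

      This should return 40.84.60.110 (the Azure load balancer) once the change is live.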
  6. Fail-back Handler : Start sidekiq in Azure
    • Staging: knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"
    • Production: knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"
  7. Fail-back Handler : Restore the Azure Pages load-balancer configuration
  8. Fail-back Handler : Set the GitLab shared runner timeout back to 3 hours
  9. Fail-back Handler : Restart automatic incremental GitLab Pages sync
    • Enable the cronjob on the Azure pages NFS server
    • sudo crontab -e to get an editor window, uncomment the line involving rsync
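    • To confirm the entry is active afterwards, sudo crontab -l | grep rsync should show the line uncommented.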
  10. Fail-back Handler : Update GitLab shared runners to expire jobs after 3 hours
    • In a Rails console, run:

      Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)
  11. Fail-back Handler : Enable access to the azure environment from the outside world

Restore the GCP site to being a working secondary

  1. Fail-back Handler : Turn the GCP site back into a secondary

  2. Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster

    1. Remove the old data with rm -rf /var/opt/gitlab/postgresql/data
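
      Before deleting anything, it is worth confirming that postgresql is actually stopped on the node (it was stopped during the fail-back steps above):

      # gitlab-ctl status postgresql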

    2. Re-initialize the database by running:

      Note: This step can take over an hour. Consider running it in a screen/tmux session.

      # su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
    3. Start the database with gitlab-ctl start postgresql

    4. Register the database with the cluster by running gitlab-ctl repmgr standby register
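
    5. Confirm the node has rejoined as a standby, using the same check as during fail-back:

      # gitlab-ctl repmgr cluster show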

  3. Fail-back Handler : Reconfigure every changed gstg node

    1. bundle exec knife ssh roles:gstg-base "sudo chef-client"
  4. Fail-back Handler : Clear the cache on gstg web nodes to correct the broadcast message cache

    • sudo gitlab-rake cache:clear:redis
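    • To hit all gstg web nodes in one pass, the knife search used for the unicorn restart below should also work here: bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-rake cache:clear:redis'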
  5. Fail-back Handler : Restart Unicorn and Sidekiq

    • bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'
    • bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'
  6. Fail-back Handler : Verify database replication is working

    1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
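
    2. Optionally, pull a summary of replication and sync state on a GCP node (assuming the Geo rake tasks are available on this installation):

      # gitlab-rake geo:status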
  7. Fail-back Handler : Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top

  8. Fail-back Handler : Confirm pgbouncer is talking to the correct hosts

    • sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432
    • SQL: SHOW DATABASES;
  9. Fail-back Handler : It is now safe to delete the database server snapshots
