2018-08-08 STAGING failover attempt: main procedure

Failover Team

Role	Assigned To
🐺 Coordinator	TEAM_COORDINATOR
🔪 Chef-Runner	TEAM_CHEF_RUNNER
☎ Comms-Handler	TEAM_COMMS_HANDLER
🐘 Database-Wrangler	TEAM_DATABASE_WRANGLER
☁ Cloud-conductor	TEAM_CLOUD_CONDUCTOR
🏆 Quality	TEAM_QUALITY
↩ Fail-back Handler (Staging Only)	TEAM_FAILBACK_HANDLER
🎩 Head Honcho (Production Only)	TEAM_HEAD_HONCHO

(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)

Immediately

Perform these steps when the issue is created.

🐺 Coordinator : Fill out the names of the failover team in the table above.
🐺 Coordinator : Fill out dates/times and links in this issue:
- Start Time: __MAINTENANCE_START_TIME__ & End Time: __MAINTENANCE_END_TIME__
- Google Working Doc: GOOGLE_DOC_URL (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- PRODUCTION ONLY Blog Post: BLOG_POST_URL
- PRODUCTION ONLY End Time: MAINTENANCE_END_TIME

Support Options

Provider	Plan	Details	Create Ticket
Microsoft Azure	Professional Direct Support	24x7, email & phone, 1 hour turnaround on Sev A	Create Azure Support Ticket
Google Cloud Platform	Gold Support	24x7, email & phone, 1hr response on critical issues	Create GCP Support Ticket

Database hosts

Staging

graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];

Production

graph TD;
  postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
  postgres02a --> postgres03a["postgres-03.db.prd"];
  postgres02a --> postgres04a["postgres-04.db.prd"];
  postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
  postgres01g --> postgres02g["postgres-02-db-gprd"];
  postgres01g --> postgres03g["postgres-03-db-gprd"];
  postgres01g --> postgres04g["postgres-04-db-gprd"];

Console hosts

The usual rails and database console access hosts are broken during the failover. Any shell commands should, instead, be run on the following machines by SSHing to them. Rails console commands should also be run on these machines, by SSHing to them and issuing a sudo gitlab-rails console command first.

Staging:
- Azure: web-01.sv.stg.gitlab.com
- GCP: web-01-sv-gstg.c.gitlab-staging-1.internal
Production:
- Azure: web-01.sv.prd.gitlab.com
- GCP: web-01-sv-gprd.c.gitlab-production.internal

Dashboards and debugging

These dashboards might be useful during the failover:
- Staging:
  - Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
  - GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
- Production:
  - Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
  - GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance:
- Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
- Production:
  - Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
  - Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
  - Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
  - Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
  - Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
The logs can be used to inspect any area of the stack in more detail:
- https://log.gitlab.net/

PRODUCTION ONLY T minus 3 weeks (Date TBD) 📁

Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
Ensure this issue has been created on dev.gitlab.org, since gitlab.com will be unavailable during the real failover!!!

PRODUCTION ONLY T minus 1 week (Date TBD) 📁

🔪 Chef-Runner : Scale up the gprd fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
☎ Comms-Handler : Communicate the failover date to Google
☎ Comms-Handler : Announce the failover date in #general on Slack and on the team call
☎ Comms-Handler : Marketing team publish blog post about upcoming GCP failover
☎ Comms-Handler : Marketing team sends out an email to all users notifying that GitLab.com will be undergoing scheduled maintenance. Email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
- Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
☎ Comms-Handler : Ensure that YouTube stream will be available for Zoom call
☎ Comms-Handler : Tweet blog post from @gitlab and @gitlabstatus
- Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__
🔪 Chef-Runner : Ensure the GCP environment is inaccessible to the outside world

T minus 1 day (Date TBD) 📁

🐺 Coordinator : Perform (or coordinate) Preflight Checklist
PRODUCTION ONLY ☎ Comms-Handler : Tweet from @gitlab.
- Tweet content from /opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh
PRODUCTION ONLY ☎ Comms-Handler : Retweet @gitlab tweet from @gitlabstatus with further details
- Tweet content from /opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh

T minus 3 hours (FAILOVER_DATE) 📁

STAGING FAILOVER TESTING ONLY: to speed up testing, this step can be done less than 3 hours before failover

🐺 Coordinator : Update GitLab shared runners to expire jobs after 1 hour
- In a Rails console, run:
- Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)

T minus 1 hour (FAILOVER_DATE) 📁

STAGING FAILOVER TESTING ONLY: to speed up testing, this step can be done less than 1 hour before failover

GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail and the artifacts may be lost. To avoid this as much as possible, we'll stop any new runner jobs from being picked up, starting an hour before the scheduled maintenance window.

PRODUCTION ONLY ☎ Comms-Handler : Tweet from @gitlabstatus
- As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__
☎ Comms-Handler : Post to #announcements on Slack:
- /opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh
PRODUCTION ONLY ☁ Cloud-conductor : Create a maintenance window in PagerDuty for GitLab Production service for 2 hours starting in an hour from now.
PRODUCTION ONLY ☁ Cloud-conductor : Create an alert silence for 2 hours starting in an hour from now with the following matcher(s):
- environment: prd
PRODUCTION ONLY 🔪 Chef-Runner : Silence production alerts
- Create an alert silence for 3 hours (starting now) with the following matchers:
  - provider: azure
  - alertname: High4xxApiRateLimit|High4xxRateLimit, check "Regex"
🔪 Chef-Runner : Stop any new GitLab CI jobs from being executed
- Block POST /api/v4/jobs/request
- Staging
  - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
  - knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'
- Production
  - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
  - knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'

☎ Comms-Handler : Create a broadcast message
- Staging: https://staging.gitlab.com/admin/broadcast_messages
- Production: https://gitlab.com/admin/broadcast_messages
- Text: gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from __MAINTENANCE_START_TIME__ on __FAILOVER_DATE__ UTC
- Start date: now
- End date: now + 3 hours

☁ Cloud-conductor : Initial snapshot of database disks in case of failback in Azure and GCP
- Staging: bin/snapshot-dbs staging
- Production: bin/snapshot-dbs production
🔪 Chef-Runner : Stop automatic incremental GitLab Pages sync
- Disable the cronjob on the Azure pages NFS server
- This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
- sudo crontab -e to get an editor window, comment out the line involving a pages-sync script
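The manual crontab edit above can also be done non-interactively. The following is a minimal sketch, not the team's actual tooling: `disable_pages_sync` is a hypothetical helper, and it assumes the job can be identified by the literal string `pages-sync` in its crontab line. It only transforms text, so the result can be reviewed before it is installed.

```shell
# Comment out every active crontab line that mentions "pages-sync".
# Reads a crontab on stdin and writes the edited crontab to stdout.
# Lines that are already commented out are left untouched.
# Assumption: the sync job is identifiable by the string "pages-sync".
disable_pages_sync() {
  sed -E '/^[[:space:]]*#/!{/pages-sync/s/^/# /;}'
}

# Possible usage on the Pages NFS server (file names are illustrative):
#   sudo crontab -l > /tmp/crontab.before
#   disable_pages_sync < /tmp/crontab.before > /tmp/crontab.after
#   diff /tmp/crontab.before /tmp/crontab.after   # review the change
#   sudo crontab /tmp/crontab.after
```

Keeping the "before" copy also gives the Fail-back Handler an easy way to restore the job.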

🔪 Chef-Runner : Start parallelized, incremental GitLab Pages sync
- Expected to take ~30 minutes. Run it in screen or tmux, on the Azure Pages NFS server.
- Updates to Pages after the transfer starts will be lost.
- The user running the rsync must have full sudo access on both the Azure and GCP Pages servers.
- Very manual; at present it looks roughly like the following:

Staging:

ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages

Production:

ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages

Failover Call

These steps will be run in a Zoom call. The 🐺 Coordinator runs the call, asking other roles to perform each step on the checklist at the appropriate time.

Changes are made one at a time, and verified before moving on to the next step. Whoever is performing a change should share their screen and explain their actions as they work through them. Everyone else should watch closely for mistakes or errors! A few things to keep an especially sharp eye out for:

- Exposed credentials (except short-lived items like 2FA codes)
- Running commands against the wrong hosts
- Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)

Remember that the intention is for the call to be broadcast live on the day. If you see something happening that shouldn't be public, mention it.

Roll call

☎ Comms-Handler : Make sure the YouTube stream is started
🐺 Coordinator : Ensure everyone mentioned above is on the call
🐺 Coordinator : Ensure the Zoom room host is on the call

Notify Users of Maintenance Window

PRODUCTION ONLY ☎ Comms-Handler : Tweet from @gitlabstatus
- GitLab.com will soon shut down for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube
PRODUCTION ONLY ☎ Comms-Handler : Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- GitLab.com planned maintenance for migration to @GCPcloud is starting. See you on the other side! We'll be live on YouTube

Monitoring

🐺 Coordinator : To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
- Staging: watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io
- Production: watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe0{1..9}.lb.gitlab.com fe{10..16}.lb.gitlab.com altssh0{1..2}.lb.gitlab.com

Health check

🐺 Coordinator : Ensure that there are no active alerts on the azure or gcp environment.
- Staging
  - GCP gstg: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
  - Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
- Production
  - GCP gprd: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
  - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D

T minus zero (failover day) (FAILOVER_DATE) 📁

We expect the maintenance window to last for up to 2 hours, starting from now.

Failover Procedure

Prevent updates to the primary

Phase 1: Block non-essential network access to the primary 📁

🔪 Chef-Runner : Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
- Staging
  - Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
  - Run knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'
  - Run knife ssh roles:staging-base-fe-git 'sudo chef-client'
- Production:
  - Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
  - Run knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'
  - Run knife ssh roles:gitlab-base-fe-git 'sudo chef-client'
🔪 Chef-Runner : Restart HAProxy on all LBs to terminate any on-going connections
- This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
- Staging: knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'
- Production: knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'
🐺 Coordinator : Ensure traffic from a non-VPN IP is blocked
- Check the non-VPN hostinfo output and verify that the SSH column reads No and the REDIRECT column shows it being redirected to the migration blog post

Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.

Phase 2: Commence Shutdown in Azure 📁

🔪 Chef-Runner : Stop mailroom on all the nodes
- /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh
🔪 Chef-Runner PRODUCTION ONLY: Stop sidekiq-pullmirror in Azure
- /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh
🐺 Coordinator : Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
- In a separate terminal on the deploy host: /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh
- The geo_sidekiq_cron_config job or an RSS kill may re-enable the crons, which is why we run it in a loop
- The loop should be stopped once sidekiq is shut down
- Wait for --> Status: PROCEED
🐺 Coordinator : Wait for repository verification on the primary to complete
- Staging: https://staging.gitlab.com/admin/geo_nodes - staging.gitlab.com node
- Production: https://gitlab.com/admin/geo_nodes - gitlab.com node
- Expand the Verification Info tab
- Wait for the number of unverified repositories to reach 0
- Resolve any repositories that have failed verification
🐺 Coordinator : Wait for all Sidekiq jobs to complete on the primary
- Staging: https://staging.gitlab.com/admin/background_jobs
- Production: https://gitlab.com/admin/background_jobs
- Press Queues -> Live Poll
- Wait for all queues not mentioned above to reach 0
- Wait for the number of Enqueued and Busy jobs to reach 0
- On staging, the repository verification queue may not empty
🐺 Coordinator : Handle Sidekiq jobs in the "retry" state
- Staging: https://staging.gitlab.com/admin/sidekiq/retries
- Production: https://gitlab.com/admin/sidekiq/retries
- NOTE: This tab may contain confidential information. Do this out of screen capture!
- Delete jobs in idempotent or transient queues (reactive_caching or repository_update_remote_mirror, for instance)
- Delete jobs in other queues that are failing due to application bugs (error contains NoMethodError, for instance)
- Press "Retry All" to attempt to retry all remaining jobs immediately
- Repeat until 0 retries are present
🔪 Chef-Runner : Stop sidekiq in Azure
- Staging: knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
- Production: knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
- Check that no sidekiq processes show in the GitLab admin panel
🐺 Coordinator : Stop the Sidekiq queue disabling loop from above

At this point, the primary can no longer receive any updates. This allows the state of the secondary to converge.

Finish replicating and verifying all data

Phase 3: Draining 📁

🐺 Coordinator : Flush CI traces in Redis to the database
- In a Rails console in Azure:
- ::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)
🐺 Coordinator : Wait for all repositories and wikis to become synchronized
- Staging: https://gstg.gitlab.com/admin/geo_nodes
- Production: https://gprd.gitlab.com/admin/geo_nodes
- Press "Sync Information"
- Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
- You can use sudo gitlab-rake geo:status instead if the UI is non-compliant
- If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- On staging, this may not complete
🐺 Coordinator : Wait for all repositories and wikis to become verified
- Press "Verification Information"
- Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
- You can use sudo gitlab-rake geo:status instead if the UI is non-compliant
- If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- On staging, verification may not complete
🐺 Coordinator : In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
🐘 Database-Wrangler : Ensure the prospective failover target in GCP is up to date
- Staging: postgres-01.db.gstg.gitlab.com
- Production: postgres-01-db-gprd.c.gitlab-production.internal
- sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"
- Assuming the clocks are in sync, this value should be close to 0
- If this is a large number, GCP may not have some data that is in Azure
🐺 Coordinator : Now disable all sidekiq-cron jobs on the secondary
- In a dedicated rails console on the secondary:
- loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }
- The loop should be stopped once sidekiq is shut down
- The geo_sidekiq_cron_config job or an RSS kill may re-enable the crons, which is why we run it in a loop
🐺 Coordinator : Wait for all Sidekiq jobs to complete on the secondary
- Staging: Navigate to https://gstg.gitlab.com/admin/background_jobs
- Production: Navigate to https://gprd.gitlab.com/admin/background_jobs
- Busy, Enqueued, Scheduled, and Retry should all be 0
- If a geo_metrics_update job is running, that can be ignored
🔪 Chef-Runner : Stop sidekiq in GCP
- This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
- Staging: knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
- Production: knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
- Check that no sidekiq processes show in the GitLab admin panel
🐺 Coordinator : Stop the Sidekiq queue disabling loop from above
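The replication-lag check in this phase (the `SELECT now() - pg_last_xact_replay_timestamp();` query) returns an interval that has to be judged by eye. A minimal sketch of turning that into a pass/fail check follows. The helper names and the 5-second default threshold are illustrative assumptions, not values from this runbook, and the parser assumes the lag is under one day so that psql prints plain `HH:MM:SS[.ffffff]`.

```shell
# Convert an interval printed as HH:MM:SS[.ffffff] into whole seconds.
# Fails (non-zero exit, no output) on anything else, including the empty
# value psql prints when the replay timestamp is NULL.
interval_to_seconds() {
  printf '%s\n' "$1" | awk -F: '
    NF == 3 { printf "%d\n", $1 * 3600 + $2 * 60 + $3; ok = 1 }
    END { if (!ok) exit 1 }'
}

# Succeed when the lag is at or below the threshold (seconds).
# $1: interval string, $2: threshold, defaulting to an illustrative 5.
lag_is_acceptable() {
  lag="$(interval_to_seconds "$1")" || return 2
  [ "$lag" -le "${2:-5}" ]
}

# Possible usage on the prospective failover target, assuming gitlab-psql
# passes -A (unaligned) and -t (tuples only) through to psql:
#   lag="$(sudo gitlab-psql -d gitlabhq_production -At \
#          -c 'SELECT now() - pg_last_xact_replay_timestamp();')"
#   lag_is_acceptable "$lag" && echo "replica is up to date"
```

As the checklist notes, the result is only meaningful if the clocks on both hosts are in sync.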

At this point, all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in sidekiq on the primary or secondary, and if we fail over, no data will be lost.

Stopping all cronjobs on the secondary means it will no longer attempt to run background synchronization operations against the primary, reducing the chance of errors while it is being promoted.

Promote the secondary

Phase 4: Reconfiguration, Part 1 📁

☁ Cloud-conductor : Incremental snapshot of database disks in case of failback in Azure and GCP
- Staging: bin/snapshot-dbs staging
- Production: bin/snapshot-dbs production
🔪 Chef-Runner : Ensure GitLab Pages sync is completed
- The incremental rsync commands set off above should be completed by now
- If still ongoing, the DNS update will cause some Pages sites to temporarily revert
☁ Cloud-conductor : Update DNS entries to refer to the GCP load-balancers
- Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
- Staging
  - staging.gitlab.com A 35.227.123.228
  - altssh.staging.gitlab.com A 35.185.33.132
  - *.staging.gitlab.io A 35.229.69.78
  - DO NOT change staging.gitlab.io.
- Production UNTESTED
  - gitlab.com A 35.231.145.151
  - altssh.gitlab.com A 35.190.168.187
  - *.gitlab.io A 35.185.44.232
  - DO NOT change gitlab.io.
🐘 Database-Wrangler : Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=100 where name like '%gstg%'"
```
🐘 Database-Wrangler : Gracefully turn off the Azure postgresql standby instances.
- Keep everything, just ensure it’s turned off
```
$ knife ssh "role:staging-base-db-postgres AND NOT fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
🐘 Database-Wrangler : Gracefully turn off the Azure postgresql primary instance.
- Keep everything, just ensure it’s turned off
```
$ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```

🐘 Database-Wrangler : After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.

Confirm gitlab-ctl repmgr cluster show reflects the desired state

Confirm pgbouncer node in GCP (Password is in 1password)

Staging: pgbouncer-01-db-gstg
Production: pgbouncer-01-db-gprd

$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production |            |       100 |            5 |           |               0 |                   0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production |            |       150 |            5 |           |               0 |                   0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
  S    | gitlab    | gitlabhq_production | idle  | PRIMARY_IP | 5432 | PGBOUNCER_IP |      54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 |    |      19430 |
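The `SHOW DATABASES;` comparison above can be checked mechanically once the output has been saved to a file. This is a rough sketch under stated assumptions: `check_pgbouncer_hosts` is a hypothetical helper, and it assumes the pipe-separated layout shown in the sample lines, with the database name in the first column and the host in the second.

```shell
# Read "SHOW DATABASES;" output on stdin and verify that every
# gitlabhq_production* row points at the expected primary IP ($1).
# Prints each mismatching row and exits non-zero if any row mismatches
# or if no gitlabhq_production* row was seen at all.
check_pgbouncer_hosts() {
  awk -F'|' -v want="$1" '
    {
      name = $1; host = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", host)
    }
    name ~ /^gitlabhq_production/ {
      seen++
      if (host != want) { printf "MISMATCH %s -> %s\n", name, host; bad++ }
    }
    END { if (seen > 0 && bad == 0) exit 0; exit 1 }
  '
}

# Possible usage, with the console output pasted into a file:
#   check_pgbouncer_hosts PRIMARY_IP_HERE < show_databases.txt
```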

🐘 Database-Wrangler : In case automated failover does not occur, perform a manual failover
- Promote the desired primary
```
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- Instruct the remaining standby nodes to follow the new primary
```
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
  Note: This will fail on the WAL-E node
🐘 Database-Wrangler : Check the database is now read-write
- Connect to the newly promoted primary in GCP
- sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"
- The result should be f (false), meaning the node is no longer in recovery
🔪 Chef-Runner : Update the chef configuration according to
- Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
- Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
🔪 Chef-Runner : Run chef-client on every node to ensure Chef changes are applied and all Geo secondary services are stopped
- STAGING knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'
- PRODUCTION UNTESTED knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'
🔪 Chef-Runner : Ensure that gitlab.rb has the correct external_url on all hosts
- Staging: knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2
- Production: knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2
🔪 Chef-Runner : Ensure that important processes have been restarted on all hosts
- Staging: knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3
- Production: knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3
- Unicorn
- Sidekiq
- Gitlab Pages
🔪 Chef-Runner : Fix the Geo node hostname for the old secondary
- This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
- In a Rails console in GCP:
  - Staging: GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")
  - Production: GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")
🔪 Chef-Runner : Flush any unwanted Sidekiq jobs on the promoted secondary
- Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)
🔪 Chef-Runner : Clear Redis cache of promoted secondary
- Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke
🔪 Chef-Runner : Start sidekiq in GCP
- This will automatically re-enable the disabled sidekiq-cron jobs
- Staging: knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"
- Production: knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"
- Check that sidekiq processes show up in the GitLab admin panel
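The external_url and gitlab-ctl status checks earlier in this phase produce one line per host, which is easy to misread across a large fleet. The sketch below filters that output down to the hosts that need attention. Both helpers are hypothetical, and they assume that knife ssh prefixes each output line with the host name and that gitlab-ctl status reports services in the usual `run: name:` / `down: name:` form.

```shell
# Read "HOST external_url 'URL'" lines on stdin and print every host whose
# URL differs from the expected one ($1). Exits non-zero if any differ.
# Commented-out settings ("HOST # external_url ...") are ignored.
find_wrong_external_url() {
  awk -v want="$1" '
    $2 == "external_url" {
      url = $3
      gsub(/["\047]/, "", url)            # strip single and double quotes
      if (url != want) { print $1; bad = 1 }
    }
    END { if (bad) exit 1 }
  '
}

# Read "HOST run|down: SERVICE: ..." lines on stdin and print
# "HOST SERVICE" for each service reported as down.
# Exits non-zero if any service is down.
find_down_services() {
  awk '
    $2 == "down:" { svc = $3; sub(/:$/, "", svc); print $1, svc; bad = 1 }
    END { if (bad) exit 1 }
  '
}

# Possible usage on staging:
#   knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' \
#     | find_wrong_external_url https://staging.gitlab.com
#   knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | find_down_services
```

An empty result with a zero exit status means every host matched. Some services are expected to be down on particular roles, so the second list still needs a human to read it.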

Health check

🐺 Coordinator : Check for any alerts that might have been raised and investigate them
- Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
- Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
- The old primary in the GCP environment, backed by WAL-E log shipping, will report "replication lag too large" and "unused replication slot". This is OK.

During-Blackout QA

Phase 5: Verification, Part 1 📁

The details of the QA tasks are listed in the test plan document.

🏆 Quality : All "during the blackout" QA automated tests have succeeded
🏆 Quality : All "during the blackout" QA manual tests have succeeded

Evaluation of QA results - Decision Point

Phase 6: Commitment 📁

If QA has succeeded, then we can continue to "Complete the Migration". If some QA has failed, the 🐺 Coordinator must decide whether to continue with the failover, or to abort, failing back to Azure. A decision to continue in these circumstances should be counter-signed by the 🎩 Head Honcho .

The top priority is to maintain data integrity. Failing back after the blackout window has ended is very difficult, and will result in any changes made in the interim being lost.

Don't Panic! Consult the failover priority list

Problems may be categorized into three broad causes - "unknown", "missing data", or "misconfiguration". Testers should focus on determining which bucket a failure falls into, as quickly as possible.

Failures with an unknown cause should be investigated further. If we can't determine the root cause within the blackout window, we should fail back.

We should abort for failures caused by missing data unless all the following apply:

- The scope is limited and well-known
- The data is unlikely to be missed in the very short term
- A named person will own back-filling the missing data on the same day

We should abort for failures caused by misconfiguration unless all the following apply:

- The fix is obvious and simple to apply
- The misconfiguration will not cause data loss or corruption before it is corrected
- A named person will own correcting the misconfiguration on the same day

If the number of failures seems high (double digits?), strongly consider failing back even if they each seem trivial - the causes of each failure may interact in unexpected ways.
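The per-failure rules above can be written down as a small decision helper, which may be handy when recording each QA failure in the working doc. This is only a sketch of the rules as stated here and does not replace the Coordinator's judgement. `qa_decision` and its argument convention are assumptions: it models the decision at the end of the blackout window, so a cause that is still unknown at that point maps to abort, and it does not model the "double digits" rule.

```shell
# Print "continue" or "abort" for a single QA failure.
# $1:     cause: unknown | missing-data | misconfiguration
# $2..$4: yes/no answers to the three conditions listed for that cause
# Anything other than three explicit "yes" answers results in "abort".
qa_decision() {
  cause="${1:-}"
  [ "$#" -gt 0 ] && shift
  case "$cause" in
    missing-data|misconfiguration)
      if [ "$#" -ne 3 ]; then echo abort; return 0; fi
      for answer in "$@"; do
        if [ "$answer" != "yes" ]; then echo abort; return 0; fi
      done
      echo continue
      ;;
    *)
      # Unknown (or unrecognised) cause: fail back.
      echo abort
      ;;
  esac
}

# Example: limited scope, not missed short-term, but nobody owns the back-fill.
#   qa_decision missing-data yes yes no
```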

Complete the Migration (T plus 2 hours)

Phase 7: Restart Mailing 📁

🔪 Chef-Runner : PRODUCTION ONLY Re-enable mailing queues on sidekiq-asap (revert chef-repo!1922)
1. emails_on_push queue
2. mailers queue
3. (admin_emails queue doesn't exist any more)
4. Rotate the password of the incoming@gitlab.com account and update the vault
5. Run chef-client and restart mailroom:
  - bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'
🐺 Coordinator: PRODUCTION ONLY Ensure the secondary can send emails
1. Run the following in a Rails console (changing you to yourself): Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now
2. Ensure you receive the email

Phase 8: Reconfiguration, Part 2 📁

Phase 9: Communicate 📁

🐺 Coordinator : Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
PRODUCTION ONLY ☎ Comms-Handler : Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- GitLab.com planned maintenance for migration to @GCPcloud is almost complete. GitLab.com is available although we're continuing to verify that all systems are functioning correctly. We're live on YouTube
PRODUCTION ONLY ☎ Comms-Handler : Tweet from @gitlabstatus
- GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube

Phase 10: Verification, Part 2 📁

Start After-Blackout QA. This is the second half of the test plan.
1. 🏆 Quality : Ensure all "after the blackout" QA automated tests have succeeded
2. 🏆 Quality : Ensure all "after the blackout" QA manual tests have succeeded

PRODUCTION ONLY Post migration

🐺 Coordinator : Close the failback issue - it isn't needed
☁ Cloud-conductor : Disable unneeded resources in the Azure environment
- The Pages LB proxy must be retained
- We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
- Unused machines can be switched off
☁ Cloud-conductor : Change GitLab settings: https://gprd.gitlab.com/admin/application_settings
- Metrics - Influx -> InfluxDB host should be performance-01-inf-gprd.c.gitlab-production.internal
