2018-07-28 PRODUCTION semifailover + failback attempt
NOTE: this is a partial failover attempt for production. The listed priorities are:
- Closing the front door properly
- Validating we have no Chef dependencies on GitLab.com when door is closed
- Draining the queues and timing it with workaround
- Reopening the front door
Steps will be removed as necessary to pare it down to this list of actions.
Failover Team
Role | Assigned To |
---|---|
 | @nick |
 | @ahmadsherif |
 | @glopezfernandez |
 | @jarv |
 | @ahmadsherif |
 | - |
 | @ahmadsherif |
 | @glopezfernandez |
(try to ensure that
Immediately
Perform these steps when the issue is created.
- [ ] 🐺 Coordinator: Fill out the names of the failover team in the table above.
- [ ] 🐺 Coordinator: Fill out dates/times and links in this issue:
  - `START_TIME` & `END_TIME`
  - `GOOGLE_DOC_LINK` (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
  - PRODUCTION ONLY: `LINK_TO_BLOG_POST`
  - PRODUCTION ONLY: `END_TIME`
Support Options
Provider | Plan | Details | Create Ticket |
---|---|---|---|
Microsoft Azure | Professional Direct Support | 24x7, email & phone, 1 hour turnaround on Sev A | Create Azure Support Ticket |
Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |
Database hosts
Production
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
Console hosts
The usual Rails and database console access hosts are broken during the failover. Any shell commands should instead be run on the following machines by SSHing to them. Rails console commands should also be run on these machines, by SSHing to them and issuing a `sudo gitlab-rails console` command first (a quick console sanity check is sketched below the host list).
- Production:
  - Azure: `web-01.sv.prd.gitlab.com`
  - GCP: `web-01-sv-gprd.c.gitlab-production.internal`
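Because the Azure and GCP consoles look identical, it is easy to run a command on the wrong side. Below is a minimal sanity check, assuming the standard GitLab EE Geo helpers (`Gitlab::Geo.primary?` / `Gitlab::Geo.secondary?`) are available on both consoles; it is an illustrative sketch, not part of the official runbook:

```ruby
# Run inside `sudo gitlab-rails console` on the host you just SSHed to.
# The Gitlab::Geo helpers report which role this node currently holds.
puts "Geo primary?   #{Gitlab::Geo.primary?}"   # expect true on the Azure (prd) console
puts "Geo secondary? #{Gitlab::Geo.secondary?}" # expect true on the GCP (gprd) console
```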
Grafana dashboards
These dashboards might be useful during the failover:
- Production:
PRODUCTION ONLY T minus 3 weeks (Date TBD)
- [ ] Notify content team of upcoming announcements to give them time to prepare blog post and email content: https://gitlab.com/gitlab-com/blog-posts/issues/523
- [ ] Ensure this issue has been created on dev.gitlab.org, since `gitlab.com` will be unavailable during the real failover!
**PRODUCTION ONLY** T minus 1 week (Date TBD)
- [ ] 🔪 Chef-Runner: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
- [ ] ☎ Comms-Handler: Communicate date to Google
- [ ] ☎ Comms-Handler: Announce the date of the failover in #general Slack and on the team call
- [ ] ☎ Comms-Handler: Marketing team publish blog post about upcoming GCP failover
- [ ] ☎ Comms-Handler: Marketing team sends out an email to all users notifying that GitLab.com will be undergoing scheduled maintenance. Email should include points on:
  - Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
  - Details of our backup policies to assure users that their data is safe
  - Details of specific situations with very long-running CI jobs which may lose their artifacts and logs if they don't complete before the maintenance window
- [ ] ☎ Comms-Handler: Ensure that the YouTube stream will be available for the Zoom call
- [ ] ☎ Comms-Handler: Tweet the blog post from `@gitlab` and `@gitlabstatus`:
  - Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from START_TIME - END_TIME UTC. Follow @gitlabstatus for more details. LINK_TO_BLOG_POST
- [ ] 🔪 Chef-Runner: Ensure the GCP environment is inaccessible to the outside world
T minus 1 day (Date TBD)
- [-] 🐺 Coordinator: Update GitLab shared runners to expire jobs after 1 hour (a verification sketch follows this list)
  - In a Rails console, run: `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Tweet from `@gitlab`:
  - Reminder: GitLab.com will be undergoing 2 hours maintenance tomorrow, from START_TIME - END_TIME UTC. Follow @gitlabstatus for more details. LINK_TO_BLOG_POST
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Retweet the `@gitlab` tweet from `@gitlabstatus` with further details:
  - Reminder: GitLab.com will be undergoing 2 hours maintenance tomorrow. We'll be live on YouTube. Working doc: GOOGLE_DOC_LINK, Blog: LINK_TO_BLOG_POST
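If the shared-runner timeout step above is executed, a quick check in the same Rails console can confirm the update took effect. This is a hedged, illustrative sketch rather than part of the original checklist:

```ruby
# After the update_all above, every instance-type (shared) runner should report
# a 1 hour cap; these two counts are expected to match.
total  = Ci::Runner.instance_type.count
capped = Ci::Runner.instance_type.where(maximum_timeout: 3600).count
puts "#{capped}/#{total} shared runners capped at 3600s"
```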
T minus 1 hour (Date TBD)
GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail and the artifacts may be lost. To avoid this as much as possible, we'll stop any new runner jobs from being picked up, starting an hour before the scheduled maintenance window.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Tweet from `@gitlabstatus`:
  - As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until END_TIME UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: GOOGLE_DOC_LINK
- [ ] ☎ Comms-Handler: Post to #announcements on Slack:
  - Production: GitLab.com is being migrated to GCP in *1 hour*. There is a 2-hour downtime window. We'll be live on YouTube. Notes in GOOGLE_DOC_LINK!
- [-] 🔪 Chef-Runner: Stop any new GitLab CI jobs from being executed
  - Block `POST /api/v4/jobs/request`
    - Staging
      - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
      - `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
    - Production
      - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
      - `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ Comms-Handler: Create a broadcast message (a console alternative is sketched below)
  - Staging: https://staging.gitlab.com/admin/broadcast_messages
  - Production: https://gitlab.com/admin/broadcast_messages
  - Text: gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from XX:XX on 2018-XX-YY UTC
  - Start date: now
  - End date: now + 3 hours
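If the admin UI is unavailable or slow, the same broadcast message can be created from a Rails console. This is a hedged sketch assuming the standard `BroadcastMessage` model with `message`, `starts_at` and `ends_at` attributes; the text and window mirror the item above:

```ruby
# Create the downtime broadcast message from a Rails console instead of the admin UI.
# Adjust the message text and the start/end window to match the announcement.
BroadcastMessage.create!(
  message:   "gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! " \
             "Hold on to your hats, we're going dark for approximately 2 hours from XX:XX on 2018-XX-YY UTC",
  starts_at: Time.now.utc,
  ends_at:   Time.now.utc + 3.hours
)
```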
T minus zero (failover day) (Date TBD)
We expect the maintenance window to last for up to 2 hours, starting from now.
Failover Procedure
These steps will be run in a Zoom call. Changes are made one at a time and verified before moving on to the next step. Whoever is performing a change should share their screen and explain their actions as they work through them. Everyone else should watch closely for mistakes or errors! A few things to keep an especially sharp eye out for:
- Exposed credentials (except short-lived items like 2FA codes)
- Running commands against the wrong hosts
- Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If you see something happening that shouldn't be public, mention it.
Roll call
- [ ] 🐺 Coordinator: Ensure everyone mentioned above is on the call
- [ ] 🐺 Coordinator: Ensure the Zoom room host is on the call
Notify Users of Maintenance Window
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Tweet from `@gitlabstatus`:
  - GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube
Monitoring
- [ ] 🐺 Coordinator: To monitor the state of DNS, network blocking, etc., run the below command on two machines - one with VPN access, one without.
  - Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..14}.lb.gitlab.com altssh{01..02}.lb.gitlab.com`
Health check
- [ ] 🐺 Coordinator: Ensure that there are no active alerts on the Azure or GCP environment.
  - GCP Production (`gprd`): https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
  - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
Prevent updates to the primary
Phase 1: Block non-essential network access to the primary
- [ ] 🔪 Chef-Runner: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
  - Production:
    - Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
    - Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
    - Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
- [ ] 🔪 Chef-Runner: Restart HAProxy on all LBs to terminate any on-going connections
  - This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
  - Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
- [ ] 🔪 Chef-Runner: Stop mailroom on all the nodes
  - Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
- [ ] 🐺 Coordinator: Ensure traffic from a non-VPN IP is blocked
  - Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
Phase 2: Commence Sidekiq Shutdown in Azure
- [ ] 🐺 Coordinator: Disable Sidekiq crons that may cause updates on the primary
  - In a separate Rails console on the primary, run: `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
  - The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
- [ ] 🐺 Coordinator: Wait for all Sidekiq jobs to complete on the primary (a console-based check is sketched after this phase's steps)
  - Production: https://gitlab.com/admin/background_jobs
  - Press `Queues -> Live Poll`
  - Wait for all queues not mentioned above to reach 0
  - Wait for the number of `Enqueued` and `Busy` jobs to reach 0
  - On staging, the repository verification queue may not empty
- [ ] 🐺 Coordinator: Handle Sidekiq jobs in the "retry" state
  - Production: https://gitlab.com/admin/sidekiq/retries
  - NOTE: This tab may contain confidential information. Do this out of screen capture!
  - Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
  - Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
  - Press "Retry All" to attempt to retry all remaining jobs immediately
  - Repeat until 0 retries are present
- [ ] 🔪 Chef-Runner: Stop Sidekiq in Azure
  - Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
  - Check that no Sidekiq processes show in the GitLab admin panel
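The two waiting steps above (crons disabled, queues and retries drained) can be cross-checked from a Rails console on the primary rather than the admin UI. This is a hedged sketch using the public Sidekiq and sidekiq-cron APIs; treat it as illustrative, not authoritative:

```ruby
require 'sidekiq/api'

# Crons: after the disable loop, only the Geo jobs should still be enabled.
enabled = Sidekiq::Cron::Job.all.select(&:enabled?).map(&:name)
puts "Enabled crons: #{enabled.inspect}"  # expect only ::Gitlab::Geo::CronManager::GEO_JOBS

# Queues: enqueued, busy and retrying work should all drain to zero.
stats = Sidekiq::Stats.new
puts "Enqueued: #{stats.enqueued}, Busy: #{Sidekiq::Workers.new.size}, Retries: #{Sidekiq::RetrySet.new.size}"

# Retries: delete jobs from known-idempotent queues, then retry the rest.
Sidekiq::RetrySet.new.each { |job| job.delete if %w[reactive_caching repository_update_remote_mirror].include?(job.queue) }
Sidekiq::RetrySet.new.retry_all
```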
At this point, the primary can no longer receive any updates. This allows the state of the secondary to converge.
Finish replicating and verifying all data
Phase 3: Draining
- [ ] 🐺 Coordinator: Ensure any data not replicated by Geo is replicated manually. We know about these:
  - CI traces in Redis: run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
- [ ] 🐺 Coordinator: Wait for all repositories and wikis to become synchronized
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Press "Sync Information"
  - Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
  - You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
  - If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- [ ] 🐺 Coordinator: Wait for all repositories and wikis to become verified
  - Press "Verification Information"
  - Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
  - You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- [ ] 🐺 Coordinator: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
- [ ] 🐘 Database-Wrangler: Ensure the prospective failover target in GCP is up to date
  - Production: on `postgres-01-db-gprd.c.gitlab-production.internal`, run `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
  - Assuming the clocks are in sync, this value should be close to 0
  - If this is a large number, GCP may not have some data that is in Azure
- [ ] 🐺 Coordinator: Now disable all sidekiq-cron jobs on the secondary
  - In a dedicated Rails console on the secondary, run: `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
  - The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
- [ ] 🐺 Coordinator: Wait for all Sidekiq jobs to complete on the secondary (a console check is sketched below)
  - Production: Navigate to https://gprd.gitlab.com/admin/background_jobs
  - Press `Queues -> Live Poll`
  - Wait for all queues to reach 0, excepting `emails_on_push` and `mailers` (which are disabled)
  - Wait for the number of `Enqueued` and `Busy` jobs to reach 0
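As with the primary, the secondary's drain state can be cross-checked from its Rails console. A minimal sketch, assuming the standard Sidekiq API and the `::Ci::BuildTraceChunk.redis` scope used earlier in this phase; illustrative only:

```ruby
require 'sidekiq/api'

# CI traces: after the use_database! migration, no chunks should remain in Redis.
puts "Trace chunks still in Redis: #{::Ci::BuildTraceChunk.redis.count}"

# Queues: everything except the intentionally disabled mailer queues should be empty.
ignored  = %w[emails_on_push mailers]
nonempty = Sidekiq::Queue.all.reject { |q| ignored.include?(q.name) || q.size.zero? }
puts nonempty.map { |q| "#{q.name}: #{q.size}" }.join(', ')
puts "Busy: #{Sidekiq::Workers.new.size}"
```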
At this point all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in Sidekiq on the primary or secondary, and if we fail over, no data will be lost.
Abort
We've tried everything we want to try at this point, so we need to restore Azure to its former state.
- [ ] Stop looping around the sidekiq-cron disable command (a sketch for re-enabling crons follows below)
- [ ] Restart Sidekiq in Azure
- [ ] Update HAProxy config to allow access to Azure from the outside world
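Once the disable loop has been stopped, the non-Geo crons on the Azure primary will stay disabled until something re-enables them. A hedged sketch for doing that explicitly from the primary's Rails console (the `geo_sidekiq_cron_config` job may also re-enable them on its own):

```ruby
# Re-enable every sidekiq-cron job on the Azure primary after aborting.
# Jobs that were deliberately left enabled (the Geo crons) are unaffected by enable!.
Sidekiq::Cron::Job.all.map(&:enable!)
```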