2018-08-04 PRODUCTION DRY RUN failover attempt: main procedure
DRY RUN on PRODUCTION!
The intent of this DRY RUN is to test out our process as best we can on the Production system without negatively impacting the system or doing the actual failover. It is also a time to run any processes (such as repository verification, etc) to get the system in a state ready to be migrated.
There will be a 1 hour maintenance window.
GREMLINS NOT INVITED!
What we want to accomplish in 1 hour
- Run through processes (blackout, queue draining, verification, etc.) without actually failing over
- Time how long it takes to drain Sidekiq queues
- Repository syncing and verification
- Re-enable the system by the end of the hour, or quicker!
Failover Team
| Role | Assigned To |
|---|---|
| | @bwalker |
| | @ahmadsherif |
| | @dawsmith |
| | @ibaum |
| | @ahmadsherif |
| | @remy |
| | @ahmadsherif |
| | @edjdev |
(try to ensure that
Immediately
Perform these steps when the issue is created.
- 🐺 Coordinator: Fill out the names of the failover team in the table above.
- 🐺 Coordinator: Fill out dates/times and links in this issue:
  - Start Time: 13h00 & End Time: 14h00
  - Google Working Doc: https://docs.google.com/document/d/18vGk6dQs7L0oGQOb_bNiFa5JhwLq5WBS7oNxQy09ml8/edit (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
  - PRODUCTION ONLY Blog Post: https://about.gitlab.com/2018/07/19/gcp-move-update/
  - PRODUCTION ONLY End Time: 14h00
Support Options
| Provider | Plan | Details | Create Ticket |
|---|---|---|---|
| Microsoft Azure | Professional Direct Support | 24x7, email & phone, 1 hour turnaround on Sev A | Create Azure Support Ticket |
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |
Database hosts
Staging
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
Production
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
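Before the window, it can help to confirm which streaming replicas are actually attached to the current production primary. A minimal check, assuming `gitlab-psql` is available on postgres-02.db.prd as it is on the other database hosts referenced in this issue (the WAL-E-fed GCP node is archive-based and will not appear in this view):

```bash
# On the current primary: list attached streaming replicas and their state.
sudo gitlab-psql -d gitlabhq_production \
  -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
```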
Console hosts
The usual Rails and database console access hosts are broken during the failover. Any shell commands should instead be run on the following machines by SSHing to them. Rails console commands should also be run on these machines, by SSHing to them and issuing a `sudo gitlab-rails console` command first (see the sketch after this list).

- Staging:
  - Azure: web-01.sv.stg.gitlab.com
  - GCP: web-01-sv-gstg.c.gitlab-staging-1.internal
- Production:
  - Azure: web-01.sv.prd.gitlab.com
  - GCP: web-01-sv-gprd.c.gitlab-production.internal
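For example, opening a Rails console on the GCP production console host looks roughly like this (a sketch; hostnames come from the list above, and it assumes your SSH access to the environment is already in place):

```bash
# SSH to the GCP production console host listed above...
ssh web-01-sv-gprd.c.gitlab-production.internal

# ...then, on that host, open a Rails console:
sudo gitlab-rails console

# Plain shell commands (rake tasks, gitlab-psql, etc.) are run on the same
# host, without opening the console first.
```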
Grafana dashboards
These dashboards might be useful during the failover:
- Staging:
- Production:
📁 PRODUCTION ONLY T minus 3 weeks (Date TBD)

- Notify content team of upcoming announcements to give them time to prepare blog post and email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
- Ensure this issue has been created on dev.gitlab.org, since gitlab.com will be unavailable during the real failover!!!
📁 PRODUCTION ONLY T minus 1 week (Date TBD)

- 🔪 Chef-Runner: Scale up the gprd fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
- ☎ Comms-Handler: Communicate date to Google
- ☎ Comms-Handler: Announce the date of the failover in #general Slack and on the team call
- ☎ Comms-Handler: Marketing team publishes blog post about the upcoming GCP failover
- ☎ Comms-Handler: Marketing team sends out an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
  - Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
  - Details of our backup policies to assure users that their data is safe
  - Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
- ☎ Comms-Handler: Ensure that the YouTube stream will be available for the Zoom call
- ☎ Comms-Handler: Tweet the blog post from @gitlab and @gitlabstatus:
  - "Reminder: GitLab.com will be undergoing 2 hours of maintenance on Saturday XX June 2018, from 13h00 - 14h00 UTC. Follow @gitlabstatus for more details. https://about.gitlab.com/2018/07/19/gcp-move-update/"
- 🔪 Chef-Runner: Ensure the GCP environment is inaccessible to the outside world
📁 T minus 1 day (Date TBD)

- 🐺 Coordinator: Perform (or coordinate) Preflight Checklist
- PRODUCTION ONLY ☎ Comms-Handler: Tweet from @gitlab
  - Tweet content from /opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh
- PRODUCTION ONLY ☎ Comms-Handler: Retweet the @gitlab tweet from @gitlabstatus with further details
  - Tweet content from /opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh
📁 T minus 3 hours (2018-08-04)

- 🐺 Coordinator: Update GitLab shared runners to expire jobs after 1 hour (a non-interactive variant is sketched below)
  - In a Rails console, run: `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
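If an interactive console is awkward during the window, the same change can be applied from the console host's shell. This is a sketch, assuming the omnibus `gitlab-rails runner` wrapper is available on that host:

```bash
# Set the 1-hour job timeout on all shared (instance-type) runners, then
# print how many runners still have a different value afterwards (expect 0).
sudo gitlab-rails runner '
  Ci::Runner.instance_type.update_all(maximum_timeout: 3600)
  puts Ci::Runner.instance_type.where.not(maximum_timeout: 3600).count
'
```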
📁 T minus 1 hour (2018-08-04)

GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail, and the artifacts may be lost. To avoid this as much as possible, we'll stop any new runner jobs from being picked up, starting an hour before the scheduled maintenance window.
- PRODUCTION ONLY ☎ Comms-Handler: Tweet from @gitlabstatus:
  - "As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until 14h00 UTC. GitLab.com will undergo maintenance in 1 hour."
- ☎ Comms-Handler: Post to #announcements on Slack: /opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh
- PRODUCTION ONLY ☁ Cloud-conductor: Create a maintenance window in PagerDuty for the GitLab Production service, for 2 hours starting an hour from now.
- PRODUCTION ONLY ☁ Cloud-conductor: Create an alert silence for 2 hours, starting an hour from now, with the following matcher(s):
  - environment: prd
- 🔪 Chef-Runner: Stop any new GitLab CI jobs from being executed
  - Block POST /api/v4/jobs/request
  - Production:
    - Create an alert silence for 3 hours (starting now) with the following matchers:
      - environment: prd
      - alertname: High4xxApiRateLimit|High4xxRateLimit, check "Regex"
    - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
    - knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'
- ☎ Comms-Handler: Create a broadcast message
  - Production: https://gitlab.com/admin/broadcast_messages
  - Text: gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 1 hour from 13:00 on 2018-08-04 UTC
  - Start date: now
  - End date: now + 3 hours
- ☁ Cloud-conductor: Initial snapshot of database disks in case of failback, in Azure and GCP
  - Production: bin/snapshot-dbs production
- 🔪 Chef-Runner: Stop automatic incremental GitLab Pages sync
  - Disable the cronjob on the Azure pages NFS server
  - Run sudo crontab -e to get an editor window, and comment out the line involving rsync (a non-interactive alternative is sketched after this list)
- 🔪 Chef-Runner: Start parallelized, incremental GitLab Pages sync
  - Expected to take ~30 minutes; run in screen/tmux, on the Azure pages NFS server!
  - Updates to Pages after the transfer starts will be lost.
  - The user running the rsync must have full sudo access on both the Azure and GCP pages hosts.
  - Very manual; looks a little like the following at present:
  - Production:

        ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
        tmux
        sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
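For the Pages cronjob step above, a possible non-interactive alternative to editing the crontab by hand (a sketch; it assumes the sync job is the only root crontab entry mentioning rsync, so review the listing before piping anything back):

```bash
# Show the current root crontab and confirm which line drives the Pages sync.
sudo crontab -l

# Comment out every line mentioning rsync and install the edited crontab.
sudo crontab -l | sed '/rsync/ s/^/#/' | sudo crontab -
```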
📁 T minus zero (failover day) (2018-08-04)

We expect the maintenance window to last for up to 2 hours, starting from now.
Failover Procedure
These steps will be run in a Zoom call.

Changes are made one at a time, and verified before moving on to the next step. Whoever is performing a change should share their screen and explain their actions as they work through them. Everyone else should watch closely for mistakes or errors! A few things to keep an especially sharp eye out for:
- Exposed credentials (except short-lived items like 2FA codes)
- Running commands against the wrong hosts
- Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If you see something happening that shouldn't be public, mention it.
Roll call
- 🐺 Coordinator: Ensure everyone mentioned above is on the call
- 🐺 Coordinator: Ensure the Zoom room host is on the call
Notify Users of Maintenance Window
- PRODUCTION ONLY ☎ Comms-Handler: Tweet from @gitlabstatus:
  - "GitLab.com will soon shutdown for planned maintenance for testing of the migration to @GCPcloud. See you soon!"
Monitoring
- 🐺 Coordinator: To monitor the state of DNS, network blocking, etc., run the below command on two machines - one with VPN access, one without (a plain-dig fallback is sketched below).
  - Production: watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..16}.lb.gitlab.com altssh{01..02}.lb.gitlab.com
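If the machine without VPN access doesn't have a checkout of the repo's bin/hostinfo helper, a rough fallback for the DNS portion of that check is a plain dig loop (a sketch; it covers name resolution only, not the SSH/REDIRECT columns that hostinfo reports):

```bash
# Re-resolve the key hostnames every 5 seconds to watch for DNS changes.
watch -n 5 '
  for h in gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io; do
    printf "%-28s %s\n" "$h" "$(dig +short "$h" | tr "\n" " ")"
  done
'
```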
Health check
- 🐺 Coordinator: Ensure that there are no active alerts on the Azure or GCP environment.
  - Production:
    - GCP gprd: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
    - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
Prevent updates to the primary
📁 Phase 1: Block non-essential network access to the primary
- 🔪 Chef-Runner: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
  - Production:
    - Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
    - Run knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'
    - Run knife ssh roles:gitlab-base-fe-git 'sudo chef-client'
- 🔪 Chef-Runner: Restart HAProxy on all LBs to terminate any ongoing connections
  - This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
  - Production: knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'
- 🐺 Coordinator: Ensure traffic from a non-VPN IP is blocked
  - Check the non-VPN hostinfo output and verify that the SSH column reads No and the REDIRECT column shows it being redirected to the migration blog post (a quick curl check is sketched below)
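A quick manual spot check from the machine without VPN access (a sketch; it assumes the HTTPS block is implemented as a redirect to the blog post, as described above, so the response should be a 3xx with a Location header):

```bash
# Fetch only the response headers for gitlab.com from a non-VPN IP and show
# the status line and Location header; expect a redirect to the blog post.
curl -sSI https://gitlab.com/ | grep -iE '^(HTTP|Location)'
```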
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
📁 Phase 2: Commence Shutdown in Azure
- 🔪 Chef-Runner: Stop mailroom on all the nodes
  - /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh
- 🔪 Chef-Runner PRODUCTION ONLY: Stop sidekiq-pullmirror in Azure
  - /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh
- 🐺 Coordinator: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind down:
  - In a separate terminal on the deploy host: /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh
  - The geo_sidekiq_cron_config job or an RSS kill may re-enable the crons, which is why we run it in a loop
  - Alternatively, in a separate Rails console on the primary: loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }
- 🐺 Coordinator: Wait for repository verification on the primary to complete
  - Production: https://gitlab.com/admin/geo_nodes - gitlab.com node
  - Expand the Verification Info tab
  - Wait for the number of unverified repositories to reach 0
  - Resolve any repositories that have failed verification
- 🐺 Coordinator: Wait for all Sidekiq jobs to complete on the primary
  - Production: https://gitlab.com/admin/background_jobs
  - Press Queues -> Live Poll
  - Wait for all queues not mentioned above to reach 0
  - Wait for the number of Enqueued and Busy jobs to reach 0
  - On staging, the repository verification queue may not empty
- 🐺 Coordinator: Handle Sidekiq jobs in the "retry" state
  - Production: https://gitlab.com/admin/sidekiq/retries
  - NOTE: This tab may contain confidential information. Do this out of screen capture!
  - Delete jobs in idempotent or transient queues (reactive_caching or repository_update_remote_mirror, for instance)
  - Delete jobs in other queues that are failing due to application bugs (error contains NoMethodError, for instance)
  - Press "Retry All" to attempt to retry all remaining jobs immediately
  - Repeat until 0 retries are present
- 🔪 Chef-Runner: Stop sidekiq in Azure
  - Production: knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
  - Check that no sidekiq processes show in the GitLab admin panel (a command-line cross-check is sketched below)
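As a cross-check alongside the admin panel, the same knife targeting can ask each Azure Sidekiq node for its service status (a sketch; gitlab-ctl status prints a down: line for stopped runit services):

```bash
# Query every Azure sidekiq node; each sidekiq-cluster line should read "down:".
knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl status sidekiq-cluster"
```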
At this point, the primary can no longer receive any updates. This allows the state of the secondary to converge.
Finish replicating and verifying all data
📁 Phase 3: Draining
- 🐺 Coordinator: Ensure any data not replicated by Geo is replicated manually. We know about these:
  - CI traces in Redis
    - Run ::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)
- 🐺 Coordinator: Wait for all repositories and wikis to become synchronized
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Press "Sync Information"
  - Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
  - You can use sudo gitlab-rake geo:status instead if the UI is non-compliant (see the sketch after this list)
  - If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- 🐺 Coordinator: Wait for all repositories and wikis to become verified
  - Press "Verification Information"
  - Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
  - You can use sudo gitlab-rake geo:status instead if the UI is non-compliant
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- 🐺 Coordinator: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
- 🐘 Database-Wrangler: Ensure the prospective failover target in GCP is up to date
  - Production: on postgres-01-db-gprd.c.gitlab-production.internal, run sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"
  - Assuming the clocks are in sync, this value should be close to 0
  - If this is a large number, GCP may not have some data that is in Azure
- 🐺 Coordinator: Now disable all sidekiq-cron jobs on the secondary
  - In a dedicated Rails console on the secondary: loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }
  - The geo_sidekiq_cron_config job or an RSS kill may re-enable the crons, which is why we run it in a loop
- 🐺 Coordinator: Wait for all Sidekiq jobs to complete on the secondary
  - Review the status of the running Sidekiq monitor script started in Phase 2, above; wait for --> Status: PROCEED
  - Need more details?
    - Production: Navigate to https://gprd.gitlab.com/admin/background_jobs
- 🔪 Chef-Runner: Stop sidekiq in GCP
  - This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
  - Staging: knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
  - Production: knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"
  - Check that no sidekiq processes show in the GitLab admin panel
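If the Geo admin UI is slow or unavailable, the sync and verification figures referenced in the items above can also be pulled from the command line on the GCP console host (a sketch; the hostname is taken from the console-hosts list earlier in this issue):

```bash
# SSH to the GCP production console host...
ssh web-01-sv-gprd.c.gitlab-production.internal

# ...then, on that host, print Geo sync and verification status for the
# secondary without relying on the admin UI.
sudo gitlab-rake geo:status
```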
At this point, all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in sidekiq on the primary or secondary, and if we fail over, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run background synchronization operations against the primary, reducing the chance of errors while it is being promoted.
Promote the secondary
Since this is a DRY RUN on Production, most steps from "Phase 4: Reconfiguration, Part 1" have been removed
📁 Phase 4: Reconfiguration, Part 1
- ☁ Cloud-conductor: Incremental snapshot of database disks in case of failback, in Azure and GCP
  - Production: bin/snapshot-dbs production
- 🔪 Chef-Runner: Ensure GitLab Pages sync is completed
  - The incremental rsync commands set off above should be completed by now
  - If still ongoing, the DNS update will cause some Pages sites to temporarily revert
Health check
- 🐺 Coordinator: Check for any alerts that might have been raised and investigate them
  - Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
  - The old primary in the GCP environment, backed by WAL-E log shipping, will report "replication lag too large" and "unused replication slot". This is OK.
During-Blackout QA
We will not be doing any QA this run
📁 Phase 6: Commitment

No decision to be made since we're not failing over.

If QA has succeeded, then we can continue to "Complete the Migration". If some QA has failed, the 🐺 Coordinator must decide whether to continue or to fail back, using the priority list below.
The top priority is to maintain data integrity. Failing back after the blackout window has ended is very difficult, and will result in any changes made in the interim being lost.
Don't Panic! Consult the failover priority list
Problems may be categorized into three broad causes - "unknown", "missing data", or "misconfiguration". Testers should focus on determining which bucket a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
- The scope is limited and well-known
- The data is unlikely to be missed in the very short term
- A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
- The fix is obvious and simple to apply
- The misconfiguration will not cause data loss or corruption before it is corrected
- A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing back even if they each seem trivial - the causes of each failure may interact in unexpected ways.
Restore System to pre-DRY RUN status
- Run the revised failback issue to re-enable the system
Complete the Migration (T plus 2 hours)
Since this is a DRY RUN on Production, all of "Phase 7: Restart Mailing" and "Phase 8: Reconfiguration, Part 2" have been removed.
📁 Phase 9: Communicate

- 🐺 Coordinator: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
- PRODUCTION ONLY ☎ Comms-Handler: Tweet from @gitlabstatus:
  - "GitLab.com's test migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly."
📁 Phase 10: Verification, Part 2

Not doing QA this run.
PRODUCTION ONLY Post migration
Since this is a DRY RUN on Production, all of this section has been removed.