# 2018-08-11 PRODUCTION failover attempt: failback

Issue: https://dev.gitlab.org/gitlab-com/migration/-/issues/96 (**only needed for the migration effort**)

# Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we
need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real"
(i.e. before opening it up to the public), it will also be super-useful to have
this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As
the GCP side will be inaccessible, returning it to operation is of secondary
importance.
This issue should not be closed until both Azure and GCP sites are in full
working order, including database replication between the two sites.
## Fail back to the Azure site
1. [x] ↩️ {+ Fail-back Handler +}: Make the GCP environment **inaccessible** again, if necessary
1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. Production: ???
1. [x] ↩️ {+ Fail-back Handler +}: Update the DNS entries to refer to the Azure load balancer
1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
1. Staging
- [x] `staging.gitlab.com A 40.84.60.110`
- [x] `altssh.staging.gitlab.com A 104.46.121.194`
- [x] `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
1. Production
- [ ] `gitlab.com A 52.167.219.168`
- [ ] `altssh.gitlab.com A 52.167.133.162`
- [ ] `*.gitlab.io A 52.167.214.135`
1. [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
- [ ] Start the primary Azure node
```shell
azure_primary# gitlab-ctl start postgresql
```
- [ ] Remove nodes from the Azure repmgr cluster.
```shell
azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
- [ ] In a tmux or screen session on the Azure standby node, resync the database
```shell
azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
```
**Note**: This can run for several hours. Do not wait for completion.
- [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary
```shell
gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
1. [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
**Note:** Skip this if introducing a postgresql split-brain
1. [x] Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
```
1. [x] Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
1. [x] Start postgresql on the Azure staging primary node `gitlab-ctl start postgresql`
1. [x] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:
```shell
gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
* master | postgres02.db.stg.gitlab.com | | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
standby | postgres01.db.stg.gitlab.com | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
1. [x] Start Azure secondaries
* Start postgresql on the Azure staging secondary node `gitlab-ctl start postgresql`
* Verify it replicates from the primary. On the primary take a look at `SELECT * FROM pg_stat_replication` which should include the newly started secondary.
* Production: Repeat the above for other Azure secondaries. Start one after the other.
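For the replication check above, a minimal sketch run on the Azure primary (assumes `gitlab-psql` is available there and that the application database is named `gitlabhq_production`):
```shell
# List connected standbys; the newly started secondary should appear with state "streaming"
azure_primary# gitlab-psql -d gitlabhq_production -c "SELECT application_name, client_addr, state FROM pg_stat_replication;"
```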
1. [x] ↩️ {+ Fail-back Handler +}: **Verify that the DNS update has propagated** and the Azure site is back online
1. [x] ↩️ {+ Fail-back Handler +}: Start sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
1. [x] ↩️ {+ Fail-back Handler +}: Restore the Azure Pages load-balancer configuration
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [x] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [ ] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
1. [x] ↩️ {+ Fail-back Handler +}: Enable access to the azure environment from the
outside world
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
1. [x] ↩️ {+ Fail-back Handler +}: Turn the GCP site back into a secondary
* Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
* `bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json`
1. [x] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal, and postgres-03-db-gstg.c.gitlab-staging-1.internal) as a standby in the repmgr cluster
1. Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
1. Re-initialize the database by running:
**Note:** This step can take over an hour. Consider running it in a screen/tmux session.
```shell
# su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
```
1. Start the database with `gitlab-ctl start postgresql`
1. Register the database with the cluster by running `gitlab-ctl repmgr standby register`
1. [x] ↩️ {+ Fail-back Handler +}: Reconfigure every changed gstg node
1. `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
1. [x] ↩️ {+ Fail-back Handler +}: Restart Unicorn and Sidekiq
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
1. [x] ↩️ {+ Fail-back Handler +}: Clear cache on gstg web nodes to correct broadcast message cache
* `sudo gitlab-rake cache:clear:redis`
1. [x] ↩️ {+ Fail-back Handler +}: Verify database replication is working
1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
1. [x] ↩️ {+ Fail-back Handler +}: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
1. [x] ↩️ {+ Fail-back Handler +}: Confirm pgbouncer is talking to the correct hosts
* `sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432`
* SQL: `SHOW DATABASES;`
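A non-interactive variant of the check above, as a sketch (same connection parameters as the command in this item):
```shell
# Print the pgbouncer database map in one shot; the host column should point at the expected backend
sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432 -c 'SHOW DATABASES;'
```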
1. [x] ↩️ {+ Fail-back Handler +}: It is now safe to delete the database server snapshots

# 2018-08-11 PRODUCTION switchover attempt: main procedure

Issue: https://dev.gitlab.org/gitlab-com/migration/-/issues/94 (**only needed for the migration effort**)

# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | @nick |
| 🔪 Chef-Runner | @ahmadsherif |
| ☎ Comms-Handler | @dawsmith |
| 🐘 Database-Wrangler | @jarv |
| ☁ Cloud-conductor | @ahmadsherif |
| 🏆 Quality | @remy |
| ↩ Fail-back Handler | @ahmadsherif |
| 🎩 Head Honcho | @edjdev |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [x] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [x] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `10h00` & End Time: `12h00`
- Google Working Doc: https://docs.google.com/document/d/1CzkieGnqJStAh3pMwgg-v62-HTpBdFDoP9smGaFE1Ko/edit (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- Blog Post: https://about.gitlab.com/2018/07/19/gcp-move-update/
- End Time: 12h00
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** |[Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Production:
* Azure: `web-01.sv.prd.gitlab.com`
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
# Dashboards and debugging
* These dashboards might be useful during the failover:
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
* Production:
* Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
* Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
* Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
* Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
* Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# T minus 3 weeks [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [x] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# T minus 1 week [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [x] ☎ {+ Comms-Handler +}: communicate date to Google
1. [x] ☎ {+ Comms-Handler +}: announce in #general slack and on team call date of failover.
1. [x] ☎ {+ Comms-Handler +}: Marketing team publish blog post about upcoming GCP failover
1. [x] ☎ {+ Comms-Handler +}: Marketing team sends out an email to all users notifying that GitLab.com will be undergoing scheduled maintenance. Email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
- Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [x] ☎ {+ Comms-Handler +}: Ensure that YouTube stream will be available for Zoom call
1. [x] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from 10h00 - 12h00 UTC. Follow @gitlabstatus for more details. https://about.gitlab.com/2018/07/19/gcp-move-update/`
1. [x] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (2018-08-10) [📁](bin/scripts/02_failover/030_t-1d)
1. [x] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [x] ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [x] ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (2018-08-11) [📁](bin/scripts/02_failover/040_t-3h)
1. [x] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)`
# T minus 1 hour (2018-08-11) [📁](bin/scripts/02_failover/050_t-1h)
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [x] ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until 12h00 UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: https://docs.google.com/document/d/1CzkieGnqJStAh3pMwgg-v62-HTpBdFDoP9smGaFE1Ko/edit`
1. [x] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [x] ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [x] ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
1. [x] 🔪 {+ Chef-Runner +}: Silence production alerts
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
* `provider`: `azure`
* `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
1. [x] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [x] ☎ {+ Comms-Handler +}: Create a broadcast message
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from 10h00 on 2018-08-11 UTC`
* Start date: now
* End date: now + 3 hours
1. [x] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Production: `bin/snapshot-dbs production`
1. [x] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` to get an editor window, comment out the line involving a pages-sync script
1. [x] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Production:
```
ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
## Failover Call
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [x] ☎ {+ Comms-Handler +}: make sure the YouTube stream is started
- [x] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [x] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [x] ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`
1. [x] ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is starting. See you on the other side! We'll be live on YouTube`
### Monitoring
- [x] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe0{1..9}.lb.gitlab.com fe{10..16}.lb.gitlab.com altssh0{1..2}.lb.gitlab.com`
### Health check
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the azure or gcp environment.
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
# T minus zero (failover day) (2018-08-11) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Switchover Procedure
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [x] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [x] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [x] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
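As a rough cross-check of the non-VPN `hostinfo` output, a hedged `curl` sketch from a machine without VPN access (it should end up redirected to the migration blog post):
```shell
# Follow redirects and print the final status code and URL reached
curl -sIL -o /dev/null -w '%{http_code} %{url_effective}\n' https://gitlab.com/
```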
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [x] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [x] 🔪 {+ Chef-Runner +}: Stop `sidekiq-pullmirror` in Azure
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [x] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
* Wait for `--> Status: PROCEED`
1. [x] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Production: https://performance.gitlab.net/d/000000286/gcp-failover-azure?orgId=1&var-environment=prd
* Wait for the number of `unverified` repositories and wikis to reach 0
* Resolve any repositories that have `failed` verification
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Production: https://gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
1. [x] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Production: https://gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
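If the UI is too slow for the clean-up above, a hedged Rails-console equivalent run via `gitlab-rails runner` on the Azure console host (the queue names are the examples from this item):
```shell
# Delete retries in the idempotent/transient queues, then retry everything that remains
sudo gitlab-rails runner 'require "sidekiq/api";
  rs = Sidekiq::RetrySet.new;
  rs.select { |j| %w[reactive_caching repository_update_remote_mirror].include?(j.queue) }.each(&:delete);
  rs.retry_all'
```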
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [x] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [x] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [x] 🐺 {+ Coordinator +}: Reconcile negative registry entries
* Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
1. [x] 🐺 {+ Coordinator +}: Fill event log gaps manually
* Script in https://gitlab.com/gitlab-com/migration/merge_requests/201/diffs :scream\_cat:
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Production: Grafana dashboard: https://dashboards.gitlab.net/d/l8ifheiik/geo-status?refresh=5m&orgId=1
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Production: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can also use `sudo gitlab-rake geo:status`
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. [x] 🐺 {+ Coordinator +}: Ensure the whole event log has been processed
* In Azure: `Geo::EventLog.maximum(:id)`
* In GCP: `Geo::EventLogState.last_processed.id`
* The two numbers should be the same
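A sketch of the comparison above, run from the console hosts listed at the top of this issue:
```shell
# Azure console host: highest event written by the primary
azure_console$ sudo gitlab-rails runner 'puts Geo::EventLog.maximum(:id)'
# GCP console host: highest event processed by the secondary
gcp_console$ sudo gitlab-rails runner 'puts Geo::EventLogState.last_processed.id'
```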
1. [x] 🐺 {+ Coordinator +}: Ensure the prospective failover target in GCP is up to date
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/check-wal-secondary-sync.sh`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [x] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
* If a `geo_metrics_update` job is running, that can be ignored
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [x] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [x] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Production: `bin/snapshot-dbs production`
1. [x] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [x] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Production **UNTESTED**
- [x] `gitlab.com A 35.231.145.151`
- [x] `altssh.gitlab.com A 35.190.168.187`
- [x] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
1. [x] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/update-priority.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-priority.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off on the secondaries. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and other nodes should automatically follow.
- [x] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/confirm-repmgr.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/connect-pgbouncers.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [x] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
```bash
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-gcp-recovery.sh
```
1. [x] 🔪 {+ Chef-Runner +}: Update the chef configuration according to the following merge request
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [x] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* Production: `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that Unicorn processes have been restarted on all hosts
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
1. [x] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
* In a Rails console in GCP:
* Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [x] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [x] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [x] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* [ ] Check that sidekiq processes show up in the GitLab admin panel
#### Health check
1. [x] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [x] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [x] `emails_on_push` queue
1. [ ] `mailers` queue
1. [ ] (`admin_emails` queue doesn't exist any more)
1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
1. [ ] Run chef-client and restart mailroom:
* `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: Ensure the secondary can send emails
1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
1. [x] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [x] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [x] Update in the running database
- [x] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [x] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [x] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Production:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
* `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
* Check that https://test-azure-pages-proxy.ur.gs/ and http://test-azure-pages-proxy.ur.gs/ continue to work
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [x] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [x] ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is almost complete. GitLab.com is available although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
1. [ ] ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA.** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [x] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
* The Pages LB proxy must be retained
* We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
* Unused machines can be switched off
1. [x] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gitlab.com/admin/application_settings](https://gitlab.com/admin/application_settings)
* Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`

# 2018-08-11 PRODUCTION switchover attempt: preflight checks

Issue: https://dev.gitlab.org/gitlab-com/migration/-/issues/93 (**only needed for the migration effort**)

# Pre-flight checks
## Dashboards and Alerts
1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the azure or gcp environment.
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
- Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [ ] 🐺 {+Coordinator+}: Review the switchover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
## GitLab Version and CDN Checks
1. [x] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`); there have been problems in the past with auto-restarting the databases, so now they only get updated in a controlled way
- Versions can be confirmed using the Omnibus version tracker dashboards:
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
1. [x] 🐺 {+Coordinator+}: Ensure that the Fastly CDN IP ranges are up-to-date.
- Check the following chef roles against the official IP list https://api.fastly.com/public-ip-list
- Production
- GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
## Object storage
1. [x] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below,
execute the line first on the primary console and copy the result; then execute the same line on the secondary console with ` ==` appended, pasting in the result from the primary. You should get `true`; `false` means the configurations differ. A sketch of this flow follows the list.
1. [x] `Gitlab.config.uploads`
1. [x] `Gitlab.config.lfs`
1. [x] `Gitlab.config.artifacts`
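A sketch of that flow for the first setting (the output shown is illustrative only; repeat for `lfs` and `artifacts`):
```shell
# On the primary console (sudo gitlab-rails console):
#   Gitlab.config.uploads
#   => {"enabled"=>true, "object_store"=>{...}}          <- copy this output
# On the secondary console, append `==` and paste the primary's output:
#   Gitlab.config.uploads == {"enabled"=>true, "object_store"=>{...}}
#   => true                                              <- false means the sides differ
```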
1. [ ] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
* If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
1. [ ] `Upload.with_files_stored_locally.count` # => 0
1. [x] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
1. [ ] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
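A hedged one-liner that prints all three counters above in one go, from a console host:
```shell
sudo gitlab-rails runner 'puts({uploads: Upload.with_files_stored_locally.count, lfs: LfsObject.with_files_stored_locally.count, artifacts: Ci::JobArtifact.with_files_stored_locally.count}.inspect)'
```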
## Pre-migrated services
1. [x] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)
## Configuration checks
1. [x] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
* A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
1. [x] Manually compare the diff of `/etc/gitlab/gitlab.rb`
1. [x] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [x] 🐺 {+Coordinator+}: Check SSH host keys match
* Production:
- [x] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
- [x] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
1. [x] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
* `Feature.enabled?(:geo_repository_verification)`
1. [x] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
* 300 seconds is fine
* Production:
- [x] `gitlab.com`
- [x] `altssh.gitlab.com`
- [x] `gitlab-org.gitlab.io`
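A quick way to eyeball the TTLs above, assuming `dig` is available wherever you run it (the TTL is the second column of each answer):
```shell
dig +noall +answer gitlab.com A altssh.gitlab.com A gitlab-org.gitlab.io A
```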
1. [x] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
* Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
* Production:
- [x] `bin/check-ssl gprd.gitlab.com:443`
- [x] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
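If `bin/check-ssl` is not to hand, a hedged `openssl` alternative: present the primary hostname (via SNI) to the secondary's endpoint and inspect the certificate names returned:
```shell
echo | openssl s_client -connect gprd.gitlab.com:443 -servername gitlab.com 2>/dev/null \
  | openssl x509 -noout -subject -dates
echo | openssl s_client -connect gprd.gitlab.com:443 -servername gitlab.com 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```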
1. [x] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
* `chef-client role:gitlab-base pwd`
1. [x] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
1. [x] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [x] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
* Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [x] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
1. Check the omnibus version running in the target environment
* Production: `knife role show gprd-omnibus-version | grep version:`
1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
* E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
* The command can fail because the patches may have already been applied; that's OK.
1. [x] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
* Production:
* [x] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
* [x] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [x] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [x] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [x] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [x] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [x] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
## Ensure Geo replication is up to date
1. [x] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
* Create a test issue on the primary and wait for it to appear on the secondary
* This should take less than 5 minutes at most
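Besides the test-issue method above, a hedged database-level spot check (not part of the documented procedure) can be run on a GCP standby; the hostname prompt here is illustrative:
```shell
# Approximate replication delay as seen by the standby
gprd_standby# gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```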
1. [ ] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
* `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
* `Scheduled` jobs should not be present, or should all be scheduled to be run before the switchover starts
* Production: https://gitlab.com/admin/background_jobs
* From a rails console: `Sidekiq::Stats.new`
* "Dead" jobs will be lost on switchover but can be ignored as we routinely ignore them
* "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [x] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
* Production: https://gitlab.com/admin/geo_nodes
* Observe the "Sync Information" tab for the secondary
* See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
1. [x] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
* Production: this may fluctuate around 0 due to background upload. This is OK.
1. [x] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
* In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other.
1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
* Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
## Verify the integrity of replicated repositories and wikis
1. [x] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Review the numbers under the `Verification Information` tab for the
**secondary** node
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
## Perform an automated QA run against the current infrastructure
1. [x] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [x] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During switchover" automated QA run against.
## Schedule the switchover
1. [x] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [x] 🐺 {+Coordinator+}: Pick a date and time for the switchover itself that won't interfere with the release team's work.
1. [x] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [x] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [x] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)Nick ThomasNick Thomas2018-08-10https://dev.gitlab.org/gitlab-com/migration/-/issues/922018-08-09 STAGING failover attempt: main procedure2018-08-09T11:27:46ZJohn Jarvis2018-08-09 STAGING failover attempt: main procedure# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordina...# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | @nick |
| 🔪 Chef-Runner | @ahmadsherif |
| ☎ Comms-Handler | @dawsmith |
| 🐘 Database-Wrangler | @jarv |
| ☁ Cloud-conductor | @ahmadsherif |
| 🏆 Quality | @meks |
| ↩ Fail-back Handler (_Staging Only_) | @ahmadsherif |
| 🎩 Head Honcho (_Production Only_) | @edjdev |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [ ] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `__MAINTENANCE_START_TIME__` & End Time: `__MAINTENANCE_END_TIME__`
- Google Working Doc: __GOOGLE_DOC_URL__ (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: __BLOG_POST_URL__
- **PRODUCTION ONLY** End Time: __MAINTENANCE_END_TIME__
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** |[Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
* Azure: `web-01.sv.prd.gitlab.com`
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
# Dashboards and debugging
* These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
* Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
* Production:
* Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
* Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
* Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
* Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
* Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google
1. [ ] ☎ {+ Comms-Handler +}: announce in #general slack and on team call date of failover.
1. [ ] ☎ {+ Comms-Handler +}: Marketing team publish blog post about upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Marketing team sends out an email to all users notifying that GitLab.com will be undergoing scheduled maintenance. Email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
- Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Ensure that YouTube stream will be available for Zoom call
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)`
# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours, starting an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours, starting an hour from now, with the following matcher(s):
- `environment`: `prd`
1. [ ] **PRODUCTION ONLY** 🔪 {+ Chef-Runner +}: Silence production alerts
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
* `provider`: `azure`
* `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from __MAINTENANCE_START_TIME__ on __FAILOVER_DATE__ UTC`
* Start date: now
* End date: now + 3 hours
1. [ ] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` to get an editor window, comment out the line involving a pages-sync script
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Staging:
```
ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
* Production:
```
ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
## Failover Call
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [ ] ☎ {+ Comms-Handler +}: make sure the YouTube stream is started
- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is starting. See you on the other side! We'll be live on YouTube`
### Monitoring
- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking, etc., run the below command on two machines: one with VPN access, one without (a rough sketch of what `hostinfo` reports follows this list).
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe0{1..9}.lb.gitlab.com fe{10..16}.lb.gitlab.com altssh0{1..2}.lb.gitlab.com`
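`bin/hostinfo` is a helper script in the migration repo and isn't reproduced here; as a rough, illustrative approximation (an assumption, not the actual script), it reports per hostname the current DNS answer, whether SSH is reachable, and where HTTPS redirects to:
```shell
#!/usr/bin/env bash
# Illustrative approximation of the per-host checks reported by bin/hostinfo (not the real script)
for host in "$@"; do
  ip=$(dig +short "$host" A | head -1)                                                     # current DNS answer
  ssh=$(timeout 3 bash -c "exec 3<>/dev/tcp/$host/22" 2>/dev/null && echo Yes || echo No)  # SSH port reachable?
  redirect=$(curl -s -o /dev/null -m 5 -w '%{redirect_url}' "https://$host/")              # HTTPS redirect target
  printf '%s\t%s\tSSH:%s\tREDIRECT:%s\n' "$host" "$ip" "$ssh" "$redirect"
done
```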
### Health check
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the azure or gcp environment.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
* Wait for `--> Status: PROCEED`
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://performance.gitlab.net/d/000000286/gcp-failover-azure?orgId=1&var-environment=stg
* Production: https://performance.gitlab.net/d/000000286/gcp-failover-azure?orgId=1&var-environment=prd
* Wait for the number of `unverified` repositories and wikis to reach 0
* Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [ ] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Staging: https://staging.gitlab.com/admin/sidekiq/retries
* Production: https://gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present (a console-based alternative is sketched below)
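If the admin UI is awkward for this, the same clean-up can be done from a Rails console on the primary. This is only an illustrative sketch (the queue names and the `gitlab-rails runner` invocation are examples, not part of the official runbook):
```shell
# Delete retries from idempotent/transient queues, then ask Sidekiq to retry everything that remains
sudo gitlab-rails runner "
  retries = Sidekiq::RetrySet.new
  retries.select { |job| %w[reactive_caching repository_update_remote_mirror].include?(job.queue) }.each(&:delete)
  retries.retry_all
"
```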
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Staging: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can also use `sudo gitlab-rake geo:status`
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/check-wal-secondary-sync.sh`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure (see the example query below)
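`check-wal-secondary-sync.sh` lives in the migration repo and isn't reproduced here. As an assumption about what it measures, the replication delay on the prospective GCP primary can be read with a query along these lines (the database name is the omnibus default):
```shell
# On the prospective GCP primary (still a standby at this point): seconds since the last replayed transaction.
# A value close to 0 means it is keeping up with the Azure primary.
gitlab-psql -d gitlabhq_production -c \
  "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replication_lag_seconds;"
```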
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
* If a `geo_metrics_update` job is running, that can be ignored
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [ ] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Staging
- [ ] `staging.gitlab.com A 35.227.123.228`
- [ ] `altssh.staging.gitlab.com A 35.185.33.132`
- [ ] `*.staging.gitlab.io A 35.229.69.78`
- **DO NOT** change `staging.gitlab.io`.
* Production **UNTESTED**
- [ ] `gitlab.com A 35.231.145.151`
- [ ] `altssh.gitlab.com A 35.190.168.187`
- [ ] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
1. [ ] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary (a sketch of the underlying SQL follows the block):
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/update-priority.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-priority.sh
```
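The `update-priority.sh` and `check-priority.sh` scripts aren't reproduced here. Judging from the equivalent manual steps elsewhere in this plan, they presumably adjust and then display the `priority` column in `repmgr_gitlab_cluster.repl_nodes`; a rough sketch (the exact values and name patterns are assumptions) would be:
```shell
# Make the GCP nodes the preferred failover targets (use '%gprd%' for production), then confirm
gitlab-psql -d gitlab_repmgr -c \
  "UPDATE repmgr_gitlab_cluster.repl_nodes SET priority = 100 WHERE name LIKE '%gstg%';"
gitlab-psql -d gitlab_repmgr -c \
  "SELECT name, type, priority FROM repmgr_gitlab_cluster.repl_nodes ORDER BY priority DESC;"
```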
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off on the secondaries. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: After timeout of 30 seconds, repmgr should failover primary to the chosen node in GCP, and other nodes should automatically follow.
- [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/confirm-repmgr.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/connect-pgbouncers.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write (see the note after the block)
```bash
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-gcp-recovery.sh
```
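`check-gcp-recovery.sh` isn't reproduced here; conceptually (an assumption based on the script name), the check is that the promoted node has left recovery mode and accepts writes:
```shell
# On the newly promoted GCP primary: should return "f" once promotion has completed
gitlab-psql -d gitlabhq_production -c "SELECT pg_is_in_recovery();"
```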
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
* **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [ ] Unicorn
* [ ] Sidekiq
* [ ] Gitlab Pages
1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
* In a Rails console in GCP:
* Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
* Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* [ ] Check that sidekiq processes show up in the GitLab admin panel
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
1. [ ] `mailers` queue
1. [ ] (`admin_emails` queue doesn't exist any more)
1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
1. [ ] Run chef-client and restart mailroom:
* `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: **PRODUCTION ONLY** Ensure the secondary can send emails
1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
1. [ ] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
- [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
* Production:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
* `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is almost complete. GitLab.com is available although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
* The Pages LB proxy must be retained
* We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
* Unused machines can be switched off
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
* Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`

https://dev.gitlab.org/gitlab-com/migration/-/issues/91
2018-08-09 STAGING switchover attempt: failback (gcp-migration-bot, 2018-08-11T13:02:27Z) **only needed for the migration effort**
# Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we
need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real"
(i.e. before opening it up to the public), it will also be super-useful to have
this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As
the GCP side will be inaccessible, returning it to operation is of secondary
importance.
This issue should not be closed until both Azure and GCP sites are in full
working order, including database replication between the two sites.
## Fail back to the Azure site
1. [ ] ↩️ {+ Fail-back Handler +}: Make the GCP environment **inaccessible** again, if necessary
1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. Production: ???
1. [ ] ↩️ {+ Fail-back Handler +}: Update the DNS entries to refer to the Azure load balancer
1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
1. Staging
- [ ] `staging.gitlab.com A 40.84.60.110`
- [ ] `altssh.staging.gitlab.com A 104.46.121.194`
- [ ] `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
1. Production
- [ ] `gitlab.com A 52.167.219.168`
- [ ] `altssh.gitlab.com A 52.167.133.162`
- [ ] `*.gitlab.io A 52.167.214.135`
1. [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
- [ ] Start the primary Azure node
```shell
azure_primary# gitlab-ctl start postgresql
```
- [ ] Remove nodes from the Azure repmgr cluster.
```shell
azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
- [ ] In a tmux or screen session on the Azure standby node, resync the database
```shell
azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
```
**Note**: This can run for several hours. Do not wait for completion.
- [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary
```shell
gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
1. [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
**Note:** Skip this if introducing a postgresql split-brain
1. [ ] Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
```
1. [ ] Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
1. [ ] Start postgresql on the Azure staging primary node `gitlab-ctl start postgresql`
1. [ ] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:
```shell
gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
* master | postgres02.db.stg.gitlab.com | | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
standby | postgres01.db.stg.gitlab.com | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
1. [ ] Start Azure secondaries
* Start postgresql on the Azure staging secondary node `gitlab-ctl start postgresql`
* Verify it replicates from the primary. On the primary, take a look at `SELECT * FROM pg_stat_replication`, which should include the newly started secondary (see the example below).
* Production: Repeat the above for other Azure secondaries. Start one after the other.
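A minimal version of the replication check mentioned above (the column selection is illustrative):
```shell
# On the Azure primary: each restarted secondary should show up with state 'streaming'
gitlab-psql -d gitlabhq_production -c \
  "SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;"
```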
1. [ ] ↩️ {+ Fail-back Handler +}: **Verify that the DNS update has propagated** and the Azure site is back online (see the example below)
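A quick way to spot-check propagation from outside the VPN (illustrative; any public resolver can be substituted):
```shell
# The answers should match the Azure records set earlier in this checklist
dig +short staging.gitlab.com A @8.8.8.8
dig +short altssh.staging.gitlab.com A @8.8.8.8
dig +short gitlab-org.staging.gitlab.io CNAME @8.8.8.8
```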
1. [ ] ↩️ {+ Fail-back Handler +}: Start sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
1. [ ] ↩️ {+ Fail-back Handler +}: Restore the Azure Pages load-balancer configuration
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [ ] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [ ] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
1. [ ] ↩️ {+ Fail-back Handler +}: Enable access to the azure environment from the
outside world
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
1. [ ] ↩️ {+ Fail-back Handler +}: Turn the GCP site back into a secondary
* Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
* `bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json`
1. [ ] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal, and postgres-03-db-gstg.c.gitlab-staging-1.internal) as a standby in the repmgr cluster
1. Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
1. Re-initialize the database by running:
**Note:** This step can take over an hour. Consider running it in a screen/tmux session.
```shell
# su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
```
1. Start the database with `gitlab-ctl start postgresql`
1. Register the database with the cluster by running `gitlab-ctl repmgr standby register`
1. [ ] ↩️ {+ Fail-back Handler +}: Reconfigure every changed gstg node
1. `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
1. [ ] ↩️ {+ Fail-back Handler +}: Clear cache on gstg web nodes to correct broadcast message cache
* `sudo gitlab-rake cache:clear:redis`
1. [ ] ↩️ {+ Fail-back Handler +}: Restart Unicorn and Sidekiq
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
1. [ ] ↩️ {+ Fail-back Handler +}: Verify database replication is working
1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
1. [ ] ↩️ {+ Fail-back Handler +}: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
1. [ ] ↩️ {+ Fail-back Handler +}: Confirm pgbouncer is talking to the correct hosts
* `sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432`
* SQL: `SHOW DATABASES;`
1. [ ] ↩️ {+ Fail-back Handler +}: It is now safe to delete the database server snapshots

https://dev.gitlab.org/gitlab-com/migration/-/issues/89
2018-08-09 STAGING failover attempt: main procedure (gcp-migration-bot, 2018-08-09T18:11:59Z; assignee: Ahmad Sherif) **only needed for the migration effort**
# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | @nick |
| 🔪 Chef-Runner | @ahmadsherif |
| ☎ Comms-Handler | @dawsmith |
| 🐘 Database-Wrangler | @jarv |
| ☁ Cloud-conductor | @ahmadsherif |
| 🏆 Quality | @remy |
| ↩ Fail-back Handler (_Staging Only_) | @ahmadsherif |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [x] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [x] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `1300` & End Time: `1500`
- Google Working Doc: https://docs.google.com/document/d/18vGk6dQs7L0oGQOb_bNiFa5JhwLq5WBS7oNxQy09ml8/edit (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: https://about.gitlab.com/2018/07/19/gcp-move-update/
- **PRODUCTION ONLY** End Time: 1500
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
# Dashboards and debugging
* These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
* Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [x] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
# T minus 3 hours (2018-08-09) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [x] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)`
# T minus 1 hour (2018-08-09) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [x] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [x] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2374
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
- [x] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from 1300 on 2018-08-09 UTC`
* Start date: now
* End date: now + 3 hours
1. [x] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
1. [x] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` to get an editor window, comment out the line involving a pages-sync script
1. [x] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Staging:
```
ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
## Failover Call
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [ ] ☎ {+ Comms-Handler +}: make sure the YouTube stream is started
- [x] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [x] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Monitoring
- [x] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
### Health check
1. [x] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the azure or gcp environment.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
# T minus zero (failover day) (2018-08-09) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [x] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2375
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
1. [x] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
1. [x] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [x] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [x] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
* Wait for `--> Status: PROCEED`
1. [x] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://performance.gitlab.net/d/000000286/gcp-failover-azure?orgId=1&var-environment=stg
* Wait for the number of `unverified` repositories and wikis to reach 0
* Resolve any repositories that have `failed` verification
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [x] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Staging: https://staging.gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [x] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [x] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Staging: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can also use `sudo gitlab-rake geo:status`
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [x] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [x] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/check-wal-secondary-sync.sh`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [x] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
* If a `geo_metrics_update` job is running, that can be ignored
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [x] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [x] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
1. [x] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [x] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Staging
- [ ] `staging.gitlab.com A 35.227.123.228`
- [ ] `altssh.staging.gitlab.com A 35.185.33.132`
- [ ] `*.staging.gitlab.io A 35.229.69.78`
- **DO NOT** change `staging.gitlab.io`.
1. [x] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/update-priority.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-priority.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off on the secondaries. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: After timeout of 30 seconds, repmgr should failover primary to the chosen node in GCP, and other nodes should automatically follow.
- [x] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/confirm-repmgr.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/connect-pgbouncers.sh
```
1. [x] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [x] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [x] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [x] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
```bash
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-gcp-recovery.sh
```
1. [x] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2376
1. [x] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [x] Unicorn
* [ ] Sidekiq
* [ ] Gitlab Pages
1. [x] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
* In a Rails console in GCP:
* Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [x] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [x] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [x] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* [ ] Check that sidekiq processes show up in the GitLab admin panel
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
- [x] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
(all prod-only)
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [x] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. [x] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [x] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [x] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
(all removed)

2018-08-09 STAGING failover attempt: preflight checks (https://dev.gitlab.org/gitlab-com/migration/-/issues/88)
# Pre-flight checks
## Dashboards and Alerts
1. [x] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the Azure or GCP environments.
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
- Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
1. [x] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
## GitLab Version and CDN Checks
1. [x] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`): there have been problems in the past with auto-restarting the databases, so they now only get updated in a controlled way. (A `knife`-based version spot-check is sketched after this list.)
- Versions can be confirmed using the Omnibus version tracker dashboards:
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
1. [x] 🐺 {+Coordinator+}: Ensure that the Fastly CDN IP ranges are up-to-date.
- Check the following chef roles against the official ip list https://api.fastly.com/public-ip-list
- Staging
- GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
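Where the dashboards are ambiguous, the installed Omnibus version can also be spot-checked from a host with `knife` configured. A minimal sketch, assuming the Omnibus version manifest lives at `/opt/gitlab/version-manifest.txt` on every node:
```shell
# Print the installed package and version for every gstg node, sorted so mismatches stand out.
knife ssh roles:gstg-base 'sudo head -1 /opt/gitlab/version-manifest.txt' | sort -k 2
```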
## Object storage
1. [ ] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below, execute it first on the primary console and copy the result, then execute the same line on the secondary console with `==` and the pasted primary result appended. The comparison should return `true`; a `false` means the configurations differ and must be investigated. (An alternative `diff`-based sketch follows this list.)
1. [x] `Gitlab.config.uploads`
1. [x] `Gitlab.config.lfs`
1. [ ] `Gitlab.config.artifacts`
1. [x] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
* If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
* On staging, these numbers are non-zero. Just mark as checked.
1. [x] `Upload.with_files_stored_locally.count` # => 0
1. [x] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
1. [x] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
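As an alternative to pasting `==` comparisons by hand, the relevant sections can be dumped as JSON on each side and diffed. A minimal sketch using only the `Gitlab.config` keys listed above (the output path is just an example):
```shell
# Run once on the primary console host and once on the secondary console host, then diff the two files.
sudo gitlab-rails runner '
  puts({ uploads:   Gitlab.config.uploads,
         lfs:       Gitlab.config.lfs,
         artifacts: Gitlab.config.artifacts }.to_json)
' > /tmp/object-storage-config.json
```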
## Pre-migrated services
1. [x] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)
## Configuration checks
1. [x] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
* A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary (a `diff`-over-SSH sketch is included at the end of this section):
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab.rb`
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [x] 🐺 {+Coordinator+}: Check SSH host keys match
* Staging:
- [x] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
- [x] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
1. [x] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
* `Feature.enabled?(:geo_repository_verification)`
1. [x] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
* 300 seconds is fine
* Staging:
- [x] `staging.gitlab.com`
- [x] `altssh.staging.gitlab.com`
- [x] `gitlab-org.staging.gitlab.io`
1. [x] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
* Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
* Staging:
- [x] `bin/check-ssl gstg.gitlab.com:443`
- [x] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
1. [x] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
* `chef-client role:gitlab-base pwd`
1. [x] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
1. [x] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [x] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
* Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
1. [x] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
1. Check the omnibus version running in the target environment
* Staging: `knife role show gstg-omnibus-version | grep version:`
1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
* E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
* The command can fail because the patches may have already been applied; that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
* Staging:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2374)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2375)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2376)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
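For the `/etc/gitlab` comparison above, a `diff`-over-SSH sketch from a workstation with access to both environments (the host names below are the staging console hosts and are examples only; it assumes passwordless sudo on both nodes):
```shell
# Requires bash (process substitution).
diff <(ssh web-01.sv.stg.gitlab.com 'sudo cat /etc/gitlab/gitlab.rb') \
     <(ssh web-01-sv-gstg.c.gitlab-staging-1.internal 'sudo cat /etc/gitlab/gitlab.rb')
diff <(ssh web-01.sv.stg.gitlab.com 'sudo cat /etc/gitlab/gitlab-secrets.json') \
     <(ssh web-01-sv-gstg.c.gitlab-staging-1.internal 'sudo cat /etc/gitlab/gitlab-secrets.json')
```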
## Ensure Geo replication is up to date
1. [x] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
* Create a test issue on the primary and wait for it to appear on the secondary
* This should take less than 5 minutes at most
1. [x] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
* `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
* `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
* Staging: https://staging.gitlab.com/admin/background_jobs
* From a rails console: `Sidekiq::Stats.new`
* "Dead" jobs will be lost on failover but can be ignored as we routinely ignore them
* "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [x] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://staging.gitlab.com/admin/geo_nodes
* Observe the "Sync Information" tab for the secondary
* See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
* Staging: some failures and unsynced repositories are expected
1. [x] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
* Staging: some failures and unsynced files are expected
1. [x] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
* In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other.
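A minimal sketch of the event-log comparison above, using `gitlab-rails runner` from the console hosts so the numbers can be captured without an interactive session (only the calls already listed are used):
```shell
# On both the primary and the secondary console host:
sudo gitlab-rails runner 'puts "max Geo event id: #{Geo::EventLog.maximum(:id).inspect}"'
# On the secondary console host only:
sudo gitlab-rails runner 'puts "last processed:   #{Geo::EventLogState.last_processed.inspect}"'
```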
## Verify the integrity of replicated repositories and wikis
1. [x] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Review the numbers under the `Verification Information` tab for the
**secondary** node
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
## Perform an automated QA run against the current infrastructure
1. [x] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [x] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
## Schedule the failover
1. [x] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [x] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [x] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [x] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [x] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [x] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

2018-08-08 STAGING failover attempt: main procedure (https://dev.gitlab.org/gitlab-com/migration/-/issues/87)
# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | __TEAM_COORDINATOR__ |
| 🔪 Chef-Runner | __TEAM_CHEF_RUNNER__ |
| ☎ Comms-Handler | __TEAM_COMMS_HANDLER__ |
| 🐘 Database-Wrangler | __TEAM_DATABASE_WRANGLER__ |
| ☁ Cloud-conductor | __TEAM_CLOUD_CONDUCTOR__ |
| 🏆 Quality | __TEAM_QUALITY__ |
| ↩ Fail-back Handler (_Staging Only_) | __TEAM_FAILBACK_HANDLER__ |
| 🎩 Head Honcho (_Production Only_) | __TEAM_HEAD_HONCHO__ |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [ ] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `__MAINTENANCE_START_TIME__` & End Time: `__MAINTENANCE_END_TIME__`
- Google Working Doc: __GOOGLE_DOC_URL__ (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: __BLOG_POST_URL__
- **PRODUCTION ONLY** End Time: __MAINTENANCE_END_TIME__
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** |[Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual Rails and database console access hosts are broken during the
failover. Instead, run any shell commands on the following machines by SSHing
to them. Rails console commands should also be run on these machines: SSH in
and issue `sudo gitlab-rails console` first (see the example after the host
list).
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
* Azure: `web-01.sv.prd.gitlab.com`
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
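For example, to open a Rails console on the staging GCP side (host name taken from the list above):
```shell
ssh web-01-sv-gstg.c.gitlab-staging-1.internal
sudo gitlab-rails console
```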
# Dashboards and debugging
* These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
* Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
* Production:
* Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
* Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
* Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
* Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
* Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate the date to Google
1. [ ] ☎ {+ Comms-Handler +}: announce the date of the failover in #general on Slack and on the team call.
1. [ ] ☎ {+ Comms-Handler +}: Marketing team publish blog post about upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Marketing team sends out an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
- Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Ensure that YouTube stream will be available for Zoom call
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)`
# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours, starting an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours, starting an hour from now, with the following matcher(s):
- `environment`: `prd`
1. [ ] **PRODUCTION ONLY** 🔪 {+ Chef-Runner +}: Silence production alerts
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
* `provider`: `azure`
* `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from __MAINTENANCE_START_TIME__ on __FAILOVER_DATE__ UTC`
* Start date: now
* End date: now + 3 hours
1. [ ] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` opens an editor window; comment out the line involving the pages-sync script
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both the Azure and GCP Pages servers.
* Very manual, looks a little like the following at present:
* Staging:
```
ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
* Production:
```
ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
## Failover Call
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [ ] ☎ {+ Comms-Handler +}: make sure the YouTube stream is started
- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is starting. See you on the other side! We'll be live on YouTube`
### Monitoring
- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe0{1..9}.lb.gitlab.com fe{10..16}.lb.gitlab.com altssh0{1..2}.lb.gitlab.com`
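If `bin/hostinfo` is not available on one of the two machines, a rough DNS-only stand-in using standard tools is sketched below (staging host names shown; it does not replace the SSH and redirect checks that `hostinfo` performs):
```shell
watch -n 5 'for h in staging.gitlab.com altssh.staging.gitlab.com gstg.gitlab.com altssh.gstg.gitlab.com; do
  printf "%-30s %s\n" "$h" "$(dig +short "$h" | paste -sd, -)"
done'
```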
### Health check
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the Azure or GCP environments.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
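A quick spot-check of the block from a machine without VPN access (a sketch only; the exact redirect target and SSH failure mode may vary):
```shell
# HTTPS should answer with a redirect away from the site; SSH should not get through.
curl -sI https://staging.gitlab.com/ | head -n 5
ssh -o ConnectTimeout=5 -T git@staging.gitlab.com; echo "ssh exit status: $?"
```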
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start the purge of non-mandatory jobs, disable the Sidekiq crons, and allow Sidekiq to wind down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
* Wait for `--> Status: PROCEED`
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [ ] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Staging: https://staging.gitlab.com/admin/sidekiq/retries
* Production: https://gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this outside of the screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel (a `gitlab-ctl status` spot-check is sketched after this list)
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
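As a complement to the admin-panel check in the "Stop sidekiq in Azure" step above, a read-only spot-check from the deploy host (a sketch; staging role shown):
```shell
# Every node in the role should report sidekiq-cluster as down.
knife ssh roles:staging-base-be-sidekiq 'sudo gitlab-ctl status sidekiq-cluster'
```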
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is unresponsive
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Press "Verification Information"
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is unresponsive
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Staging: `postgres-01.db.gstg.gitlab.com`
* Production: `postgres-01-db-gprd.c.gitlab-production.internal`
* `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
* If a `geo_metrics_update` job is running, that can be ignored
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [ ] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Staging
- [ ] `staging.gitlab.com A 35.227.123.228`
- [ ] `altssh.staging.gitlab.com A 35.185.33.132`
- [ ] `*.staging.gitlab.io A 35.229.69.78`
- **DO NOT** change `staging.gitlab.io`.
* Production **UNTESTED**
- [ ] `gitlab.com A 35.231.145.151`
- [ ] `altssh.gitlab.com A 35.190.168.187`
- [ ] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
1. [ ] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=100 where name like '%gstg%'"
```
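Before shutting anything down, the update can be confirmed with a read-only query against the same table (a sketch; run on the current primary as root). The `gstg` nodes should now show priority 100:
```shell
# gitlab-psql -d gitlab_repmgr -c "select name, priority from repmgr_gitlab_cluster.repl_nodes order by priority desc"
```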
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "role:staging-base-db-postgres AND NOT fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [ ] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
- [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
- [ ] Confirm the pgbouncer node in GCP (password is in 1Password)
* Staging: `pgbouncer-01-db-gstg`
* Production: `pgbouncer-01-db-gprd`
```shell
$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
* Connect to the newly promoted primary in GCP
* `sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"`
* The result should be `F`
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
* **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [ ] Unicorn
* [ ] Sidekiq
* [ ] GitLab Pages
1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, which may be useful for a last-gasp failback
* In a Rails console in GCP:
* Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
* Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* [ ] Check that sidekiq processes show up in the GitLab admin panel
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
1. [ ] `mailers` queue
1. [ ] (`admin_emails` queue doesn't exist any more)
1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
1. [ ] Run chef-client and restart mailroom:
* `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: **PRODUCTION ONLY** Ensure the secondary can send emails
1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
1. [ ] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
- [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
* Production:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
* `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is almost complete. GitLab.com is available although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
completion more effectively
* The Pages LB proxy must be retained
* We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
* Unused machines can be switched off
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
* Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`

2018-08-08 STAGING failover attempt: preflight checks (https://dev.gitlab.org/gitlab-com/migration/-/issues/86)
# Pre-flight checks
## Dashboards and Alerts
1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the Azure or GCP environments.
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
- Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
- Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [ ] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
## GitLab Version and CDN Checks
1. [ ] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`): there have been problems in the past with auto-restarting the databases, so they now only get updated in a controlled way.
- Versions can be confirmed using the Omnibus version tracker dashboards:
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
1. [ ] 🐺 {+Coordinator+}: Ensure that the Fastly CDN IP ranges are up-to-date.
- Check the following chef roles against the official ip list https://api.fastly.com/public-ip-list
- Staging
- GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
- Production
- GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
## Object storage
1. [ ] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below, execute it first on the primary console and copy the result, then execute the same line on the secondary console with `==` and the pasted primary result appended. The comparison should return `true`; a `false` means the configurations differ and must be investigated.
1. [ ] `Gitlab.config.uploads`
1. [ ] `Gitlab.config.lfs`
1. [ ] `Gitlab.config.artifacts`
1. [ ] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
* If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
* On staging, these numbers are non-zero. Just mark as checked.
1. [ ] `Upload.with_files_stored_locally.count` # => 0
1. [ ] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
1. [ ] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
## Pre-migrated services
1. [ ] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)
## Configuration checks
1. [ ] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
* A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab.rb`
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [ ] 🐺 {+Coordinator+}: Check SSH host keys match
* Staging:
- [ ] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
* Production:
- [ ] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
1. [ ] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
* `Feature.enabled?(:geo_repository_verification)`
1. [ ] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
* 300 seconds is fine
* Staging:
- [ ] `staging.gitlab.com`
- [ ] `altssh.staging.gitlab.com`
- [ ] `gitlab-org.staging.gitlab.io`
* Production:
- [ ] `gitlab.com`
- [ ] `altssh.gitlab.com`
- [ ] `gitlab-org.gitlab.io`
1. [ ] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
* Handy script in the migration repository: `bin/check-ssl <hostname>:<port>` (a raw `openssl` fallback is sketched at the end of this section)
* Staging:
- [ ] `bin/check-ssl gstg.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
* Production:
- [ ] `bin/check-ssl gprd.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
1. [ ] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
* `chef-client role:gitlab-base pwd`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
1. [ ] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
* Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
* Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
1. Check the omnibus version running in the target environment
* Staging: `knife role show gstg-omnibus-version | grep version:`
* Production: `knife role show gprd-omnibus-version | grep version:`
   1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
* E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
      * The command can fail because the patches may have already been applied; that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
* Staging:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
* Production:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
## Ensure Geo replication is up to date
1. [ ] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
* Create a test issue on the primary and wait for it to appear on the secondary
* This should take less than 5 minutes at most
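   * The replication delay can also be read directly on the GCP database standby, using the same query the failover procedure relies on (the value should be close to zero):
```shell
sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"
```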
1. [ ] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
* `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
* `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* From a rails console: `Sidekiq::Stats.new`
* "Dead" jobs will be lost on failover but can be ignored as we routinely ignore them
* "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [ ] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://staging.gitlab.com/admin/geo_nodes
* Production: https://gitlab.com/admin/geo_nodes
* Observe the "Sync Information" tab for the secondary
* See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
* Staging: some failures and unsynced repositories are expected
1. [ ] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
* Staging: some failures and unsynced files are expected
* Production: this may fluctuate around 0 due to background upload. This is OK.
1. [ ] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
* In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other.
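   * Both values can be printed together on the secondary's console host (sketch):
```shell
sudo gitlab-rails runner '
  puts "last event generated: #{Geo::EventLog.maximum(:id).inspect}"
  puts "last event processed: #{Geo::EventLogState.last_processed.inspect}"
'
```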
## Verify the integrity of replicated repositories and wikis
1. [ ] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Review the numbers under the `Verification Information` tab for the
**secondary** node
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
## Perform an automated QA run against the current infrastructure
1. [ ] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [ ] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
## Schedule the failover
1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

https://dev.gitlab.org/gitlab-com/migration/-/issues/84 "2018-08-11 PRODUCTION failover attempt: failback" (gcp-migration-bot, 2018-08-09T18:07:23Z) **only needed for the migration effort**
# Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we
need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real"
(i.e. before opening it up to the public), it will also be super-useful to have
this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As
the GCP side will be inaccessible, returning it to operation is of secondary
importance.
This issue should not be closed until both Azure and GCP sites are in full
working order, including database replication between the two sites.
## Fail back to the Azure site
1. [ ] ↩️ {+ Fail-back Handler +}: Make the GCP environment **inaccessible** again, if necessary
1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. Production: ???
1. [x] ↩️ {+ Fail-back Handler +}: Update the DNS entries to refer to the Azure load balancer
1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
1. Staging
- [x] `staging.gitlab.com A 40.84.60.110`
- [x] `altssh.staging.gitlab.com A 104.46.121.194`
- [x] `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
1. Production
- [ ] `gitlab.com A 52.167.219.168`
- [ ] `altssh.gitlab.com A 52.167.133.162`
- [ ] `*.gitlab.io A 52.167.214.135`
1. [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
- [ ] Start the primary Azure node
```shell
azure_primary# gitlab-ctl start postgresql
```
- [ ] Remove nodes from the Azure repmgr cluster.
```shell
azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
- [ ] In a tmux or screen session on the Azure standby node, resync the database
```shell
azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
```
**Note**: This can run for several hours. Do not wait for completion.
- [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary
```shell
gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
1. [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
**Note:** Skip this if introducing a postgresql split-brain
1. [x] Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
```
1. [x] Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
1. [x] Start postgresql on the Azure staging primary node `gitlab-ctl start postgresql`
1. [x] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:
```shell
gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
* master | postgres02.db.stg.gitlab.com | | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
standby | postgres01.db.stg.gitlab.com | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
1. [x] Start Azure secondaries
* Start postgresql on the Azure staging secondary node `gitlab-ctl start postgresql`
* Verify it replicates from the primary. On the primary take a look at `SELECT * FROM pg_stat_replication` which should include the newly started secondary.
* Production: Repeat the above for other Azure secondaries. Start one after the other.
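   * For example, on the Azure primary (sketch):
```shell
# The newly started secondary should appear as a client row here.
sudo gitlab-psql -d gitlabhq_production -c "SELECT * FROM pg_stat_replication;"
```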
1. [x] ↩️ {+ Fail-back Handler +}: **Verify that the DNS update has propagated** and that the Azure site is back online
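   * A quick propagation check from outside the VPN (sketch; the answers should match the Azure addresses set in the DNS step above):
```shell
dig +short staging.gitlab.com @8.8.8.8
dig +short altssh.staging.gitlab.com @8.8.8.8
```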
1. [x] ↩️ {+ Fail-back Handler +}: Start sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
1. [x] ↩️ {+ Fail-back Handler +}: Restore the Azure Pages load-balancer configuration
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [x] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [x] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
1. [x] ↩️ {+ Fail-back Handler +}: Enable access to the azure environment from the
outside world
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
1. [ ] ↩️ {+ Fail-back Handler +}: Turn the GCP site back into a secondary
* Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
* `bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json`
1. [x] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster
1. Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
1. Re-initialize the database by running:
**Note:** This step can take over an hour. Consider running it in a screen/tmux session.
```shell
# su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
```
1. Start the database with `gitlab-ctl start postgresql`
1. Register the database with the cluster by running `gitlab-ctl repmgr standby register`
1. [x] ↩️ {+ Fail-back Handler +}: Reconfigure every changed gstg node
   1. `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
1. [x] ↩️ {+ Fail-back Handler +}: Clear cache on gstg web nodes to correct broadcast message cache
* `sudo gitlab-rake cache:clear:redis`
1. [x] ↩️ {+ Fail-back Handler +}: Restart Unicorn and Sidekiq
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
1. [ ] ↩️ {+ Fail-back Handler +}: Verify database replication is working
1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
1. [ ] ↩️ {+ Fail-back Handler +}: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
1. [ ] ↩️ {+ Fail-back Handler +}: Confirm pgbouncer is talking to the correct hosts
* `sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432`
* SQL: `SHOW DATABASES;`
1. [ ] ↩️ {+ Fail-back Handler +}: It is now safe to delete the database server snapshots
Alejandro Rodriguez

https://dev.gitlab.org/gitlab-com/migration/-/issues/82 "2018-08-07 STAGING failover attempt: main procedure" (gcp-migration-bot, 2018-08-09T12:16:33Z) **only needed for the migration effort**
# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | @nick |
| 🔪 Chef-Runner | @alejandro |
| ☎ Comms-Handler | @dawsmith |
| 🐘 Database-Wrangler | @jarv |
| ☁ Cloud-conductor | @alejandro |
| 🏆 Quality | |
| ↩ Fail-back Handler (_Staging Only_) | @alejandro |
| 🎩 Head Honcho (_Production Only_) | @edjdev |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [x] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [x] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `1300` & End Time: `1620`
- Google Working Doc: https://docs.google.com/document/d/18vGk6dQs7L0oGQOb_bNiFa5JhwLq5WBS7oNxQy09ml8/edit
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
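For example, to get a Rails console on the Azure staging side (a sketch of the flow described above):
```shell
ssh web-01.sv.stg.gitlab.com
sudo gitlab-rails console
```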
# Dashboards and debugging
* These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
* Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [-] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
# T minus 3 hours (2018-08-07) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [x] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (2018-08-07) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [x] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [x] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
- [x] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from 1300 on 2018-08-07 UTC`
* Start date: now
* End date: now + 3 hours
1. [x] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
1. [x] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [x] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Staging:
```
ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
# T minus zero (failover day) (2018-08-07) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [x] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [x] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Monitoring
- [x] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
### Health check
1. [x] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the azure or gcp environment.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [x] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
1. [x] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
1. [x] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [x] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [x] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
1. [x] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [x] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Staging: https://staging.gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [x] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [x] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [x] CI traces in Redis
* Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Press "Verification Information"
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [x] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [x] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Staging: `postgres-01.db.gstg.gitlab.com`
* `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [x] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
* Need more details?
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [x] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [x] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [x] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Staging
- [x] `staging.gitlab.com A 35.227.123.228`
- [x] `altssh.staging.gitlab.com A 35.185.33.132`
- [x] `*.staging.gitlab.io A 35.229.69.78`
- **DO NOT** change `staging.gitlab.io`.
1. [x] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=100 where name like '%gstg%'"
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "role:staging-base-db-postgres AND NOT fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [x] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail the primary over to the chosen node in GCP, and the other nodes should automatically follow.
- [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
- [ ] Confirm pgbouncer node in GCP (Password is in 1password)
* Staging: `pgbouncer-01-db-gstg`
* Production: `pgbouncer-01-db-gprd`
```shell
$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
```
1. [x] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [x] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
* Connect to the newly promoted primary in GCP
* `sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"`
* The result should be `F`
1. [x] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
1. [x] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [ ] Unicorn
* [ ] Sidekiq
* [ ] Gitlab Pages
1. [x] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
* In a Rails console in GCP:
* Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
   * [ ] Check that sidekiq processes show up in the GitLab admin panel
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05) (SKIPPED)
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07) (all production-only)
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10) (SKIPPED)
Nick Thomas

https://dev.gitlab.org/gitlab-com/migration/-/issues/81 "2018-08-07 STAGING failover attempt: preflight checks" (gcp-migration-bot, 2018-08-06T16:55:44Z) **only needed for the migration effort**
# Pre-flight checks
## Dashboards and Alerts
1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the azure or gcp environment.
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
- Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
- Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [ ] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
## GitLab Version and CDN Checks
1. [ ] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`): there have been problems in the past with auto-restarting the databases, so they now only get updated in a controlled way
- Versions can be confirmed using the Omnibus version tracker dashboards:
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
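  - A command-line spot-check of the installed package, as an alternative to the dashboards (sketch; assumes `knife` access and that the omnibus package is named `gitlab-ee`):
```shell
# Query a representative GCP web role and eyeball the reported versions;
# adjust the role query for the Azure side as needed.
bundle exec knife ssh 'roles:gstg-base-fe-web' 'dpkg-query -W gitlab-ee'
```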
1. [ ] 🐺 {+Coordinator+}: Ensure that the fastly CDN ip ranges are up-to-date.
- Check the following chef roles against the official ip list https://api.fastly.com/public-ip-list
- Staging
- GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
- Production
- GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
## Object storage
1. [ ] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below,
execute the line first on the primary console, copy the results to the clipboard, then execute the same line on the secondary console,
appending `==`, and pasting the results from the primary console. You should get a `true` or `false` value.
1. [ ] `Gitlab.config.uploads`
1. [ ] `Gitlab.config.lfs`
1. [ ] `Gitlab.config.artifacts`
1. [ ] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
* If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
* On staging, these numbers are non-zero. Just mark as checked.
1. [ ] `Upload.with_files_stored_locally.count` # => 0
1. [ ] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
1. [ ] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
## Pre-migrated services
1. [ ] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)
## Configuration checks
1. [ ] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
* A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab.rb`
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [ ] 🐺 {+Coordinator+}: Check SSH host keys match
* Staging:
- [ ] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
* Production:
- [ ] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
1. [ ] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
* `Feature.enabled?(:geo_repository_verification)`
1. [ ] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
* 300 seconds is fine
* Staging:
- [ ] `staging.gitlab.com`
- [ ] `altssh.staging.gitlab.com`
- [ ] `gitlab-org.staging.gitlab.io`
* Production:
- [ ] `gitlab.com`
- [ ] `altssh.gitlab.com`
- [ ] `gitlab-org.gitlab.io`
1. [ ] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
* Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
* Staging:
- [ ] `bin/check-ssl gstg.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
* Production:
- [ ] `bin/check-ssl gprd.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
1. [ ] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
* `chef-client role:gitlab-base pwd`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
1. [ ] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
* Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
* Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
1. Check the omnibus version running in the target environment
* Staging: `knife role show gstg-omnibus-version | grep version:`
* Production: `knife role show gprd-omnibus-version | grep version:`
   1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
* E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
      * The command can fail because the patches may have already been applied; that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
* Staging:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
* Production:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
## Ensure Geo replication is up to date
1. [ ] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
* Create a test issue on the primary and wait for it to appear on the secondary
* This should take less than 5 minutes at most
1. [ ] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
* `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
* `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* From a rails console: `Sidekiq::Stats.new`
* "Dead" jobs will be lost on failover but can be ignored as we routinely ignore them
* "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [ ] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://staging.gitlab.com/admin/geo_nodes
* Production: https://gitlab.com/admin/geo_nodes
* Observe the "Sync Information" tab for the secondary
* See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
* Staging: some failures and unsynced repositories are expected
1. [ ] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
* Staging: some failures and unsynced files are expected
* Production: this may fluctuate around 0 due to background upload. This is OK.
1. [ ] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
* In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other.
## Verify the integrity of replicated repositories and wikis
1. [ ] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Review the numbers under the `Verification Information` tab for the
**secondary** node
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
## Perform an automated QA run against the current infrastructure
1. [ ] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [ ] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
## Schedule the failover
1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

https://dev.gitlab.org/gitlab-com/migration/-/issues/80 "2018-08-04 PRODUCTION DRY RUN failover attempt: failback" (gcp-migration-bot, 2018-08-04T20:29:59Z) **only needed for the migration effort**
# Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we
need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real"
(i.e. before opening it up to the public), it will also be super-useful to have
this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As
the GCP side will be inaccessible, returning it to operation is of secondary
importance.
This issue should not be closed until both Azure and GCP sites are in full
working order, including database replication between the two sites.
## Fail back to the Azure site
1. [ ] ↩️ {+ Fail-back Handler +}: Make the GCP environment **inaccessible** again, if necessary
1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. Production: ???
1. [ ] ↩️ {+ Fail-back Handler +}: Update the DNS entries to refer to the Azure load balancer
1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
1. Staging
- [ ] `staging.gitlab.com A 40.84.60.110`
- [ ] `altssh.staging.gitlab.com A 104.46.121.194`
- [ ] `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
1. Production
- [ ] `gitlab.com A 52.167.219.168`
- [ ] `altssh.gitlab.com A 52.167.133.162`
- [ ] `*.gitlab.io A 52.167.214.135`
1. [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
- [ ] Start the primary Azure node
```shell
azure_primary# gitlab-ctl start postgresql
```
- [ ] Remove nodes from the Azure repmgr cluster.
```shell
azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
- [ ] In a tmux or screen session on the Azure standby node, resync the database
```shell
azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
```
**Note**: This can run for several hours. Do not wait for completion.
- [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary
```shell
gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
1. [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
**Note:** Skip this if introducing a postgresql split-brain
1. [ ] Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
```
1. [ ] Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
1. [ ] Start postgresql on the Azure staging primary node `gitlab-ctl start postgresql`
1. [ ] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:
```shell
gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
* master | postgres02.db.stg.gitlab.com | | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
standby | postgres01.db.stg.gitlab.com | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
1. [ ] Start Azure secondaries
* Start postgresql on the Azure staging secondary node `gitlab-ctl start postgresql`
* Verify it replicates from the primary. On the primary take a look at `SELECT * FROM pg_stat_replication` which should include the newly started secondary.
* Production: Repeat the above for other Azure secondaries. Start one after the other.
1. [ ] ↩️ {+ Fail-back Handler +}: **Verify that the DNS update has propagated** and that the Azure site is back online
1. [ ] ↩️ {+ Fail-back Handler +}: Start sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
1. [ ] ↩️ {+ Fail-back Handler +}: Restore the Azure Pages load-balancer configuration
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [ ] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [ ] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
1. [ ] ↩️ {+ Fail-back Handler +}: Enable access to the azure environment from the
outside world
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
1. [ ] ↩️ {+ Fail-back Handler +}: Turn the GCP site back into a secondary
* Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
* `bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json`
1. [ ] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal, and postgres-03-db-gstg.c.gitlab-staging-1.internal) as a standby in the repmgr cluster
1. Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
1. Re-initialize the database by running:
**Note:** This step can take over an hour. Consider running it in a screen/tmux session.
```shell
# su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
```
1. Start the database with `gitlab-ctl start postgresql`
1. Register the database with the cluster by running `gitlab-ctl repmgr standby register`
1. [ ] ↩️ {+ Fail-back Handler +}: Reconfigure every changed gstg node
    1. `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
1. [ ] ↩️ {+ Fail-back Handler +}: Clear cache on gstg web nodes to correct broadcast message cache
* `sudo gitlab-rake cache:clear:redis`
1. [ ] ↩️ {+ Fail-back Handler +}: Restart Unicorn and Sidekiq
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
1. [ ] ↩️ {+ Fail-back Handler +}: Verify database replication is working
1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
1. [ ] ↩️ {+ Fail-back Handler +}: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
1. [ ] ↩️ {+ Fail-back Handler +}: Confirm pgbouncer is talking to the correct hosts
* `sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432`
* SQL: `SHOW DATABASES;`
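   The two sub-steps above can be combined into a single non-interactive command if that is easier during the call (a sketch; same socket path and port as above):
   ```shell
   # Connect to the pgbouncer admin console and list configured databases;
   # the host column should point at the expected Azure database nodes.
   sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432 -c 'SHOW DATABASES;'
   ```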
1. [ ] ↩️ {+ Fail-back Handler +}: It is now safe to delete the database server snapshots

Ahmad Sherif
https://dev.gitlab.org/gitlab-com/migration/-/issues/78
2018-08-04 PRODUCTION DRY RUN failover attempt: main procedure (2018-08-05T15:47:59Z, gcp-migration-bot) **only needed for the migration effort**

# DRY RUN on PRODUCTION!
The intent of this DRY RUN is to test out our process as best we can on the **Production** system without negatively impacting the system or doing the actual failover. It is also a time to run any processes (such as repository verification, etc) to get the system in a state ready to be migrated.
There will be a 1 hour maintenance window.
GREMLINS NOT INVITED!
## What we want to accomplish in 1 hour
- [ ] Run through processes (blackout, queue draining, verification, etc) without actually failing over
- [ ] Time how long it takes to drain Sidekiq queues
- [ ] Repository syncing and verification
- [ ] Re-enable the system by the end of the hour, or quicker!
# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | @bwalker |
| 🔪 Chef-Runner | @ahmadsherif |
| ☎ Comms-Handler | @dawsmith |
| 🐘 Database-Wrangler | @ibaum |
| ☁ Cloud-conductor | @ahmadsherif |
| 🏆 Quality | @remy |
| ↩ Fail-back Handler (_Staging Only_) | @ahmadsherif |
| 🎩 Head Honcho (_Production Only_) | @edjdev |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [x] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [x] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `13h00` & End Time: `14h00`
- Google Working Doc: https://docs.google.com/document/d/18vGk6dQs7L0oGQOb_bNiFa5JhwLq5WBS7oNxQy09ml8/edit (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: https://about.gitlab.com/2018/07/19/gcp-move-update/
- **PRODUCTION ONLY** End Time: 14h00
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
* Azure: `web-01.sv.prd.gitlab.com`
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
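For example, to get a Rails console on the production Azure side (hostname as listed above):
```shell
# SSH to the console host, then start a Rails console as root.
ssh web-01.sv.prd.gitlab.com
sudo gitlab-rails console
```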
# Grafana dashboards
These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [x] ☎ {+ Comms-Handler +}: communicate date to Google
1. [x] ☎ {+ Comms-Handler +}: announce in #general slack and on team call date of failover.
1. [x] ☎ {+ Comms-Handler +}: Marketing team publish blog post about upcoming GCP failover
1. [x] ☎ {+ Comms-Handler +}: Marketing team sends out an email to all users notifying that GitLab.com will be undergoing scheduled maintenance. Email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
    - Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [x] ☎ {+ Comms-Handler +}: Ensure that YouTube stream will be available for Zoom call
1. [x] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from 13h00 - 14h00 UTC. Follow @gitlabstatus for more details. https://about.gitlab.com/2018/07/19/gcp-move-update/`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [x] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (2018-08-04) [📁](bin/scripts/02_failover/040_t-3h)
1. [x] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
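   If a one-off command is preferred over an interactive console, something like the following should be equivalent (a sketch run from a console host; `gitlab-rails runner` is assumed to be available there):
   ```shell
   # Set the shared-runner job timeout to 1 hour ahead of the maintenance window.
   sudo gitlab-rails runner "Ci::Runner.instance_type.update_all(maximum_timeout: 3600)"
   # Spot-check the resulting timeouts.
   sudo gitlab-rails runner "puts Ci::Runner.instance_type.distinct.pluck(:maximum_timeout).inspect"
   ```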
# T minus 1 hour (2018-08-04) [📁](bin/scripts/02_failover/050_t-1h)
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [x] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until 14h00 UTC. GitLab.com will undergo maintenance in 1 hour.`
1. [x] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [x] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [x] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
1. [x] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Production
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
- `environment`: `prd`
- `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [x] ☎ {+ Comms-Handler +}: Create a broadcast message
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 1 hour from 13:00 on 2018-08-04 UTC`
* Start date: now
* End date: now + 3 hours
1. [x] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Production: `bin/snapshot-dbs production`
1. [x] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [x] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Production:
```
ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
# T minus zero (failover day) (2018-08-04) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [x] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [x] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [x] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for testing of the migration to @GCPcloud. See you soon!`
### Monitoring
- [x] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..16}.lb.gitlab.com altssh{01..02}.lb.gitlab.com`
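  `bin/hostinfo` lives in the migration repository; if it is not to hand, a rough DNS-only substitute looks like this (it omits the SSH and redirect columns):
  ```shell
  # Resolve the key hostnames every 5 seconds; run on both a VPN and a non-VPN machine.
  watch -n 5 'for h in gitlab.com registry.gitlab.com altssh.gitlab.com gprd.gitlab.com; do printf "%s -> " "$h"; dig +short "$h" | tr "\n" " "; echo; done'
  ```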
### Health check
1. [x] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the azure or gcp environment.
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [x] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [x] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [x] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [x] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [x] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [x] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* Alternatively
* In a separate rails console on the **primary**:
* `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
1. [x] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Production: https://gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [x] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Production: https://gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [x] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis
* Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Press "Verification Information"
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. [x] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Production: `postgres-01-db-gprd.c.gitlab-production.internal`
* `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [x] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
* Need more details?
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
Since this is a DRY RUN on Production, most steps from "Phase 4: Reconfiguration, Part 1" have been removed
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
We will not be doing any QA this run
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
_No decision to be made since we're not failing over_
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Restore System to pre-DRY RUN status
- [ ] run [revised failback issue](https://dev.gitlab.org/gitlab-com/migration/issues/80) to re-enable system
## Complete the Migration (T plus 2 hours)
Since this is a DRY RUN on Production, all of "Phase 7: Restart Mailing" and "Phase 8: Reconfiguration, Part 2" have been removed.
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [x] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's test migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly.`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
Not doing QA this run
## **PRODUCTION ONLY** Post migration
Since this is a DRY RUN on Production, all of this section has been removed.
Brett Walker
https://dev.gitlab.org/gitlab-com/migration/-/issues/77
2018-08-04 PRODUCTION DRY RUN failover attempt: preflight checks (2018-08-05T15:48:03Z, gcp-migration-bot) **only needed for the migration effort**

# Pre-flight checks
## Dashboards and Alerts
1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the azure or gcp environment.
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
- Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
- Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [x] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
## GitLab Version and CDN Checks
1. [x] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`): there have been problems in the past with auto-restarting the databases, so they are now only updated in a controlled way
- Versions can be confirmed using the Omnibus version tracker dashboards:
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
1. [x] 🐺 {+Coordinator+}: Ensure that the fastly CDN ip ranges are up-to-date.
- Check the following chef roles against the official ip list https://api.fastly.com/public-ip-list
- Staging
- GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
- Production
- GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
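   One way to pull the current ranges for that comparison (assumes `curl` and `jq` on the workstation):
   ```shell
   # Fetch Fastly's published IPv4 ranges and sort them for easy diffing
   # against the list embedded in the chef role.
   curl -s https://api.fastly.com/public-ip-list | jq -r '.addresses[]' | sort
   ```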
## Object storage
1. [x] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below, execute it first on the primary console and copy the result; then, on the secondary console, execute the same line with ` == ` and the copied result appended. The comparison should return `true` (a non-interactive alternative is sketched after this list).
1. [x] `Gitlab.config.uploads`
1. [x] `Gitlab.config.lfs`
1. [x] `Gitlab.config.artifacts`
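   A non-interactive alternative, sketched under the assumption that these settings sections serialize cleanly to JSON via `gitlab-rails runner` (the file names here are illustrative):
   ```shell
   # On the PRIMARY console host: dump the relevant settings.
   sudo gitlab-rails runner 'puts({uploads: Gitlab.config.uploads.to_h, lfs: Gitlab.config.lfs.to_h, artifacts: Gitlab.config.artifacts.to_h}.to_json)' > primary-object-storage.json
   # Repeat on the SECONDARY console host (writing secondary-object-storage.json),
   # copy one file across, then diff them; any difference must be reconciled first.
   diff primary-object-storage.json secondary-object-storage.json
   ```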
1. [x] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
* If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
* On staging, these numbers are non-zero. Just mark as checked.
1. [x] `Upload.with_files_stored_locally.count` # => 0
1. [x] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
1. [x] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
## Pre-migrated services
1. [x] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)
## Configuration checks
1. [x] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
* A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
1. [x] Manually compare the diff of `/etc/gitlab/gitlab.rb`
1. [x] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [x] 🐺 {+Coordinator+}: Check SSH host keys match
* Staging:
- [ ] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
* Production:
- [x] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
- [x] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
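   Independent of the `bin/compare-host-keys` helper, the same idea can be spot-checked with standard OpenSSH tooling (a sketch; compare the fingerprints by eye):
   ```shell
   # Fingerprint the host keys served under the primary and secondary names.
   ssh-keyscan -t rsa,ecdsa,ed25519 gitlab.com 2>/dev/null | ssh-keygen -lf -
   ssh-keyscan -t rsa,ecdsa,ed25519 gprd.gitlab.com 2>/dev/null | ssh-keygen -lf -
   # For the altssh endpoints, add "-p 443" to the ssh-keyscan invocations.
   ```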
1. [x] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
* `Feature.enabled?(:geo_repository_verification)`
1. [x] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
* 300 seconds is fine
* Staging:
- [ ] `staging.gitlab.com`
- [ ] `altssh.staging.gitlab.com`
- [ ] `gitlab-org.staging.gitlab.io`
* Production:
- [x] `gitlab.com`
- [x] `altssh.gitlab.com`
- [x] `gitlab-org.gitlab.io`
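   The remaining TTL can be checked directly with `dig`; it is the second column of each answer and should be at or below 300 seconds:
   ```shell
   # Check the TTL currently served for the affected production records.
   for name in gitlab.com altssh.gitlab.com gitlab-org.gitlab.io; do
     dig +noall +answer "$name"
   done
   ```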
1. [x] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
* Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
* Staging:
- [ ] `bin/check-ssl gstg.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
* Production:
- [x] `bin/check-ssl gprd.gitlab.com:443`
- [x] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
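   A manual spot-check along the same lines, independent of the `bin/check-ssl` helper (a sketch using plain `openssl`):
   ```shell
   # Ask the secondary for the certificate it serves under the primary hostname,
   # then print its subject and validity dates; also inspect the SANs.
   echo | openssl s_client -connect gprd.gitlab.com:443 -servername gitlab.com 2>/dev/null \
     | openssl x509 -noout -subject -dates
   echo | openssl s_client -connect gprd.gitlab.com:443 -servername gitlab.com 2>/dev/null \
     | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
   ```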
1. [ ] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
   * `bundle exec knife ssh "role:gitlab-base" pwd`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
1. [ ] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
* Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
* Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
1. Check the omnibus version running in the target environment
* Staging: `knife role show gstg-omnibus-version | grep version:`
* Production: `knife role show gprd-omnibus-version | grep version:`
   1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
* E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
      * The command can fail because the patches may already have been applied; that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
* Staging:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
* Production:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [x] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
## Ensure Geo replication is up to date
1. [x] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
* Create a test issue on the primary and wait for it to appear on the secondary
* This should take less than 5 minutes at most
1. [x] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
* `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
* `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* From a rails console: `Sidekiq::Stats.new`
* "Dead" jobs will be lost on failover but can be ignored as we routinely ignore them
* "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [x] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://staging.gitlab.com/admin/geo_nodes
* Production: https://gitlab.com/admin/geo_nodes
* Observe the "Sync Information" tab for the secondary
* See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
* Staging: some failures and unsynced repositories are expected
1. [x] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
* Staging: some failures and unsynced files are expected
* Production: this may fluctuate around 0 due to background upload. This is OK.
1. [x] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
* In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other.
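   The same numbers can be pulled non-interactively from the respective console hosts (a sketch mirroring the console commands above):
   ```shell
   # Primary console host: highest Geo event generated so far.
   sudo gitlab-rails runner 'puts Geo::EventLog.maximum(:id)'
   # Secondary console host: highest event seen and the last one processed.
   sudo gitlab-rails runner 'puts Geo::EventLog.maximum(:id); puts Geo::EventLogState.last_processed.inspect'
   ```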
## Verify the integrity of replicated repositories and wikis
1. [x] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Review the numbers under the `Verification Information` tab for the
**secondary** node
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
## Perform an automated QA run against the current infrastructure
1. [ ] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [ ] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
## Schedule the failover
1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +} and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

Brett Walker
https://dev.gitlab.org/gitlab-com/migration/-/issues/76
IGNORE TEST. 2018-08-04 PRODUCTION failover attempt: failback (2018-08-03T19:48:54Z, gcp-migration-bot) **only needed for the migration effort**

# Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we
need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real"
(i.e. before opening it up to the public), it will also be super-useful to have
this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As
the GCP side will be inaccessible, returning it to operation is of secondary
importance.
This issue should not be closed until both Azure and GCP sites are in full
working order, including database replication between the two sites.
## Fail back to the Azure site
1. [ ] ↩️ {+ Fail-back Handler +}: Make the GCP environment **inaccessible** again, if necessary
1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. Production: ???
1. [ ] ↩️ {+ Fail-back Handler +}: Update the DNS entries to refer to the Azure load balancer
1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
1. Staging
- [ ] `staging.gitlab.com A 40.84.60.110`
- [ ] `altssh.staging.gitlab.com A 104.46.121.194`
- [ ] `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
1. Production
- [ ] `gitlab.com A 52.167.219.168`
- [ ] `altssh.gitlab.com A 52.167.133.162`
- [ ] `*.gitlab.io A 52.167.214.135`
1. [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
- [ ] Start the primary Azure node
```shell
azure_primary# gitlab-ctl start postgresql
```
- [ ] Remove nodes from the Azure repmgr cluster.
```shell
azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
- [ ] In a tmux or screen session on the Azure standby node, resync the database
```shell
azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
```
**Note**: This can run for several hours. Do not wait for completion.
- [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary
```shell
gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
1. [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
**Note:** Skip this if introducing a postgresql split-brain
1. [ ] Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
```
1. [ ] Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
1. [ ] Start postgresql on the Azure staging primary node `gitlab-ctl start postgresql`
1. [ ] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:
```shell
gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
* master | postgres02.db.stg.gitlab.com | | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
standby | postgres01.db.stg.gitlab.com | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
1. [ ] Start Azure secondaries
* Start postgresql on the Azure staging secondary node `gitlab-ctl start postgresql`
* Verify it replicates from the primary. On the primary take a look at `SELECT * FROM pg_stat_replication` which should include the newly started secondary.
* Production: Repeat the above for other Azure secondaries. Start one after the other.
1. [ ] ↩️ {+ Fail-back Handler +}: **Verify that the DNS update has propagated** and the Azure site is back online
1. [ ] ↩️ {+ Fail-back Handler +}: Start sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
1. [ ] ↩️ {+ Fail-back Handler +}: Restore the Azure Pages load-balancer configuration
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [ ] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [ ] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
1. [ ] ↩️ {+ Fail-back Handler +}: Enable access to the azure environment from the
outside world
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
1. [ ] ↩️ {+ Fail-back Handler +}: Turn the GCP site back into a secondary
* Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
* `bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json`
1. [ ] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal, and postgres-03-db-gstg.c.gitlab-staging-1.internal) as a standby in the repmgr cluster
1. Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
1. Re-initialize the database by running:
**Note:** This step can take over an hour. Consider running it in a screen/tmux session.
```shell
# su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
```
1. Start the database with `gitlab-ctl start postgresql`
1. Register the database with the cluster by running `gitlab-ctl repmgr standby register`
1. [ ] ↩️ {+ Fail-back Handler +}: Reconfigure every changed gstg node
    1. `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
1. [ ] ↩️ {+ Fail-back Handler +}: Clear cache on gstg web nodes to correct broadcast message cache
* `sudo gitlab-rake cache:clear:redis`
1. [ ] ↩️ {+ Fail-back Handler +}: Restart Unicorn and Sidekiq
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
1. [ ] ↩️ {+ Fail-back Handler +}: Verify database replication is working
1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
1. [ ] ↩️ {+ Fail-back Handler +}: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
1. [ ] ↩️ {+ Fail-back Handler +}: Confirm pgbouncer is talking to the correct hosts
* `sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432`
* SQL: `SHOW DATABASES;`
1. [ ] ↩️ {+ Fail-back Handler +}: It is now safe to delete the database server snapshots

https://dev.gitlab.org/gitlab-com/migration/-/issues/74
IGNORE TEST. 2018-08-04 PRODUCTION failover attempt: main procedure (2018-08-03T19:49:29Z, gcp-migration-bot) **only needed for the migration effort**

# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | __TEAM_COORDINATOR__ |
| 🔪 Chef-Runner | __TEAM_CHEF_RUNNER__ |
| ☎ Comms-Handler | __TEAM_COMMS_HANDLER__ |
| 🐘 Database-Wrangler | __TEAM_DATABASE_WRANGLER__ |
| ☁ Cloud-conductor | __TEAM_CLOUD_CONDUCTOR__ |
| 🏆 Quality | __TEAM_QUALITY__ |
| ↩ Fail-back Handler (_Staging Only_) | __TEAM_FAILBACK_HANDLER__ |
| 🎩 Head Honcho (_Production Only_) | __TEAM_HEAD_HONCHO__ |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [ ] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `__MAINTENANCE_START_TIME__` & End Time: `__MAINTENANCE_END_TIME__`
- Google Working Doc: __GOOGLE_DOC_URL__ (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: __BLOG_POST_URL__
- **PRODUCTION ONLY** End Time: __MAINTENANCE_END_TIME__
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
* Azure: `web-01.sv.prd.gitlab.com`
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
# Grafana dashboards
These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google
1. [ ] ☎ {+ Comms-Handler +}: announce in #general slack and on team call date of failover.
1. [ ] ☎ {+ Comms-Handler +}: Marketing team publish blog post about upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Marketing team sends out an email to all users notifying that GitLab.com will be undergoing scheduled maintenance. Email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
    - Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Ensure that YouTube stream will be available for Zoom call
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
- `environment`: `prd`
- `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from XX:XX on 2018-XX-YY UTC`
* Start date: now
* End date: now + 3 hours
1. [ ] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Staging:
```
ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
* Production:
```
ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
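A rough way to sanity-check the sync is to compare top-level entry counts on both pages NFS servers; a sketch only, assuming the staging hosts above and that your user can sudo on both sides (it does not prove file-level consistency):
```shell
# Hypothetical spot check (staging shown): compare the number of top-level pages
# directories on the Azure source and the GCP destination. A large discrepancy
# suggests the rsync has not finished or has failed part-way.
ssh 10.133.2.161 'sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | wc -l'
ssh git@pages.stor.gstg.gitlab.net 'sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | wc -l'
```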
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`
### Monitoring
- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..16}.lb.gitlab.com altssh{01..02}.lb.gitlab.com`
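If `bin/hostinfo` is not available on one of the machines, a rough stand-in can be improvised with standard tools; an illustrative sketch only, it does not reproduce the script's real output columns:
```shell
# Hypothetical bin/hostinfo stand-in (staging hostnames shown): resolve each name
# and probe HTTPS so the DNS cutover and the block/redirect can be watched.
for host in staging.gitlab.com altssh.staging.gitlab.com gstg.gitlab.com; do
  ip=$(dig +short "$host" | tail -n1)
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 "https://$host/" || true)
  echo "$host -> ${ip:-unresolved} (HTTP $code)"
done
```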
### Health check
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the azure or gcp environment.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
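The block can also be confirmed directly from the non-VPN machine; a minimal sketch, assuming HTTPS now redirects to the migration blog post and SSH no longer answers:
```shell
# Hypothetical manual verification from a non-VPN host (staging shown).
# Expect a redirect Location header pointing at the blog post, and the SSH probe to fail.
curl -sI --max-time 10 https://staging.gitlab.com/ | grep -iE '^(HTTP|location)'
ssh -o ConnectTimeout=5 -T git@staging.gitlab.com 2>&1 | head -n 1
```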
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
* In a separate terminal on the deploy host: `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [ ] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Staging: https://staging.gitlab.com/admin/sidekiq/retries
* Production: https://gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
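As a final cross-check that the primary is quiesced, the Sidekiq counters can be read from the Azure console host listed at the top of this issue; a sketch only:
```shell
# Illustrative drain check on the (old) primary: expect all three numbers to be 0
# once sidekiq-cluster is stopped and the queues and retries have emptied.
sudo gitlab-rails runner '
  stats = Sidekiq::Stats.new
  puts "enqueued:  #{stats.enqueued}"
  puts "retries:   #{stats.retry_size}"
  puts "processes: #{Sidekiq::ProcessSet.new.size}"
'
```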
## Finish replicating and verifying all data
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis
* Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Press "Verification Information"
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Staging: `postgres-01.db.gstg.gitlab.com`
* Production: `postgres-01-db-gprd.c.gitlab-production.internal`
* `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
* Need more details?
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands set off above should be completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [ ] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Staging
- [ ] `staging.gitlab.com A 35.227.123.228`
- [ ] `altssh.staging.gitlab.com A 35.185.33.132`
- [ ] `*.staging.gitlab.io A 35.229.69.78`
- **DO NOT** change `staging.gitlab.io`.
* Production **UNTESTED**
- [ ] `gitlab.com A 35.231.145.151`
- [ ] `altssh.gitlab.com A 35.190.168.187`
- [ ] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
1. [ ] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=100 where name like '%gstg%'"
```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "role:staging-base-db-postgres AND NOT fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [ ] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
- [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
- [ ] Confirm pgbouncer node in GCP (Password is in 1password)
```shell
$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
* Connect to the newly promoted primary in GCP
* `sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"`
* The result should be `F`
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
* **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [ ] Unicorn
* [ ] Sidekiq
* [ ] Gitlab Pages
1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
* In a Rails console in GCP:
* Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
* Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* [ ] Check that sidekiq processes show up in the GitLab admin panel
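If the admin panel is slow to load, the same check can be scripted from the GCP console host; a sketch only, assuming the sidekiq-cron `enabled?` predicate:
```shell
# Illustrative post-restart check on the promoted secondary: expect a non-zero
# process count and zero cron jobs left in the disabled state.
sudo gitlab-rails runner '
  puts "sidekiq processes: #{Sidekiq::ProcessSet.new.size}"
  puts "disabled crons:    #{Sidekiq::Cron::Job.all.reject(&:enabled?).count}"
'
```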
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
1. [ ] `mailers` queue
1. [ ] (`admin_emails` queue doesn't exist any more)
1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
1. [ ] Run chef-client and restart mailroom:
* `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: **PRODUCTION ONLY** Ensure the secondary can send emails
1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
1. [ ] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
- [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
* Production:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
* `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
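Once the environment is open to the outside world and the runner timeout is restored, a quick sanity check from a machine outside the VPN is worthwhile; an illustrative sketch, not part of the formal checklist:
```shell
# Hypothetical external reachability check (production hostname shown): expect a
# normal response (200 or a redirect to the sign-in page) rather than a block.
curl -sI --max-time 10 https://gitlab.com/users/sign_in | head -n 1
```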
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA**: this is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
* The Pages LB proxy must be retained
* We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
* Unused machines can be switched off
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
* Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`

https://dev.gitlab.org/gitlab-com/migration/-/issues/73
IGNORE TEST. 2018-08-04 PRODUCTION failover attempt: preflight checks (gcp-migration-bot, 2018-08-03T19:49:42Z) **only needed for the migration effort**
# Pre-flight checks
## Dashboards and Alerts
1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the azure or gcp environment.
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
- Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
- Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [ ] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
## GitLab Version and CDN Checks
1. [ ] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`), as there have been problems in the past with auto-restarting the databases; they now only get updated in a controlled way
- Versions can be confirmed using the Omnibus version tracker dashboards:
- Staging
- GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
- Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
- Production
- GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
- Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
1. [ ] 🐺 {+Coordinator+}: Ensure that the fastly CDN ip ranges are up-to-date.
- Check the following chef roles against the official ip list https://api.fastly.com/public-ip-list
- Staging
- GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
- Production
- GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
## Object storage
1. [ ] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below,
execute the line first on the primary console, copy the results to the clipboard, then execute the same line on the secondary console,
appending ` == ` and pasting the results from the primary console. The comparison should evaluate to `true`.
1. [ ] `Gitlab.config.uploads`
1. [ ] `Gitlab.config.lfs`
1. [ ] `Gitlab.config.artifacts`
1. [ ] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
* If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
* On staging, these numbers are non-zero. Just mark as checked.
1. [ ] `Upload.with_files_stored_locally.count` # => 0
1. [ ] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
1. [ ] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
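The three counts can be gathered in one go from a console host, which also makes it easy to run the identical snippet on primary and secondary and compare; an illustrative sketch using only the scopes listed above:
```shell
# Illustrative one-shot check of locally stored files. Expect 0 / ~13 / 0 on
# production; on staging the numbers are non-zero and can be ignored.
sudo gitlab-rails runner '
  puts "uploads stored locally:       #{Upload.with_files_stored_locally.count}"
  puts "lfs objects stored locally:   #{LfsObject.with_files_stored_locally.count}"
  puts "job artifacts stored locally: #{Ci::JobArtifact.with_files_stored_locally.count}"
'
```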
## Pre-migrated services
1. [ ] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)
## Configuration checks
1. [ ] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
* A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab.rb`
1. [ ] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [ ] 🐺 {+Coordinator+}: Check SSH host keys match
* Staging:
- [ ] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
* Production:
- [ ] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
- [ ] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
1. [ ] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
* `Feature.enabled?(:geo_repository_verification)`
1. [ ] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
* 300 seconds is fine
* Staging:
- [ ] `staging.gitlab.com`
- [ ] `altssh.staging.gitlab.com`
- [ ] `gitlab-org.staging.gitlab.io`
* Production:
- [ ] `gitlab.com`
- [ ] `altssh.gitlab.com`
- [ ] `gitlab-org.gitlab.io`
1. [ ] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
* Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
* Staging:
- [ ] `bin/check-ssl gstg.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
* Production:
- [ ] `bin/check-ssl gprd.gitlab.com:443`
- [ ] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
1. [ ] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
* `chef-client role:gitlab-base pwd`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
1. [ ] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
* Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
* Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
1. Check the omnibus version running in the target environment
* Staging: `knife role show gstg-omnibus-version | grep version:`
* Production: `knife role show gprd-omnibus-version | grep version:`
1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
* E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
* The command can fail because the patches may have already been applied, that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
* Staging:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
* Production:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
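One way to run this across every database node at once is via knife, mirroring the other fleet-wide commands in this issue; a sketch, assuming the staging role names used elsewhere in this document (substitute the production roles as appropriate):
```shell
# Hypothetical fleet-wide check (staging shown): every node should print the same
# cluster topology, with exactly one master and no FAILED entries.
bundle exec knife ssh 'role:staging-base-db-postgres OR role:gstg-base-db-postgres' \
  'sudo gitlab-ctl repmgr cluster show'
```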
## Ensure Geo replication is up to date
1. [ ] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
* Create a test issue on the primary and wait for it to appear on the secondary
* This should take less than 5 minutes at most
1. [ ] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
* `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
* `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* From a rails console: `Sidekiq::Stats.new`
* "Dead" jobs will be lost on failover but can be ignored as we routinely ignore them
* "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [ ] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://staging.gitlab.com/admin/geo_nodes
* Production: https://gitlab.com/admin/geo_nodes
* Observe the "Sync Information" tab for the secondary
* See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
* Staging: some failures and unsynced repositories are expected
1. [ ] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
* Staging: some failures and unsynced files are expected
* Production: this may fluctuate around 0 due to background upload. This is OK.
1. [ ] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
* In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other.
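The "within 10,000" comparison can be made in one place on the secondary console host, since the event log is replicated there; a sketch only, assuming `Geo::EventLogState.last_processed` returns a record exposing `event_id`:
```shell
# Illustrative lag check on the secondary: prints the gap between the newest
# replicated Geo event and the cursor. A gap above ~10,000 means the secondary
# is too far behind to proceed.
sudo gitlab-rails runner '
  newest = Geo::EventLog.maximum(:id) || 0
  cursor = Geo::EventLogState.last_processed&.event_id || 0
  puts "newest event: #{newest}, cursor: #{cursor}, gap: #{newest - cursor}"
'
```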
## Verify the integrity of replicated repositories and wikis
1. [ ] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Review the numbers under the `Verification Information` tab for the
**secondary** node
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
## Perform an automated QA run against the current infrastructure
1. [ ] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [ ] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
## Schedule the failover
1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +} and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

https://dev.gitlab.org/gitlab-com/migration/-/issues/72
2018-08-02 STAGING failover attempt: failback (gcp-migration-bot, 2018-08-03T19:10:24Z) **only needed for the migration effort**
# Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we
need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real"
(i.e. before opening it up to the public), it will also be super-useful to have
this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As
the GCP side will be inaccessible, returning it to operation is of secondary
importance.
This issue should not be closed until both Azure and GCP sites are in full
working order, including database replication between the two sites.
## Fail back to the Azure site
1. [x] ↩️ {+ Fail-back Handler +}: Make the GCP environment **inaccessible** again, if necessary
1. Staging: Undo https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
1. Production: ???
1. [x] ↩️ {+ Fail-back Handler +}: Update the DNS entries to refer to the Azure load balancer
1. Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
1. Staging
- [x] `staging.gitlab.com A 40.84.60.110`
- [x] `altssh.staging.gitlab.com A 104.46.121.194`
- [x] `*.staging.gitlab.io CNAME pages01.stg.gitlab.com`
1. Production
- [ ] `gitlab.com A 52.167.219.168`
- [ ] `altssh.gitlab.com A 52.167.133.162`
- [ ] `*.gitlab.io A 52.167.214.135`
1. [ ] OPTIONAL: Split the postgresql cluster into two separate clusters. Only do this if you want to continue using the GCP site as a primary post-failback.
- [ ] Start the primary Azure node
```shell
azure_primary# gitlab-ctl start postgresql
```
- [ ] Remove nodes from the Azure repmgr cluster.
```shell
azure_primary# for nid in 895563110 1700935732 1681417267; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
- [ ] In a tmux or screen session on the Azure standby node, resync the database
```shell
azure_standby# PGSSLMODE=disable gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w
```
**Note**: This can run for several hours. Do not wait for completion.
- [ ] Remove Azure nodes from the GCP cluster by running this on the GCP primary
```shell
gstg_primary# for nid in 895563110 912536887 ; do gitlab-ctl repmgr standby unregister --node=${nid}; done
```
1. [ ] Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
**Note:** Skip this if introducing a postgresql split-brain
1. [x] Ensure that repmgr priorities for GCP are -1. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=-1 where name like '%gstg%'"
```
1. [x] Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
1. [x] Start postgresql on the Azure staging primary node `gitlab-ctl start postgresql`
1. [x] Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:
```shell
gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------
* master | postgres02.db.stg.gitlab.com | | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
FAILED | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
standby | postgres01.db.stg.gitlab.com | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
1. [x] Start Azure secondaries
* Start postgresql on the Azure staging secondary node `gitlab-ctl start postgresql`
* Verify it replicates from the primary. On the primary take a look at `SELECT * FROM pg_stat_replication` which should include the newly started secondary.
* Production: Repeat the above for other Azure secondaries. Start one after the other.
1. [x] ↩️ {+ Fail-back Handler +}: **Verify that the DNS update has propagated** and the Azure site is back online
1. [x] ↩️ {+ Fail-back Handler +}: Start sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
1. [x] ↩️ {+ Fail-back Handler +}: Restore the Azure Pages load-balancer configuration
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [x] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [x] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
1. [x] ↩️ {+ Fail-back Handler +}: Enable access to the azure environment from the
outside world
* Staging: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Production: Revert https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
## Restore the GCP site to being a working secondary
1. [x] ↩️ {+ Fail-back Handler +}: Turn the GCP site back into a secondary
* Undo the chef-repo changes. If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
* `bundle exec knife role from file roles/gprd-base-fe-web.json roles/gprd-base.json`
1. [x] Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster
1. Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
1. Re-initialize the database by running:
**Note:** This step can take over an hour. Consider running it in a screen/tmux session.
```shell
# su gitlab-psql -c "PGSSLMODE=disable /opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
```
1. Start the database with `gitlab-ctl start postgresql`
1. Register the database with the cluster by running `gitlab-ctl repmgr standby register`
1. [x] ↩️ {+ Fail-back Handler +}: Reconfigure every changed gstg node
1. `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
1. [x] ↩️ {+ Fail-back Handler +}: Clear cache on gstg web nodes to correct broadcast message cache
* `sudo gitlab-rake cache:clear:redis`
1. [x] ↩️ {+ Fail-back Handler +}: Restart Unicorn and Sidekiq
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_unicorn_enable:true' 'sudo gitlab-ctl restart unicorn'`
* `bundle exec knife ssh 'roles:gstg-base AND omnibus-gitlab_gitlab_rb_sidekiq-cluster_enable:true' 'sudo gitlab-ctl restart sidekiq-cluster'`
1. [x] ↩️ {+ Fail-back Handler +}: Verify database replication is working
1. Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
1. [x] ↩️ {+ Fail-back Handler +}: Verify https://gstg.gitlab.com reports it is a secondary in the blue banner on top
1. [x] ↩️ {+ Fail-back Handler +}: Confirm pgbouncer is talking to the correct hosts
* `sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/pgbouncer -U pgbouncer -d pgbouncer -p 6432`
* SQL: `SHOW DATABASES;`
1. [x] ↩️ {+ Fail-back Handler +}: It is now safe to delete the database server snapshots

https://dev.gitlab.org/gitlab-com/migration/-/issues/70
2018-08-02 STAGING failover attempt: main procedure (gcp-migration-bot, 2018-08-02T21:52:31Z, assignee: Alejandro Rodriguez) **only needed for the migration effort**
# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | @nick |
| 🔪 Chef-Runner | @alejandro |
| ☎ Comms-Handler | @dawsmith |
| 🐘 Database-Wrangler | @jarv |
| ☁ Cloud-conductor | @alejandro |
| 🏆 Quality | @remy |
| ↩ Fail-back Handler (_Staging Only_) | @alejandro |
| 🎩 Head Honcho (_Production Only_) | @edjdev |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [x] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `1400` & End Time: `1600`
- Google Working Doc: https://docs.google.com/document/d/18vGk6dQs7L0oGQOb_bNiFa5JhwLq5WBS7oNxQy09ml8/edit (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: https://about.gitlab.com/2018/07/19/gcp-move-update/
- **PRODUCTION ONLY** End Time: 1600
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual rails and database console access hosts are broken during the
failover. Any shell commands should, instead, be run on the following
machines by SSHing to them. Rails console commands should also be run on these
machines, by SSHing to them and issuing a `sudo gitlab-rails console` command
first.
* Staging:
* Azure: `web-01.sv.stg.gitlab.com`
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
* Azure: `web-01.sv.prd.gitlab.com`
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
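For example, to get a Rails console on the GCP staging side during the blackout (using the host listed above):
```shell
# Example only: shell access and a Rails console on the GCP staging console host.
ssh web-01-sv-gstg.c.gitlab-staging-1.internal
sudo gitlab-rails console
```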
# Grafana dashboards
These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [x] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google
1. [ ] ☎ {+ Comms-Handler +}: announce the failover date in #general on Slack and on the team call.
1. [ ] ☎ {+ Comms-Handler +}: Marketing team publish blog post about upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Marketing team sends out an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
- Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Ensure that YouTube stream will be available for Zoom call
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from 1400 - 1600 UTC. Follow @gitlabstatus for more details. https://about.gitlab.com/2018/07/19/gcp-move-update/`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD)
1. [x] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `./bin/azure/02_failover/t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `./bin/azure/02_failover/t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (Date TBD)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [x] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (Date TBD)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until 1600 UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: https://docs.google.com/document/d/18vGk6dQs7L0oGQOb_bNiFa5JhwLq5WBS7oNxQy09ml8/edit`
1. [x] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `./bin/azure/02_failover/t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
1. [x] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
- `environment`: `prd`
- `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [x] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from XX:XX on 2018-XX-YY UTC`
* Start date: now
* End date: now + 3 hours
1. [x] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [x] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [x] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Before you run the commands below, ensure that the ssh key used to ssh to the pages VMs is in your ssh-agent:
```
ssh-add -l # to list keys
ssh-add path/to/ssh/key # if you do not have the key loaded
```
* Staging:
```
ssh 10.133.2.161
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
* Production:
```
ssh -A 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 15 -n 1 sudo SSH_AUTH_SOCK=$SSH_AUTH_SOCK rsync -avh -e "ssh -oCompression=no" --rsync-path="sudo rsync" /var/opt/gitlab/gitlab-rails/shared/pages/{} $USER@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
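Before the window starts, it is worth confirming on the **Azure** pages NFS server that the parallel sync kicked off above has finished. A minimal sketch (the directory count is only a rough sanity check; run the same count on the GCP pages server and compare):
```shell
# Any rsync workers still running? An empty result suggests the sync is done.
pgrep -af 'rsync -avh' || echo "no rsync processes running"

# Rough comparison point: number of top-level pages directories on this side.
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | wc -l
```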
# T minus zero (failover day) (Date TBD)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`
### Monitoring
- [x] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..16}.lb.gitlab.com altssh{01..02}.lb.gitlab.com`
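If `bin/hostinfo` is not available on one of the two machines, plain `dig` gives a rough substitute for the DNS part of the check (a sketch for production; substitute the staging hostnames as needed, and note it does not cover the SSH/redirect columns):
```shell
# Resolve the key records every 5 seconds; compare the output on- and off-VPN.
watch -n 5 'for h in gitlab.com altssh.gitlab.com registry.gitlab.com gitlab-org.gitlab.io; do
  printf "%-28s %s\n" "$h" "$(dig +short "$h" | paste -sd, -)"
done'
```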
### Health check
1. [x] 🐺 {+ Coordinator +}: Ensure that there are no active alerts in the Azure or GCP environments.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary
1. [x] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [x] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [x] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
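From a machine without VPN access, a quick spot-check that the block is in effect (a sketch; the exact status code and redirect target depend on the HAProxy rules applied above):
```shell
# HTTPS: expect a redirect towards the migration blog post, not the application.
curl -sS -o /dev/null -D - https://gitlab.com/ | grep -iE '^(HTTP|Location)'

# SSH: expect the connection to be refused or to time out quickly.
timeout 5 ssh -o BatchMode=yes -o ConnectTimeout=5 git@gitlab.com 2>&1 | tail -n 1
```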
#### Phase 2: Commence Shutdown in Azure
1. [x] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
* Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `knife ssh roles:gitlab-base-be-sidekiq-pullmirror "sudo gitlab-ctl stop sidekiq-cluster"`
1. [x] 🐺 {+ Coordinator +}: Disable Sidekiq crons that may cause updates on the primary
* In a separate rails console on the **primary**:
* `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [x] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
* Press `Queues -> Live Poll`
* Wait for all queues not mentioned above to reach 0
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* On staging, the repository verification queue may not empty
1. [x] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
* Staging: https://staging.gitlab.com/admin/sidekiq/retries
* Production: https://gitlab.com/admin/sidekiq/retries
* **NOTE**: This tab may contain confidential information. Do this out of screen capture!
* Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
* Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
* Press "Retry All" to attempt to retry all remaining jobs immediately
* Repeat until 0 retries are present
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
* Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
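A shell-level double check that no sidekiq-cluster processes survive on the Azure fleet (a sketch using the same knife roles as above; every node should report the sidekiq service as `down`):
```shell
# Staging
knife ssh roles:staging-base-be-sidekiq 'sudo gitlab-ctl status | grep -i sidekiq'
# Production
knife ssh roles:gitlab-base-be-sidekiq 'sudo gitlab-ctl status | grep -i sidekiq'
```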
At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining
1. [x] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [x] CI traces in Redis
* Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is misbehaving
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
1. [x] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Press "Verification Information"
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is misbehaving
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [x] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [x] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
* Staging: `postgres-01.db.gstg.gitlab.com`
* Production: `postgres-01-db-gprd.c.gitlab-production.internal`
* `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
1. [x] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [x] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* Press `Queues -> Live Poll`
* Wait for all queues to reach 0, excepting `emails_on_push` and `mailers` (which are disabled)
* Wait for the number of `Enqueued` and `Busy` jobs to reach 0
* Staging: Some jobs (e.g., `file_download_dispatch_worker`) may refuse to exit. They can be safely ignored.
1. [x] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
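As a final spot-check before promotion, the two checks above can be repeated from the shell on the prospective GCP primary (a sketch):
```shell
# Replication lag on the failover target: should be close to 0 seconds.
sudo gitlab-psql -d gitlabhq_production \
  -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

# Geo replication/verification summary, as an alternative to the admin UI.
sudo gitlab-rake geo:status
```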
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1
1. [x] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [x] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
* The incremental `rsync` commands started above should have completed by now
* If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [x] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
* Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
* Staging
- [x] `staging.gitlab.com A 35.227.123.228`
- [x] `altssh.staging.gitlab.com A 35.185.33.132`
- [x] `*.staging.gitlab.io A 35.229.69.78`
- **DO NOT** change `staging.gitlab.io`.
* Production **UNTESTED**
- [ ] `gitlab.com A 35.231.145.151`
- [ ] `altssh.gitlab.com A 35.190.168.187`
- [ ] `*.gitlab.io A 35.185.44.232`
- **DO NOT** change `gitlab.io`.
1. [x] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
# gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=100 where name like '%gstg%'"
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything in place; just ensure the instances are turned off
```shell
$ knife ssh "role:staging-base-db-postgres AND NOT fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [x] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off
```shell
$ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
```
1. [x] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
- [x] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
- [x] Confirm pgbouncer node in GCP (Password is in 1password)
```shell
$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
```
1. [x] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [x] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
* Connect to the newly promoted primary in GCP
* `sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"`
* The result should be `f` (the new primary is no longer in recovery)
1. [x] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [x] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
* **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [x] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [ ] Unicorn
* [ ] Sidekiq
* [ ] Gitlab Pages
1. [x] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
* In a Rails console in GCP:
* Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
* Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [x] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [x] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [x] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* [ ] Check that sidekiq processes show up in the GitLab admin panel
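Two quick shell checks at this point (a sketch): that sidekiq-cluster is running on the GCP fleet, and that public DNS already resolves to the GCP load-balancer addresses recorded in the DNS step above (substitute the staging names as appropriate):
```shell
# Every GCP sidekiq node should report the sidekiq service as "run".
knife ssh roles:gstg-base-be-sidekiq 'sudo gitlab-ctl status | grep -i sidekiq'   # staging
knife ssh roles:gprd-base-be-sidekiq 'sudo gitlab-ctl status | grep -i sidekiq'   # production

# DNS should now return the GCP addresses listed earlier.
dig +short gitlab.com altssh.gitlab.com
```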
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1
The details of the QA tasks are listed in the test plan document.
- [x] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing
1. [x] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
1. [ ] `mailers` queue
1. [ ] (`admin_emails` queue doesn't exist any more)
1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
1. [ ] Run chef-client and restart mailroom:
* `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: **PRODUCTION ONLY** Ensure the secondary can send emails
1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
1. [ ] Ensure you receive the email
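The same test email can be sent non-interactively from a Rails node, which makes it easy to repeat (a sketch; substitute your own address):
```shell
# Uses the same helper as the console command above.
sudo gitlab-rails runner \
  'Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now'
```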
#### Phase 8: Reconfiguration, Part 2
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
- [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
* Production:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
* `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
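To confirm the timeout change took effect (a sketch, mirroring the T minus 3 hours step):
```shell
# Expect a single value: 10800 (3 hours).
sudo gitlab-rails runner \
  'puts Ci::Runner.instance_type.distinct.pluck(:maximum_timeout).inspect'
```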
#### Phase 9: Communicate
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2
1. **Start After-Blackout QA**: this is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
* The Pages LB proxy must be retained
* We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
* Unused machines can be switched off
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
* Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`