# Failover Team
| Role | Assigned To |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator | __TEAM_COORDINATOR__ |
| 🔪 Chef-Runner | __TEAM_CHEF_RUNNER__ |
| ☎ Comms-Handler | __TEAM_COMMS_HANDLER__ |
| 🐘 Database-Wrangler | __TEAM_DATABASE_WRANGLER__ |
| ☁ Cloud-conductor | __TEAM_CLOUD_CONDUCTOR__ |
| 🏆 Quality | __TEAM_QUALITY__ |
| ↩ Fail-back Handler (_Staging Only_) | __TEAM_FAILBACK_HANDLER__ |
| 🎩 Head Honcho (_Production Only_) | __TEAM_HEAD_HONCHO__ |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately
Perform these steps when the issue is created.
- [ ] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
- Start Time: `__MAINTENANCE_START_TIME__` & End Time: `__MAINTENANCE_END_TIME__`
- Google Working Doc: __GOOGLE_DOC_URL__ (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
- **PRODUCTION ONLY** Blog Post: __BLOG_POST_URL__
- **PRODUCTION ONLY** End Time: __MAINTENANCE_END_TIME__
# Support Options
| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging
```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production
```mermaid
graph TD;
postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
postgres02a --> postgres03a["postgres-03.db.prd"];
postgres02a --> postgres04a["postgres-04.db.prd"];
postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
postgres01g --> postgres02g["postgres-02-db-gprd"];
postgres01g --> postgres03g["postgres-03-db-gprd"];
postgres01g --> postgres04g["postgres-04-db-gprd"];
```
# Console hosts
The usual Rails and database console access hosts are unavailable during the
failover. Instead, run any shell commands on the machines listed below, by
SSHing to them. Rails console commands should also be run on these machines:
SSH in and run `sudo gitlab-rails console` first (see the sketch after the
host list).
* Staging:
* GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
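For example, a minimal staging session looks like the following sketch (using the hostname listed above; SSH access is assumed to already be configured):
```shell
# Sketch: open a shell on the staging console host...
ssh web-01-sv-gstg.c.gitlab-staging-1.internal

# ...then, on that host, start a Rails console for any console commands.
sudo gitlab-rails console
```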
# Dashboards and debugging
* These dashboards might be useful during the failover:
* Staging:
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Sentry includes application errors. At present, the GCP environments log to the same Sentry instance
* Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
* Production:
* Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
* Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
* Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
* Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
* Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: Announce the failover date in the #general Slack channel and on the team call.
1. [ ] ☎ {+ Comms-Handler +}: Marketing team publishes the blog post about the upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Marketing team sends an email to all users announcing that GitLab.com will be undergoing scheduled maintenance. The email should cover:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
- Details of our backup policies to assure users that their data is safe
- Details of the specific situation with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
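A quick spot check that the environment really is unreachable can be run from any machine that is not on the VPN; this is a sketch, not part of the scripted procedure, and the exact failure mode (timeout, refusal, or redirect) depends on how the block is implemented:
```shell
# Sketch: both commands should fail if the GCP environment is correctly
# closed off to the outside world.
curl -sI --max-time 10 https://gprd.gitlab.com/ || echo 'gprd https unreachable (expected)'
ssh -o ConnectTimeout=5 git@gprd.gitlab.com 2>&1 | head -n 1
```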
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)`
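If an interactive console is inconvenient, the same change can be applied non-interactively from the console host with `gitlab-rails runner`; a sketch, mirroring the console command above:
```shell
# Sketch: set a 1-hour maximum timeout on shared runners (excluding gitlab-org
# tagged runners, as in the console command above) and report how many changed.
sudo gitlab-rails runner '
  excluded = Ci::Runner.instance_type.joins(:taggings).joins(:tags)
               .where("tags.name = ?", "gitlab-org").pluck(:id)
  updated  = Ci::Runner.instance_type.where("id NOT IN (?)", excluded)
               .update_all(maximum_timeout: 3600)
  puts "runners updated: #{updated}"
'
```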
# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for the [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours, starting an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours, starting an hour from now, with the following matcher(s):
- `environment`: `prd`
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request` (a curl smoke test is sketched at the end of this list)
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
* Staging: https://staging.gitlab.com/admin/broadcast_messages
* Production: https://gitlab.com/admin/broadcast_messages
* Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from __MAINTENANCE_START_TIME__ on __FAILOVER_DATE__ UTC`
* Start date: now
* End date: now + 3 hours
1. [ ] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
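Once the job-blocking change above has been chef'd onto the LBs, a quick smoke test from any externally connected machine looks like the sketch below; the exact status codes returned for the blocked endpoint are an assumption about how the HAProxy rule is written.
```shell
# Sketch: the CI job request endpoint should now be rejected at the LB,
# while normal page loads keep working until the blackout begins.
curl -s -o /dev/null -w 'POST /api/v4/jobs/request -> %{http_code}\n' \
  -X POST https://gitlab.com/api/v4/jobs/request
curl -s -o /dev/null -w 'GET  /users/sign_in       -> %{http_code}\n' \
  https://gitlab.com/users/sign_in
```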
## Failover Call
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.
Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:
* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.
### Roll call
- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call
### Notify Users of Maintenance Window
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side!`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is starting. See you on the other side!`
### Health check
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the GCP environment.
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
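The same check can be made against the Prometheus API instead of the dashboard; a sketch, where the hostname is a guess modelled on the Prometheus servers named in the dashboard URLs above and should be substituted with the real address:
```shell
# Sketch: count currently firing alerts in gstg; the result should be 0.
# The hostname below is an assumption -- replace it with the real Prometheus host.
curl -s 'http://prometheus-01-inf-gstg.c.gitlab-staging-1.internal:9090/api/v1/alerts' \
  | jq '[.data.alerts[] | select(.state == "firing")] | length'
```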
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
* Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
* Production:
* Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
* Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
* Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post (a rough curl/ssh spot check is sketched below)
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
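As a rough complement to the `hostinfo` check above, a spot check from a non-VPN machine might look like this sketch; the expected redirect target is the migration blog post, as described in the step above.
```shell
# Sketch: HTTPS should redirect away to the blog post, and SSH should be refused.
curl -sI https://gitlab.com/ | grep -i '^location:'
ssh -o ConnectTimeout=5 git@gitlab.com 2>&1 | head -n 1
```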
## Finish replicating and verifying all data
#### Phase 2: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
* Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
* Press "Sync Information"
* Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
* You can use `sudo gitlab-rake geo:status` instead if the UI is misbehaving
* If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
* On staging, this may not complete
* Fill event log gaps manually, by running the steps in:
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/fill-event-log-gaps.rb`
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
* Staging: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
* You can also use `sudo gitlab-rake geo:status`
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0 (a console one-liner for this check is sketched after this list)
* If a `geo_metrics_update` job is running, that can be ignored
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
* Check that no sidekiq processes show in the GitLab admin panel
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above
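For the queue-draining check above, the same numbers can be read from the console host without the admin UI; a sketch using Sidekiq's public API, not part of the scripted procedure:
```shell
# Sketch: print the Sidekiq backlog on the secondary; all figures should reach 0
# (an in-flight geo_metrics_update job can be ignored, as noted above).
sudo gitlab-rails runner '
  require "sidekiq/api"
  stats = Sidekiq::Stats.new
  busy  = Sidekiq::ProcessSet.new.map { |p| p["busy"] }.sum
  puts "busy=#{busy} enqueued=#{stats.enqueued} scheduled=#{stats.scheduled_size} retry=#{stats.retry_size}"
'
```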
At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
**REVIEW DATABASE FAILOVER**
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in GCP
* Staging: `bin/snapshot-dbs staging`
* Production: `bin/snapshot-dbs production`
1. [ ] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/update-priority.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-priority.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
* Keep everything, just ensure it’s turned off on the secondaries. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
* Keep everything, just ensure it’s turned off. The following script will prompt before shutting down postgresql.
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
- [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
```shell
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/confirm-repmgr.sh
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/connect-pgbouncers.sh
```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
```shell
$ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
```
- [ ] Instruct the remaining standby nodes to follow the new primary
```shell
$ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
```
*Note*: This will fail on the WAL-E node
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write (a direct `pg_is_in_recovery()` check is also sketched at the end of this phase)
```bash
/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-gcp-recovery.sh
```
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration according to
* Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
* Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
* **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that Unicorn processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
* `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
* `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
* This will automatically re-enable the disabled sidekiq-cron jobs
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
* Check that sidekiq processes show up in the GitLab admin panel
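In addition to the scripted `check-gcp-recovery.sh` above, the read-write check can be made directly against PostgreSQL; a minimal sketch, run on the intended new primary:
```shell
# Sketch: pg_is_in_recovery() should return "f" on the promoted primary
# and "t" on every standby.
sudo gitlab-psql -c 'SELECT pg_is_in_recovery();'
```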
#### Health check
1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
* Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
* Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
* The old primary in the GCP environment, backed by WAL-E log shipping, will
report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to REPMGR. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.
The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.
**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**
Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
- [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'` (a read-back query is sketched at the end of this phase)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
* `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
* Production:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
* `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
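For the repmgr priority update above, the new values can be read back from the repmgr metadata; a sketch to run on the current primary (schema and table names as used in the update command; the column list is an assumption about the repmgr 3.x schema):
```shell
# Sketch: every node should now show the expected priority and remain active.
sudo gitlab-psql -d gitlab_repmgr \
  -c 'SELECT name, type, priority, active FROM repmgr_gitlab_cluster.repl_nodes ORDER BY name;'
```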
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
- https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
- `GitLab.com planned maintenance for migration to @GCPcloud is almost complete. GitLab.com is available although we're continuing to verify that all systems are functioning correctly.`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly.`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA.** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
* Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`
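If the admin UI is awkward to use at this point, the same setting can be changed from a Rails console on a console host; a sketch that assumes the InfluxDB host lives in the `metrics_host` application setting:
```shell
# Sketch: point the InfluxDB metrics host at the gprd performance node.
# `metrics_host` is an assumption about the application settings column name.
sudo gitlab-rails runner '
  ApplicationSetting.current.update!(metrics_host: "performance-01-inf-gprd.c.gitlab-production.internal")
'
```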
/label ~"Failover Execution"