# Failover Team

| Role                                                                   | Assigned To                |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator                                                         | __TEAM_COORDINATOR__       |
| 🔪 Chef-Runner                                                         | __TEAM_CHEF_RUNNER__       |
| ☎ Comms-Handler                                                       | __TEAM_COMMS_HANDLER__     |
| 🐘 Database-Wrangler                                                   | __TEAM_DATABASE_WRANGLER__ |
| ☁ Cloud-conductor                                                     | __TEAM_CLOUD_CONDUCTOR__   |
| 🏆 Quality                                                             | __TEAM_QUALITY__           |
| ↩ Fail-back Handler (_Staging Only_)                                  | __TEAM_FAILBACK_HANDLER__  |
| 🎩 Head Honcho (_Production Only_)                                     | __TEAM_HEAD_HONCHO__       |

(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)

# Immediately

Perform these steps when the issue is created.

- [ ] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
    - Start Time: `__MAINTENANCE_START_TIME__` & End Time: `__MAINTENANCE_END_TIME__`
    - Google Working Doc: __GOOGLE_DOC_URL__ (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
    - **PRODUCTION ONLY** Blog Post: __BLOG_POST_URL__
    - **PRODUCTION ONLY** End Time: __MAINTENANCE_END_TIME__

# Support Options

| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |

# Database hosts

## Staging

```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```

## Production

```mermaid
graph TD;
  postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
  postgres02a --> postgres03a["postgres-03.db.prd"];
  postgres02a --> postgres04a["postgres-04.db.prd"];
  postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
  postgres01g --> postgres02g["postgres-02-db-gprd"];
  postgres01g --> postgres03g["postgres-03-db-gprd"];
  postgres01g --> postgres04g["postgres-04-db-gprd"];
```
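
At any point during the run you can confirm which node repmgr currently
considers the primary. A minimal sketch using the same command that appears
later in this plan; run it on any database node:

```shell
# Shows the current primary/standby roles as seen by repmgr.
sudo gitlab-ctl repmgr cluster show
```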


# Console hosts

The usual Rails and database console access hosts are unavailable during the
failover. Instead, run any shell commands on the machines listed below by
SSHing to them. Rails console commands should also be run on these machines:
SSH in and start a console with `sudo gitlab-rails console` (see the sketch
after the host list).

* Staging:
  * Azure: `web-01.sv.stg.gitlab.com`
  * GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
  * Azure: `web-01.sv.prd.gitlab.com`
  * GCP: `web-01-sv-gprd.c.gitlab-production.internal`
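
For example, to get a Rails console on staging via the Azure host (a minimal
sketch; substitute the appropriate hostname from the list above for other
environments):

```shell
# SSH to the designated console host, then start a Rails console via the
# omnibus wrapper.
ssh web-01.sv.stg.gitlab.com
sudo gitlab-rails console
```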

# Grafana dashboards

These dashboards might be useful during the failover:

* Staging:
  * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
  * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
  * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
  * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd

# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)

1. [x] Notify the content team of the upcoming announcements so they have time to prepare the blog post and email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!


# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)

1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: Communicate the date to Google
1. [ ] ☎ {+ Comms-Handler +}: Announce the date of the failover in #general on Slack and on the team call
1. [ ] ☎ {+ Comms-Handler +}: Marketing team publishes the blog post about the upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Marketing team sends an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
    - Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
    - Details of our backup policies to assure users that their data is safe
    - Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Ensure that the YouTube stream will be available for the Zoom call
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
    -  `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world


# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)

1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
    -  Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
    -  Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`

# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)

**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover

1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
    * In a Rails console, run:
    * `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`


# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)

**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover

GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
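
A rough way to spot-check the block once it is applied further down this
checklist is to watch the response of the blocked endpoint change. This is a
hedged sketch, not part of the official procedure - the exact status code
returned depends on the HAProxy rule in the MRs below:

```shell
# Hypothetical spot check: record the HTTP status before the HAProxy change,
# then re-run it afterwards; the status should change once
# POST /api/v4/jobs/request is blocked at the load balancers.
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://gitlab.com/api/v4/jobs/request
```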

1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
    -  `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
    -  `/opt/gitlab-migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for the [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours, starting an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours, starting an hour from now, with the following matcher(s):
    - `environment`: `prd`
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
    * Block `POST /api/v4/jobs/request`
    * Staging
        * https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
        * `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
    * Production
        * [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
            - `environment`: `prd`
            - `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
        * https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
        * `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
    * Staging: https://staging.gitlab.com/admin/broadcast_messages
    * Production: https://gitlab.com/admin/broadcast_messages
    * Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from XX:XX on 2018-XX-YY UTC`
    * Start date: now
    * End date: now + 3 hours
1. [ ] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
    * Staging: `bin/snapshot-dbs staging`
    * Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
    * Disable the cronjob on the **Azure** pages NFS server
    * `sudo crontab -e` to get an editor window, then comment out the line involving rsync (a non-interactive sketch appears at the end of this section)
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
    * Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
    * Updates to pages after the transfer starts will be lost.
    * The user running the rsync _must_ have full sudo access on both the Azure and GCP Pages servers.
    * This is currently very manual; it looks roughly like the following:
    * Staging:

        ```
        ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
        tmux
        sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
        ```
    * Production:

        ```
        ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
        tmux
        sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
        ```
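
As referenced in the "Stop automatic incremental GitLab Pages sync" step above,
a non-interactive alternative to editing the crontab by hand is sketched below.
It is an untested sketch and assumes the rsync job is the only root crontab
entry mentioning `rsync`, so review `sudo crontab -l` before and after:

```shell
# On the Azure pages NFS server: comment out any root crontab line that
# mentions rsync, then re-install the modified crontab.
sudo crontab -l | sed 's/^\([^#].*rsync.*\)$/# \1/' | sudo crontab -
```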


# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)

We expect the maintenance window to last for up to 2 hours, starting from now.


## Failover Procedure

These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.

Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:

* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)

Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.


### Roll call

- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call


### Notify Users of Maintenance Window

1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
    -  `GitLab.com will soon shutdown for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`

### Monitoring

- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc., run the command below on two machines - one with VPN access, one without (a rough manual substitute is sketched after the list).
    * Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
    * Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe0{1..9}.lb.gitlab.com fe{10..16}.lb.gitlab.com altssh0{1..2}.lb.gitlab.com`
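
If `bin/hostinfo` is unavailable on one of the two machines, a rough manual
substitute (a hedged sketch, not the script itself) is to resolve each name and
check where HTTPS requests end up:

```shell
# Hypothetical stand-in for bin/hostinfo: print the resolved A records and the
# final HTTP status / URL (after redirects) for a single host.
host=gitlab.com
dig +short "$host"
curl -sIL -o /dev/null -w '%{http_code} %{url_effective}\n' "https://$host/"
```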


### Health check

1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the Azure or GCP environments.
    * Staging
        * GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
        * Azure: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
    * Production
        * GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
        * Azure: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D

### Prevent updates to the primary


#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)

1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
    * Staging
        * Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
        * Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
        * Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
    * Production:
        * Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
        * Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
        * Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
    * This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
    * Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
    * Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
    * Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post

Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.


#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)

1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
    * `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
    * `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
    * In a separate terminal on the deploy host: `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
    * The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
    * Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
    * Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
    * Expand the `Verification Info` tab
    * Wait for the number of `unverified` repositories to reach 0
    * Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
    * Staging: https://staging.gitlab.com/admin/background_jobs
    * Production: https://gitlab.com/admin/background_jobs
    * Press `Queues -> Live Poll`
    * Wait for all queues not mentioned above to reach 0
    * Wait for the number of `Enqueued` and `Busy` jobs to reach 0
    * On staging, the repository verification queue may not empty
1. [ ] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
    * Staging: https://staging.gitlab.com/admin/sidekiq/retries
    * Production: https://gitlab.com/admin/sidekiq/retries
    * **NOTE**: This tab may contain confidential information. Do this out of screen capture!
    * Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
    * Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
    * Press "Retry All" to attempt to retry all remaining jobs immediately
    * Repeat until 0 retries are present
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
    * Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Check that no sidekiq processes show in the GitLab admin panel

At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
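
Optionally, confirm from the deploy host that Sidekiq really is stopped across
the Azure fleet (a hedged sketch reusing the knife roles above; every line
should report the service as `down`):

```shell
# Staging shown; swap in roles:gitlab-base-be-sidekiq for production.
knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl status sidekiq-cluster"
```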

## Finish replicating and verifying all data


#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)

1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
    * [ ] CI traces in Redis
        * Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
    * Staging: https://gstg.gitlab.com/admin/geo_nodes
    * Production: https://gprd.gitlab.com/admin/geo_nodes
    * Press "Sync Information"
    * Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
    * You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
    * If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
    * On staging, this may not complete
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
    * Press "Verification Information"
    * Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
    * You can use `sudo gitlab-rake geo:status` instead if the UI is non-compliant
    * If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
    * On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
    * Staging: `postgres-01.db.gstg.gitlab.com`
    * Production: `postgres-01-db-gprd.c.gitlab-production.internal`
    * `sudo gitlab-psql -d gitlabhq_production -c "SELECT now() - pg_last_xact_replay_timestamp();"`
    * Assuming the clocks are in sync, this value should be close to 0
    * If this is a large number, GCP may not have some data that is in Azure
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
    * In a dedicated rails console on the **secondary**:
    * `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
    * The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
    * Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
    * Need more details?
        * Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
        * Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
    * This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
    * Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Check that no sidekiq processes show in the GitLab admin panel

At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we failover, no data will be lost.
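
As a final cross-check before promotion, the Geo status summary referenced
earlier can be printed on a GCP node; this is optional and reports the same
sync and verification figures as the admin UI:

```shell
# On a GCP (secondary) application node.
sudo gitlab-rake geo:status
```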

Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.


## Promote the secondary


#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)

1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
    * Staging: `bin/snapshot-dbs staging`
    * Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
    * The incremental `rsync` commands set off above should be completed by now
    * If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [ ] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
    * Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
    * Staging
        - [ ] `staging.gitlab.com A 35.227.123.228`
        - [ ] `altssh.staging.gitlab.com A 35.185.33.132`
        - [ ] `*.staging.gitlab.io A 35.229.69.78`
        - **DO NOT** change `staging.gitlab.io`.
    * Production **UNTESTED**
        - [ ] `gitlab.com A 35.231.145.151`
        - [ ] `altssh.gitlab.com A 35.190.168.187`
        - [ ] `*.gitlab.io A 35.185.44.232`
        - **DO NOT** change `gitlab.io`.
1. [ ] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:

    ```shell
    # gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=100 where name like '%gstg%'"
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
    * Keep everything, just ensure it’s turned off

    ```shell
    $ knife ssh "role:staging-base-db-postgres AND NOT fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
    * Keep everything, just ensure it’s turned off

    ```shell
    $ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
     - [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
     - [ ] Confirm the pgbouncer node in GCP is pointing at the new primary (password is in 1Password)

        ```shell
        $ gitlab-ctl pgb-console
        ...
        pgbouncer# SHOW DATABASES;
        # You want to see lines like
        gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production |            |       100 |            5 |           |               0 |                   0
        gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production |            |       150 |            5 |           |               0 |                   0
        ...
        pgbouncer# SHOW SERVERS;
        # You want to see lines like
          S    | gitlab    | gitlabhq_production | idle  | PRIMARY_IP | 5432 | PGBOUNCER_IP |      54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 |    |      19430 |
        ```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
     - [ ] Promote the desired primary

        ```shell
        $ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
        ```
     - [ ] Instruct the remaining standby nodes to follow the new primary

         ```shell
         $ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
         ```
         *Note*: This will fail on the WAL-E node
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
    * Connect to the newly promoted primary in GCP
    * `sudo gitlab-psql -d gitlabhq_production -c "select * from pg_is_in_recovery();"`
    * The result should be `F`
1. [ ] 🔪 {+ Chef-Runner +}: Update the Chef configuration by applying:
    * Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
    * Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
    * **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
    * **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
    * Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
    * Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
    * Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
    * Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
    * [ ] Unicorn
    * [ ] Sidekiq
    * [ ] GitLab Pages
1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
    * This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
    * In a Rails console in GCP:
        * Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
        * Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
    * `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
    * `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
    * This will automatically re-enable the disabled sidekiq-cron jobs
    * Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
    * Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
    * [ ] Check that sidekiq processes show up in the GitLab admin panel

#### Health check

1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
    * Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
    * Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
    * The old primary in the GCP environment, backed by WAL-E log shipping, will
      report "replication lag too large" and "unused replication slot". This is OK.

## During-Blackout QA

#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)

The details of the QA tasks are listed in the test plan document.

- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded


## Evaluation of QA results - **Decision Point**


#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)

If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.

The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.

**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**

Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.

Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.

We should abort for failures caused by missing data unless all the following apply:

* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day

We should abort for failures caused by misconfiguration unless all the following apply:

* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day

If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.

## Complete the Migration (T plus 2 hours)

#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)

1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
    1. [ ] `emails_on_push` queue
    1. [ ] `mailers` queue
    1. [ ] (`admin_emails` queue doesn't exist any more)
    1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
    1. [ ] Run chef-client and restart mailroom:
        * `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: **PRODUCTION ONLY** Ensure the secondary can send emails
    1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
    1. [ ] Ensure you receive the email


#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)

1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
     - [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
   - [ ] Update in chef cookbooks by removing the setting entirely
   - [ ] Update in the running database
       - [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
   - [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
   - [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
    * Staging:
        * Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
        * `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
    * Production:
        * Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
        * `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
    * Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
    * Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
    * In a Rails console, run:
    * `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`

#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)

1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
    -  `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`


#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)

1. **Start After-Blackout QA** This is the second half of the test plan.
    1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
    1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded

## **PRODUCTION ONLY** Post migration

1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
    * The Pages LB proxy must be retained
    * We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
    * Unused machines can be switched off
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
    * Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`

/label ~"Failover Execution"