# Failover Team

| Role                                                                   | Assigned To                |
| -----------------------------------------------------------------------|----------------------------|
| 🐺 Coordinator                                                         | __TEAM_COORDINATOR__       |
| 🔪 Chef-Runner                                                         | __TEAM_CHEF_RUNNER__       |
| ☎ Comms-Handler                                                       | __TEAM_COMMS_HANDLER__     |
| 🐘 Database-Wrangler                                                   | __TEAM_DATABASE_WRANGLER__ |
| ☁ Cloud-conductor                                                     | __TEAM_CLOUD_CONDUCTOR__   |
| 🏆 Quality                                                             | __TEAM_QUALITY__           |
| ↩ Fail-back Handler (_Staging Only_)                                  | __TEAM_FAILBACK_HANDLER__  |
| 🎩 Head Honcho (_Production Only_)                                     | __TEAM_HEAD_HONCHO__       |
(try to ensure that 🔪, ☁ and ↩ are always the same person for any given run)
# Immediately

Perform these steps when the issue is created.

- [ ] 🐺 {+ Coordinator +}: Fill out the names of the failover team in the table above.
- [ ] 🐺 {+ Coordinator +}: Fill out dates/times and links in this issue:
    - Start Time: `__MAINTENANCE_START_TIME__` & End Time: `__MAINTENANCE_END_TIME__`
    - Google Working Doc: __GOOGLE_DOC_URL__ (for PRODUCTION, create a new doc and make it writable for GitLabbers, and readable for the world)
    - **PRODUCTION ONLY** Blog Post: __BLOG_POST_URL__
    - **PRODUCTION ONLY** End Time: __MAINTENANCE_END_TIME__
# Support Options

| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| **Microsoft Azure** | [Professional Direct Support](https://azure.microsoft.com/en-gb/support/plans/) | 24x7, email & phone, 1 hour turnaround on Sev A | [**Create Azure Support Ticket**](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest) |
| **Google Cloud Platform** | [Gold Support](https://cloud.google.com/support/?options=premium-support#options) | 24x7, email & phone, 1hr response on critical issues | [**Create GCP Support Ticket**](https://enterprise.google.com/supportcenter/managecases) |
# Database hosts
## Staging

```mermaid
graph TD;
postgres02a["postgres02.db.stg.gitlab.com (Current primary)"] --> postgres01a["postgres01.db.stg.gitlab.com"];
postgres02a -->|WAL-E| postgres02g["postgres-02.db.gstg.gitlab.com"];
postgres02g --> postgres01g["postgres-01.db.gstg.gitlab.com"];
postgres02g --> postgres03g["postgres-03.db.gstg.gitlab.com"];
```
## Production

```mermaid
graph TD;
  postgres02a["postgres-02.db.prd (Current primary)"] --> postgres01a["postgres-01.db.prd"];
  postgres02a --> postgres03a["postgres-03.db.prd"];
  postgres02a --> postgres04a["postgres-04.db.prd"];
  postgres02a -->|WAL-E| postgres01g["postgres-01-db-gprd"];
  postgres01g --> postgres02g["postgres-02-db-gprd"];
  postgres01g --> postgres03g["postgres-03-db-gprd"];
  postgres01g --> postgres04g["postgres-04-db-gprd"];
```


# Console hosts

The usual Rails and database console access hosts are unavailable during the
failover. Instead, run any shell commands on the following machines by SSHing to
them. Rails console commands should also be run on these machines, by SSHing to
them and starting a console with `sudo gitlab-rails console` first (see the
example below the host list).

* Staging:
  * Azure: `web-01.sv.stg.gitlab.com`
  * GCP: `web-01-sv-gstg.c.gitlab-staging-1.internal`
* Production:
  * Azure: `web-01.sv.prd.gitlab.com`
  * GCP: `web-01-sv-gprd.c.gitlab-production.internal`
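
For example, to open a Rails console on the staging GCP host listed above (a minimal sketch; the same pattern applies to the other hosts):

```shell
# SSH to the designated console host, then start a Rails console there
ssh web-01-sv-gstg.c.gitlab-staging-1.internal
sudo gitlab-rails console
```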

# Dashboards and debugging
* These dashboards might be useful during the failover:
    * Staging:
        * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
        * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
    * Production:
        * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
        * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
    * Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
    * Production:
        * Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
        * Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
        * Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
        * Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
        * Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
* The logs can be used to inspect any area of the stack in more detail
    * https://log.gitlab.net/
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!

# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: Communicate the failover date to Google
1. [ ] ☎ {+ Comms-Handler +}: Announce the failover date in #general on Slack and on the team call
1. [ ] ☎ {+ Comms-Handler +}: Have the marketing team publish the blog post about the upcoming GCP failover
1. [ ] ☎ {+ Comms-Handler +}: Have the marketing team send an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
    - Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover
    - Details of our backup policies to assure users that their data is safe
    - Details of specific situations with very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
1. [ ] ☎ {+ Comms-Handler +}: Ensure that the YouTube stream will be available for the Zoom call
1. [ ] ☎ {+ Comms-Handler +}: Tweet blog post from `@gitlab` and `@gitlabstatus`
    -  `Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from __MAINTENANCE_START_TIME__ - __MAINTENANCE_END_TIME__ UTC. Follow @gitlabstatus for more details. __BLOG_POST_URL__`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
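    * A quick spot-check from a machine outside the VPN (a sketch only, not authoritative; expect the request to be refused or to time out while the environment is locked down):

        ```shell
        # gstg.gitlab.com is the GCP staging host; use gprd.gitlab.com for production
        curl -m 10 -sI https://gstg.gitlab.com/ && echo 'WARNING: reachable' || echo 'blocked as expected'
        ```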
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
    -  Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
    -  Tweet content from `/opt/gitlab-migration/migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)

**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover

1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
    * In a Rails console, run:
    * `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 3600)`


# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)

**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover

GitLab runners attempting to post artifacts back to GitLab.com during the
maintenance window will fail and the artifacts may be lost. To avoid this as
much as possible, we'll stop any new runner jobs from being picked up, starting
an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
    -  `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
    -  `/opt/gitlab-migration/migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for the [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours, starting one hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours, starting one hour from now, with the following matcher(s):
    - `environment`: `prd`
1. [ ] **PRODUCTION ONLY** 🔪 {+ Chef-Runner +}: Silence production alerts
    * [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
        * `provider`: `azure`
        * `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
    * Block `POST /api/v4/jobs/request`
    * Staging
        * https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
        * `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
    * Production
        * https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
        * `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
    * Staging: https://staging.gitlab.com/admin/broadcast_messages
    * Production: https://gitlab.com/admin/broadcast_messages
    * Text: `gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we’re going dark for approximately 2 hours from __MAINTENANCE_START_TIME__ on __FAILOVER_DATE__ UTC`
    * Start date: now
    * End date: now + 3 hours
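    * Optionally, the same broadcast can be created via the broadcast messages API rather than the admin UI (a hypothetical sketch; `$ADMIN_TOKEN` stands for an admin personal access token, and `$STARTS_AT`/`$ENDS_AT` for ISO 8601 timestamps covering "now" to "now + 3 hours"):

        ```shell
        curl --request POST --header "PRIVATE-TOKEN: $ADMIN_TOKEN" \
             --data-urlencode "message=gitlab.com is [moving to a new home](https://about.gitlab.com/2018/04/05/gke-gitlab-integration/)! Hold on to your hats, we're going dark for approximately 2 hours from __MAINTENANCE_START_TIME__ on __FAILOVER_DATE__ UTC" \
             --data "starts_at=$STARTS_AT" --data "ends_at=$ENDS_AT" \
             "https://staging.gitlab.com/api/v4/broadcast_messages"
        ```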
1. [ ] ☁ {+ Cloud-conductor +}: Initial snapshot of database disks in case of failback in Azure and GCP
    * Staging: `bin/snapshot-dbs staging`
    * Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
    * Disable the cronjob on the **Azure** pages NFS server
    * This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
    * `sudo crontab -e` to open an editor window, then comment out the line involving the pages-sync script
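    * A non-interactive alternative sketch (assumes the cron entry contains `pages-sync`; verify with `sudo crontab -l` afterwards):

        ```shell
        # Comment out the pages-sync entry in root's crontab and reinstall it
        sudo crontab -l | sed '/pages-sync/ s/^/# /' | sudo crontab -
        ```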
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
    * Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
    * Updates to pages after the transfer starts will be lost.
    * The user running the rsync _must_ have full sudo access on both the Azure and GCP Pages hosts.
    * Very manual, looks a little like the following at present:
    * Staging:
        ```
        ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
        tmux
        sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
        ```
    * Production:
        ```
        ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
        tmux
        sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
        ```
## Failover Call
These steps will be run in a Zoom call. The 🐺 {+ Coordinator +} runs the call,
asking other roles to perform each step on the checklist at the appropriate
time.

Changes are made one at a time, and verified before moving onto the next step.
Whoever is performing a change should share their screen and explain their
actions as they work through them. Everyone else should watch closely for
mistakes or errors! A few things to keep an especially sharp eye out for:

* Exposed credentials (except short-lived items like 2FA codes)
* Running commands against the wrong hosts
* Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)

Remember that the intention is for the call to be broadcast live on the day. If
you see something happening that shouldn't be public, mention it.

### Roll call

- [ ] ☎ {+ Comms-Handler +}: Make sure the YouTube stream is started
- [ ] 🐺 {+ Coordinator +}: Ensure everyone mentioned above is on the call
- [ ] 🐺 {+ Coordinator +}: Ensure the Zoom room host is on the call

### Notify Users of Maintenance Window

1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
    -  `GitLab.com will soon shut down for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
    -  https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
    - `GitLab.com planned maintenance for migration to @GCPcloud is starting. See you on the other side! We'll be live on YouTube`
### Monitoring
- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc., run the command below on two machines - one with VPN access, one without.
    * Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
    * Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe0{1..9}.lb.gitlab.com fe{10..16}.lb.gitlab.com altssh0{1..2}.lb.gitlab.com`
### Health check
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the Azure or GCP environments.
    * Staging
        * GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
        * Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
    * Production
        * GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
        * Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)

We expect the maintenance window to last for up to 2 hours, starting from now.
## Failover Procedure

### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
    * Staging
        * Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
        * Run `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
        * Run `knife ssh roles:staging-base-fe-git 'sudo chef-client'`
    * Production:
        * Apply this MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254
        * Run `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
        * Run `knife ssh roles:gitlab-base-fe-git 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Restart HAProxy on all LBs to terminate any on-going connections
    * This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
    * Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
    * Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
    * Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
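    * Optional spot-checks from the non-VPN machine (a minimal sketch; the `hostinfo` output above remains the source of truth):

        ```shell
        # Expect an HTTP redirect (towards the migration blog post) rather than a normal response
        curl -sI https://staging.gitlab.com/ | grep -i '^location:'
        # Expect SSH to be dropped (connection refused or timeout)
        ssh -o ConnectTimeout=5 -T git@staging.gitlab.com
        ```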

Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
    * `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/010-stop-mailroom.sh`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
    * `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/020-stop-sidekiq-pullmirror.sh`
1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start the purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind down:
    * In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
    * The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
    * The loop should be stopped once sidekiq is shut down
    * Wait for `--> Status: PROCEED`
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
    * Staging: https://performance.gitlab.net/d/000000286/gcp-failover-azure?orgId=1&var-environment=stg
    * Production: https://performance.gitlab.net/d/000000286/gcp-failover-azure?orgId=1&var-environment=prd
    * Wait for the number of `unverified` repositories and wikis to reach 0
    * Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
    * Staging: https://staging.gitlab.com/admin/background_jobs
    * Production: https://gitlab.com/admin/background_jobs
    * Press `Queues -> Live Poll`
    * Wait for all queues not mentioned above to reach 0
    * Wait for the number of `Enqueued` and `Busy` jobs to reach 0
    * On staging, the repository verification queue may not empty
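    * Optional CLI cross-check from the Azure console host (a sketch using Sidekiq's stats API; all values should trend towards 0):

        ```shell
        sudo gitlab-rails runner 's = Sidekiq::Stats.new; p(enqueued: s.enqueued, busy: s.workers_size, scheduled: s.scheduled_size, retries: s.retry_size)'
        ```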
1. [ ] 🐺 {+ Coordinator +}: Handle Sidekiq jobs in the "retry" state
    * Staging: https://staging.gitlab.com/admin/sidekiq/retries
    * Production: https://gitlab.com/admin/sidekiq/retries
    * **NOTE**: This tab may contain confidential information. Do this out of screen capture!
    * Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
    * Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
    * Press "Retry All" to attempt to retry all remaining jobs immediately
    * Repeat until 0 retries are present
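    * A scripted variant of the clean-up above (hypothetical sketch; queue selection still needs human judgement, so prefer the UI when in doubt):

        ```shell
        sudo gitlab-rails runner '
          idempotent = %w[reactive_caching repository_update_remote_mirror]
          retries = Sidekiq::RetrySet.new
          retries.each { |job| job.delete if idempotent.include?(job.queue) }
          retries.retry_all  # retry whatever is left; repeat until the set is empty
        '
        ```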
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in Azure
    * Staging: `knife ssh roles:staging-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Production: `knife ssh roles:gitlab-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Check that no sidekiq processes show in the GitLab admin panel
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above

At this point, the primary can no longer receive any updates. This allows the
state of the secondary to converge.
## Finish replicating and verifying all data

#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
    * In a Rails console in Azure:
    * `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
    * Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
    * Staging: https://gstg.gitlab.com/admin/geo_nodes
    * Production: https://gprd.gitlab.com/admin/geo_nodes
    * Press "Sync Information"
    * Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
    * You can use `sudo gitlab-rake geo:status` instead if the UI is not cooperating
    * If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
    * On staging, this may not complete
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become verified
    * Staging: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
    * Production: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
    * Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
    * You can also use `sudo gitlab-rake geo:status`
    * If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
    * On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: Ensure the whole event log has been processed
    * In Azure: `Geo::EventLog.maximum(:id)`
    * In GCP: `Geo::EventLogState.last_processed.id`
    * The two numbers should be the same
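    * For example, from the respective console hosts (a sketch wrapping the expressions above in `gitlab-rails runner`):

        ```shell
        # On the Azure (current primary) console host:
        sudo gitlab-rails runner 'puts Geo::EventLog.maximum(:id)'
        # On the GCP (secondary) console host:
        sudo gitlab-rails runner 'puts Geo::EventLogState.last_processed.id'
        ```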
1. [ ]  🐺 {+ Coordinator +}: Ensure the prospective failover target in GCP is up to date
    * `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/check-wal-secondary-sync.sh`
    * Assuming the clocks are in sync, this value should be close to 0
    * If this is a large number, GCP may not have some data that is in Azure
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
    * In a dedicated rails console on the **secondary**:
    * `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
    * The loop should be stopped once sidekiq is shut down
    * The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
    * Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
    * Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
    * `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
    * If a `geo_metrics_update` job is running, that can be ignored
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
    * This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
    * Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
    * Check that no sidekiq processes show in the GitLab admin panel
1. [ ] 🐺 {+ Coordinator +}: Stop the Sidekiq queue disabling loop from above

At this point all data on the primary should be present in exactly the same form
on the secondary. There is no outstanding work in sidekiq on the primary or
secondary, and if we fail over, no data will be lost.

Stopping all cronjobs on the secondary means it will no longer attempt to run
background synchronization operations against the primary, reducing the chance
of errors while it is being promoted.

## Promote the secondary

#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
    * Staging: `bin/snapshot-dbs staging`
    * Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure GitLab Pages sync is completed
    * The incremental `rsync` commands set off above should be completed by now
    * If still ongoing, the DNS update will cause some Pages sites to temporarily revert
1. [ ] ☁ {+ Cloud-conductor +}: Update DNS entries to refer to the GCP load-balancers
    * Panel is https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
    * Staging
        - [ ] `staging.gitlab.com A 35.227.123.228`
        - [ ] `altssh.staging.gitlab.com A 35.185.33.132`
        - [ ] `*.staging.gitlab.io A 35.229.69.78`
        - **DO NOT** change `staging.gitlab.io`.
    * Production **UNTESTED**
        - [ ] `gitlab.com A 35.231.145.151`
        - [ ] `altssh.gitlab.com A 35.190.168.187`
        - [ ] `*.gitlab.io A 35.185.44.232`
        - **DO NOT** change `gitlab.io`.
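    * Once the records are updated, propagation can be checked from any host (a sketch using the staging values above; cached TTLs may delay the change):

        ```shell
        dig +short staging.gitlab.com A              # expect 35.227.123.228
        dig +short altssh.staging.gitlab.com A       # expect 35.185.33.132
        dig +short gitlab-org.staging.gitlab.io A    # expect 35.229.69.78 (via *.staging.gitlab.io)
        ```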
1. [ ] 🐘 {+ Database-Wrangler +}: Update the priority of GCP nodes in the repmgr database. Run the following on the current primary:

    ```shell
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/update-priority.sh
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-priority.sh
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql standby instances.
    * Keep everything, just ensure it’s turned off on the secondaries. The following script will prompt before shutting down postgresql.

    ```shell
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: **Gracefully** turn off the **Azure** postgresql primary instance.
    * Keep everything, just ensure it’s turned off. The following script will prompt before shutting down postgresql.

    ```shell
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
     - [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
    ```shell
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/confirm-repmgr.sh
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/connect-pgbouncers.sh
    ```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
     - [ ] Promote the desired primary

        ```shell
        $ knife ssh "fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby promote"
        ```
     - [ ] Instruct the remaining standby nodes to follow the new primary

         ```shell
         $ knife ssh "role:gstg-base-db-postgres AND NOT fqdn:DESIRED_PRIMARY" "gitlab-ctl repmgr standby follow DESIRED_PRIMARY"
         ```
         *Note*: This will fail on the WAL-E node
1. [ ] 🐘 {+ Database-Wrangler +}: Check the database is now read-write
    ```shell
    /opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p04/check-gcp-recovery.sh
    ```
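    * A manual cross-check on the new primary (a sketch; assumes the default `gitlabhq_production` database name) should report `f`, i.e. the node is no longer in recovery:

        ```shell
        sudo gitlab-psql -d gitlabhq_production -c 'SELECT pg_is_in_recovery();'
        ```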
1. [ ] 🔪 {+ Chef-Runner +}: Update the chef configuration by applying the appropriate merge request:
    * Staging: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
    * Production: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
    * **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
    * **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
    * Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
    * Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that Unicorn processes have been restarted on all hosts
    * Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
    * Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
    * This ensures we continue to generate Geo event logs for a time, which may be useful for a last-gasp failback
    * In a Rails console in GCP:
        * Staging: `GeoNode.where(url: "https://gstg.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
        * Production: `GeoNode.where(url: "https://gprd.gitlab.com/").update_all(url: "https://azure.gitlab.com/")`
1. [ ] 🔪 {+ Chef-Runner +}: Flush any unwanted Sidekiq jobs on the promoted secondary
    * `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
1. [ ] 🔪 {+ Chef-Runner +}: Clear Redis cache of promoted secondary
    * `Gitlab::Application.load_tasks; Rake::Task['cache:clear:redis'].invoke`
1. [ ] 🔪 {+ Chef-Runner +}: Start sidekiq in GCP
    * This will automatically re-enable the disabled sidekiq-cron jobs
    * Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
    * Production: `knife ssh roles:gprd-base-be-sidekiq "sudo gitlab-ctl start sidekiq-cluster"`
    * Check that sidekiq processes show up in the GitLab admin panel
#### Health check

1. [ ] 🐺 {+ Coordinator +}: Check for any alerts that might have been raised and investigate them
    * Staging: https://alerts.gstg.gitlab.net or #alerts-gstg in Slack
    * Production: https://alerts.gprd.gitlab.net or #alerts-gprd in Slack
    * The old primary in the GCP environment, backed by WAL-E log shipping, will
      report "replication lag too large" and "unused replication slot". This is OK.
## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.

- [ ] 🏆 {+ Quality +}: All "during the blackout" QA automated tests have succeeded
- [ ] 🏆 {+ Quality +}: All "during the blackout" QA manual tests have succeeded

## Evaluation of QA results - **Decision Point**

#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
failover, or to abort, failing back to Azure. A decision to continue in these
circumstances should be counter-signed by the 🎩 {+ Head Honcho +}.

The top priority is to maintain data integrity. Failing back after the blackout
window has ended is very difficult, and will result in any changes made in the
interim being lost.

**Don't Panic! [Consult the failover priority list](https://dev.gitlab.org/gitlab-com/migration/blob/master/README.md#failover-priorities)**

Problems may be categorized into three broad causes - "unknown", "missing data",
or "misconfiguration". Testers should focus on determining which bucket
a failure falls into, as quickly as possible.

Failures with an unknown cause should be investigated further. If we can't
determine the root cause within the blackout window, we should fail back.

We should abort for failures caused by missing data unless all the following apply:

* The scope is limited and well-known
* The data is unlikely to be missed in the very short term
* A named person will own back-filling the missing data on the same day

We should abort for failures caused by misconfiguration unless all the following apply:

* The fix is obvious and simple to apply
* The misconfiguration will not cause data loss or corruption before it is corrected
* A named person will own correcting the misconfiguration on the same day

If the number of failures seems high (double digits?), strongly consider failing
back even if they each seem trivial - the causes of each failure may interact in
unexpected ways.

## Complete the Migration (T plus 2 hours)

#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
    1. [ ] `emails_on_push` queue
    1. [ ] `mailers` queue
    1. [ ] (`admin_emails` queue doesn't exist any more)
    1. [ ] Rotate the password of the incoming@gitlab.com account and update the vault
    1. [ ] Run chef-client and restart mailroom:
        * `bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
1. [ ] 🐺 {+Coordinator+}: **PRODUCTION ONLY** Ensure the secondary can send emails
    1. [ ] Run the following in a Rails console (changing `you` to yourself): `Notify.test_email("you+test@gitlab.com", "Test email", "test").deliver_now`
    1. [ ] Ensure you receive the email

#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
    * In a Rails console, run:
    * `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
     - [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
   - [ ] Update in chef cookbooks by removing the setting entirely
   - [ ] Update in the running database
       - [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
   - [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
   - [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
    * Staging:
        * Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
        * `bundle exec knife ssh role:staging-base-lb-pages 'sudo chef-client'`
    * Production:
        * Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
        * `bundle exec knife ssh role:gitlab-base-lb-pages 'sudo chef-client'`
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
    * Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
    * Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Update maintenance status on status.io
    -  https://app.status.io/dashboard/5b36dc6502d06804c08349f7/maintenance/5b50be1e6e3499540bd86e65/edit
    - `GitLab.com planned maintenance for migration to @GCPcloud is almost complete. GitLab.com is available although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
    -  `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)

1. **Start After-Blackout QA** This is the second half of the test plan.
    1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
    1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA manual tests have succeeded
## **PRODUCTION ONLY** Post migration
1. [ ] 🐺 {+ Coordinator +}: Close the failback issue - it isn't needed
1. [ ] ☁ {+ Cloud-conductor +}: Disable unneeded resources in the Azure environment
    * The Pages LB proxy must be retained
    * We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
    * Unused machines can be switched off
1. [ ] ☁ {+ Cloud-conductor +}: Change GitLab settings: [https://gprd.gitlab.com/admin/application_settings](https://gprd.gitlab.com/admin/application_settings)
    * Metrics - Influx -> InfluxDB host should be `performance-01-inf-gprd.c.gitlab-production.internal`

/label ~"Failover Execution"