# Pre-flight checks

## Dashboards and Alerts

1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the Azure or GCP environments.
    - Staging
        - GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
        - Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
    - Production
        - GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
        - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [ ] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
    - Staging
        - GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
        - Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
    - Production
        - GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
        - Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd

## GitLab Version and CDN Checks

1. [ ] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`): auto-restarting the databases has caused problems in the past, so they are now only updated in a controlled way.
    - Versions can be confirmed using the Omnibus version tracker dashboards:
        - Staging
            - GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
            - Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
        - Production
            - GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
            - Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd

1. [ ] 🐺 {+Coordinator+}: Ensure that the Fastly CDN IP ranges are up to date.
    - Check the following Chef roles against the official IP list: https://api.fastly.com/public-ip-list
        - Staging
            - GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
        - Production
            - GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
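
The Fastly comparison above can be done by eye, but a small script makes drift easier to spot. The following is a minimal Ruby sketch, not part of the migration tooling: the helper names `fetch_fastly_ranges` and `diff_ranges` are hypothetical, the `addresses` key is an assumption about the endpoint's JSON response, and the configured ranges are assumed to be copied by hand from the Chef role file.

```ruby
require "json"
require "net/http"
require "uri"

# Public endpoint named in the checklist item above.
FASTLY_IP_LIST_URL = "https://api.fastly.com/public-ip-list"

# Fetches the IPv4 ranges Fastly publishes.
# Assumption: the response is a JSON object with an "addresses" array.
def fetch_fastly_ranges(url = FASTLY_IP_LIST_URL)
  body = Net::HTTP.get(URI(url))
  JSON.parse(body).fetch("addresses")
end

# Compares the ranges configured in a Chef role with the published list.
def diff_ranges(configured, published)
  {
    missing: (published - configured).sort, # published by Fastly, absent from the role
    stale: (configured - published).sort    # present in the role, no longer published
  }
end
```

Calling `diff_ranges(ranges_from_role, fetch_fastly_ranges)` returns two lists; both being empty suggests the role is up to date.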

## Object storage

1. [ ] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below,
    execute the line on the primary console first and copy the result to the clipboard. Then type the same line on the secondary console,
    append `==`, paste the result from the primary console, and execute it. You should get a `true` or `false` value; `true` means the configurations match.
    1. [ ] `Gitlab.config.uploads`
    1. [ ] `Gitlab.config.lfs`
    1. [ ] `Gitlab.config.artifacts`
1. [ ] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
    * If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
    * On staging, these numbers are non-zero. Just mark as checked.
    1. [ ] `Upload.with_files_stored_locally.count` # => 0
    1. [ ] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
    1. [ ] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
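
Pasting a full configuration dump between consoles can be awkward. As an alternative to the copy-and-paste comparison above, each console could print a short fingerprint of the configuration, and the two strings could be compared by eye. This is a minimal sketch: `canonicalize` and `config_fingerprint` are hypothetical helpers, and it assumes the `Gitlab.config.*` objects behave like (or can be converted to) plain hashes.

```ruby
require "digest"
require "json"

# Recursively sorts hash keys so that two equal configurations serialize
# to the same string regardless of key order.
def canonicalize(value)
  case value
  when Hash
    value.map { |k, v| [k.to_s, canonicalize(v)] }.sort_by(&:first).to_h
  when Array
    value.map { |v| canonicalize(v) }
  else
    value
  end
end

# Returns a SHA-256 hex digest of the canonical JSON form of `config`.
def config_fingerprint(config)
  Digest::SHA256.hexdigest(JSON.generate(canonicalize(config)))
end
```

For example, run `config_fingerprint(Gitlab.config.uploads)` on both consoles; matching digests indicate matching configuration.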

## Pre-migrated services

1. [ ] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)


## Configuration checks

1. [ ] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
    * A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
    1. [ ] Manually compare the diff of `/etc/gitlab/gitlab.rb`
    1. [ ] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [ ] 🐺 {+Coordinator+}: Check SSH host keys match
    * Staging:
        - [ ] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
        - [ ] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
    * Production:
        - [ ] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
        - [ ] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
1. [ ] 🐺 {+Coordinator+}: Ensure repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
    * `Feature.enabled?(:geo_repository_verification)`
1. [ ] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
    * 300 seconds is fine
    * Staging:
        - [ ] `staging.gitlab.com`
        - [ ] `altssh.staging.gitlab.com`
        - [ ] `gitlab-org.staging.gitlab.io`
    * Production:
        - [ ] `gitlab.com`
        - [ ] `altssh.gitlab.com`
        - [ ] `gitlab-org.gitlab.io`
1. [ ] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
    * Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
    * Staging:
        - [ ] `bin/check-ssl gstg.gitlab.com:443`
        - [ ] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
    * Production:
        - [ ] `bin/check-ssl gprd.gitlab.com:443`
        - [ ] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
1. [ ] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
    * `chef-client role:gitlab-base pwd`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
    1. [ ] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
    * Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
    * Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
    1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
    1. Check the omnibus version running in the target environment
         * Staging: `knife role show gstg-omnibus-version | grep version:`
         * Production: `knife role show gprd-omnibus-version | grep version:`
    1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
         * E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
    1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
         * The command can fail if the patches have already been applied; that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
    * Staging:
        * [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094)
        * [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029)
        * [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
        * [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
        * [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
        * [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
    * Production:
        * [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
        * [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
        * [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
        * [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
        * [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
        * [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)

1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
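
The DNS TTL check in this section can also be scripted. The following is a minimal Ruby sketch using the standard library's `Resolv`; the helpers `highest_ttl` and `ttl_low_enough?` are hypothetical and are not part of the migration repository. Note that a caching resolver reports the remaining TTL, which may be lower than the configured value, so a resolver pointed at the zone's authoritative nameserver (via the `nameserver:` option of `Resolv::DNS.new`) gives a more reliable reading.

```ruby
require "resolv"

# Threshold from the checklist: 300 seconds is considered fine.
MAX_TTL_SECONDS = 300

# Returns the highest TTL among the A and CNAME records the resolver
# returns for `name`, or nil when it returns no such records.
def highest_ttl(name, resolver: Resolv::DNS.new)
  types = [Resolv::DNS::Resource::IN::A, Resolv::DNS::Resource::IN::CNAME]
  types.flat_map { |type| resolver.getresources(name, type) }.map(&:ttl).max
end

# True when at least one record was found and every TTL is within the limit.
def ttl_low_enough?(name, resolver: Resolv::DNS.new, max: MAX_TTL_SECONDS)
  ttl = highest_ttl(name, resolver: resolver)
  !ttl.nil? && ttl <= max
end
```

For example, `ttl_low_enough?("staging.gitlab.com")` could be run for each hostname listed in the TTL item.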

## Ensure Geo replication is up to date

1. [ ] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
    * Create a test issue on the primary and wait for it to appear on the secondary
    * This should take less than 5 minutes
1. [ ] 🐺 {+Coordinator+}: Ensure sidekiq is healthy
    * `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
    * `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
    * Staging: https://staging.gitlab.com/admin/background_jobs
    * Production: https://gitlab.com/admin/background_jobs
    * From a rails console: `Sidekiq::Stats.new`
    * "Dead" jobs will be lost on failover but can be ignored as we routinely ignore them
    * "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [ ] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
    * Staging: https://staging.gitlab.com/admin/geo_nodes
    * Production: https://gitlab.com/admin/geo_nodes
    * Observe the "Sync Information" tab for the secondary
    * See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
    * Staging: some failures and unsynced repositories are expected
1. [ ] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
    * Staging: some failures and unsynced files are expected
    * Production: this may fluctuate around 0 due to background upload. This is OK.
1. [ ] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
    * In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
        * This may be `nil`. If so, perform a `git push` to a random project to generate a new event
    * In a rails console for the secondary: `Geo::EventLogState.last_processed`
    * All numbers should be within 10,000 of each other.
1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
    * Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
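
The Sidekiq and Geo event log thresholds in this section can be expressed as two small predicates for use in a rails console. This is a minimal sketch, not an official check: `sidekiq_healthy?` and `event_log_caught_up?` are hypothetical helpers. It assumes `Busy` in the admin page corresponds to `workers_size` on `Sidekiq::Stats`, and that the event IDs are passed in as plain integers, however the console exposes them.

```ruby
# Thresholds taken from the checklist items above.
MAX_QUEUED_JOBS = 10_000
MAX_RETRIES = 100
MAX_EVENT_LOG_GAP = 10_000

# `stats` must respond to workers_size (busy), enqueued and retry_size,
# as Sidekiq::Stats does.
def sidekiq_healthy?(stats)
  total = stats.workers_size + stats.enqueued + stats.retry_size
  total < MAX_QUEUED_JOBS && stats.retry_size < MAX_RETRIES
end

# Takes the event IDs gathered from the primary and secondary consoles.
# A nil ID (no events recorded yet) counts as "not caught up".
def event_log_caught_up?(*event_ids)
  ids = event_ids.compact
  return false if ids.empty? || ids.size < event_ids.size

  ids.max - ids.min <= MAX_EVENT_LOG_GAP
end
```

For example, `sidekiq_healthy?(Sidekiq::Stats.new)` covers the queue totals; `Scheduled` jobs still need a manual look, since their run times matter rather than their count.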

## Verify the integrity of replicated repositories and wikis

1. [ ] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
    * Staging: https://gstg.gitlab.com/admin/geo_nodes
    * Production: https://gprd.gitlab.com/admin/geo_nodes
    * Review the numbers under the `Verification Information` tab for the
      **secondary** node
    * If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage
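
The "at least 99% complete, 0 failed" rule is easy to misread when a dashboard rounds percentages: with millions of repositories, a handful of failures still displays as 0%. The sketch below makes the rule explicit. `verification_ok?` and its parameter names are hypothetical; the counts are assumed to be read manually from the `Verification Information` tab.

```ruby
# Minimum completion percentage from the checklist item above.
MIN_COMPLETE_PERCENT = 99.0

# True only when at least 99% of items are verified AND the absolute
# number of failures is zero (not merely a failure rate that rounds to 0%).
def verification_ok?(total:, verified:, failed:)
  return false if total <= 0

  percent = verified * 100.0 / total
  percent >= MIN_COMPLETE_PERCENT && failed.zero?
end
```

For instance, `verification_ok?(total: 5_000_000, verified: 4_999_999, failed: 1)` is `false` even though the failure rate rounds to 0%.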

## Perform an automated QA run against the current infrastructure

1. [ ] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [ ] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.

## Schedule the failover

1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with the release managers (RMs) for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

/label ~"Failover Execution"