# Pre-flight checks

## Dashboards and Alerts

1. [ ] 🐺 {+Coordinator+}: Ensure that there are no active alerts on the Azure or GCP environments.
    - Staging
        - GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
        - Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
    - Production
        - GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
        - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
1. [ ] 🐺 {+Coordinator+}: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
    - Staging
        - GCP `gstg`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
        - Azure Staging: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
    - Production
        - GCP `gprd`: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
        - Azure Production: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd

## GitLab Version and CDN Checks

1. [ ] 🐺 {+Coordinator+}: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier` == `db`): auto-restarting the databases has caused problems in the past, so they are now only updated in a controlled way.
    - Versions can be confirmed using the Omnibus version tracker dashboards:
        - Staging
            - GCP `gstg`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gstg
            - Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
        - Production
            - GCP `gprd`: https://dashboards.gitlab.net/d/CRNfDC7mk/gitlab-omnibus-versions?refresh=5m&orgId=1&var-environment=gprd
            - Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd

1. [ ] 🐺 {+Coordinator+}: Ensure that the Fastly CDN IP ranges are up to date.
    - Check the following Chef roles against the official IP list: https://api.fastly.com/public-ip-list
        - Staging
            - GCP `gstg`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gstg-base-lb-fe.json#L48
        - Production
            - GCP `gprd`: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gprd-base-lb-fe.json#L56
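
The Fastly comparison above can be done as a set difference rather than by eye. The sketch below is only an illustration: it assumes the official list is JSON with an `addresses` array, and that the ranges in the Chef role have been copied out by hand into an array. `fastly_range_diff` is a hypothetical helper and the sample ranges are made-up documentation ranges, not real Fastly data.

```ruby
require "json"
require "set"

# Hypothetical helper: compare the official Fastly ranges with the ranges
# configured in a Chef role. Both arguments are arrays of CIDR strings.
def fastly_range_diff(official, configured)
  official_set = official.map(&:strip).to_set
  configured_set = configured.map(&:strip).to_set
  {
    missing: (official_set - configured_set).sort, # published, but absent from the role
    stale: (configured_set - official_set).sort    # in the role, but no longer published
  }
end

# Made-up sample data (assumption: the official list exposes an "addresses" key).
official_json = '{"addresses": ["192.0.2.0/24", "198.51.100.0/24"]}'
role_ranges = ["192.0.2.0/24", "203.0.113.0/24"]

diff = fastly_range_diff(JSON.parse(official_json).fetch("addresses"), role_ranges)
puts "missing from role: #{diff[:missing].join(', ')}"
puts "stale in role: #{diff[:stale].join(', ')}"
```

Both lists being empty means the role matches the published list.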

## Object storage

1. [ ] 🐺 {+Coordinator+}: Ensure primary and secondary share the same object storage configuration. For each line below,
    execute the line first on the primary console and copy the result to the clipboard. Then execute the same line on the secondary console,
    appending `==` and pasting the result from the primary console. You should get `true` or `false`; `true` means the configurations match.
    1. [ ] `Gitlab.config.uploads`
    1. [ ] `Gitlab.config.lfs`
    1. [ ] `Gitlab.config.artifacts`
1. [ ] 🐺 {+Coordinator+}: Ensure all artifacts and LFS objects are in object storage
    * If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
    * On staging, these numbers are non-zero. Just mark as checked.
    1. [ ] `Upload.with_files_stored_locally.count` # => 0
    1. [ ] `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
    1. [ ] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
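
When the `==` comparison in the first check of this section returns `false`, it helps to see which settings differ. A minimal sketch, assuming each pasted result has first been converted to a plain nested Hash (how to do that depends on the settings object and is not shown here). `config_diff` and the sample values are hypothetical.

```ruby
# Hypothetical helper: list the key paths whose values differ between two
# nested hashes, e.g. the primary and secondary object storage settings.
def config_diff(primary, secondary, path = [])
  (primary.keys | secondary.keys).flat_map do |key|
    left = primary[key]
    right = secondary[key]
    here = path + [key]
    if left.is_a?(Hash) && right.is_a?(Hash)
      config_diff(left, right, here) # recurse into nested sections
    elsif left == right
      []
    else
      [{ path: here.join("."), primary: left, secondary: right }]
    end
  end
end

# Made-up sample values for illustration only.
primary = { "enabled" => true, "object_store" => { "enabled" => true, "remote_directory" => "uploads-a" } }
secondary = { "enabled" => true, "object_store" => { "enabled" => true, "remote_directory" => "uploads-b" } }

puts secondary == primary
config_diff(primary, secondary).each do |d|
  puts "#{d[:path]}: primary=#{d[:primary].inspect} secondary=#{d[:secondary].inspect}"
end
```

A key that exists on only one side is reported with `nil` for the other side.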

## Pre-migrated services

1. [ ] 🐺 {+Coordinator+}: Check that the container registry has been [pre-migrated to GCP](https://gitlab.com/gitlab-com/migration/issues/466)

## Configuration checks

1. [ ] 🐺 {+Coordinator+}: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
    * A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in [infrastructure#4280](https://gitlab.com/gitlab-com/infrastructure/issues/4280).
1. Compare some files on a representative node (a web worker) between primary and secondary:
    1. [ ] Manually compare the diff of `/etc/gitlab/gitlab.rb`
    1. [ ] Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
1. [ ] 🐺 {+Coordinator+}: Check SSH host keys match
    * Staging:
        - [ ] `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
        - [ ] `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
    * Production:
        - [ ] `bin/compare-host-keys gitlab.com gprd.gitlab.com`
        - [ ] `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
1. [ ] 🐺 {+Coordinator+}: Ensure the repository and wiki verification feature flag shows as enabled on both **primary** and **secondary**
    * `Feature.enabled?(:geo_repository_verification)`
1. [ ] 🐺 {+Coordinator+}: Ensure the TTL for affected DNS records is low
    * 300 seconds is fine
    * Staging:
        - [ ] `staging.gitlab.com`
        - [ ] `altssh.staging.gitlab.com`
        - [ ] `gitlab-org.staging.gitlab.io`
    * Production:
        - [ ] `gitlab.com`
        - [ ] `altssh.gitlab.com`
        - [ ] `gitlab-org.gitlab.io`
1. [ ] 🐺 {+Coordinator+}: Ensure SSL configuration on the secondary is valid for primary domain names too
    * Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
    * Staging:
        - [ ] `bin/check-ssl gstg.gitlab.com:443`
        - [ ] `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
    * Production:
        - [ ] `bin/check-ssl gprd.gitlab.com:443`
        - [ ] `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
1. [ ] 🔪 {+Chef-Runner+}: Ensure SSH connectivity to all hosts, including host key verification
    * `chef-client role:gitlab-base pwd`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
    1. [ ] `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
1. [ ] 🔪 {+Chef-Runner+}: Ensure that mailroom nodes have been configured with the right roles:
    * Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
    * Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure all hot-patches are applied to the target environment:
    1. Fetch the latest version of [post-deployment-patches](https://dev.gitlab.org/gitlab/post-deployment-patches/)
    1. Check the omnibus version running in the target environment
         * Staging: `knife role show gstg-omnibus-version | grep version:`
         * Production: `knife role show gprd-omnibus-version | grep version:`
    1. In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
         * E.g. In `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
    1. Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
         * The command can fail if the patches have already been applied; that's OK.
1. [ ] 🔪 {+Chef-Runner+}: Outstanding merge requests are up to date vs. `master`:
    * Staging:
        * [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094)
        * [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029)
        * [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
        * [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
        * [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
        * [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
    * Production:
        * [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
        * [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
        * [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
        * [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
        * [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
        * [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)

1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
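
The DNS TTL check earlier in this section can be scripted with Ruby's standard `resolv` library. This is a sketch under assumptions: it only looks at A records, and `record_ttls` and `ttl_low_enough?` are hypothetical helper names. Note that a caching resolver reports the remaining TTL, which can be lower than the configured value, so querying an authoritative nameserver gives a more reliable answer.

```ruby
require "resolv"

MAX_TTL = 300 # seconds; the checklist treats 300 as acceptable

# Returns the TTLs (in seconds) of the A records returned for +name+.
# The resolver is injectable so the logic can be exercised without a network.
def record_ttls(name, resolver: Resolv::DNS.new)
  resolver.getresources(name, Resolv::DNS::Resource::IN::A).map(&:ttl)
end

# True when at least one record was found and every TTL is at or below +max+.
def ttl_low_enough?(ttls, max: MAX_TTL)
  !ttls.empty? && ttls.all? { |ttl| ttl <= max }
end

# Usage (performs real DNS lookups, so it is left commented out):
#   %w[staging.gitlab.com altssh.staging.gitlab.com].each do |name|
#     ttls = record_ttls(name)
#     puts "#{name}: #{ttls.inspect} ok=#{ttl_low_enough?(ttls)}"
#   end
```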

## Ensure Geo replication is up to date

1. [ ] 🐺 {+Coordinator+}: Ensure database replication is healthy and up to date
    * Create a test issue on the primary and wait for it to appear on the secondary
    * This should take no more than 5 minutes
1. [ ] 🐺 {+Coordinator+}: Ensure Sidekiq is healthy
    * `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
    * `Scheduled` jobs should not be present, or should all be scheduled to be run before the failover starts
    * Staging: https://staging.gitlab.com/admin/background_jobs
    * Production: https://gitlab.com/admin/background_jobs
    * From a rails console: `Sidekiq::Stats.new`
    * "Dead" jobs will be lost on failover. This is acceptable, as we routinely ignore them
    * "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
1. [ ] 🐺 {+Coordinator+}: Ensure **repositories** and **wikis** are at least 99% complete, 0 failed (that’s zero, not 0%):
    * Staging: https://staging.gitlab.com/admin/geo_nodes
    * Production: https://gitlab.com/admin/geo_nodes
    * Observe the "Sync Information" tab for the secondary
    * See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
    * Staging: some failures and unsynced repositories are expected
1. [ ] 🐺 {+Coordinator+}: Local **CI artifacts**, **LFS objects** and **Uploads** should have 0 in all columns
    * Staging: some failures and unsynced files are expected
    * Production: this may fluctuate around 0 due to background upload. This is OK.
1. [ ] 🐺 {+Coordinator+}: Ensure Geo event log is being processed
    * In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
        * This may be `nil`. If so, perform a `git push` to a random project to generate a new event
    * In a rails console for the secondary: `Geo::EventLogState.last_processed`
    * All numbers should be within 10,000 of each other.
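
The numeric thresholds in this section can be written down as small predicates, which makes the pass/fail boundaries explicit. This is a sketch only: the helper names are hypothetical, the counts are passed in as plain integers read from the admin pages or a console, and the sample numbers are made up.

```ruby
# Hypothetical helpers encoding the thresholds from the checklist above.

# Busy + Enqueued + Retries must total less than 10,000, with fewer than 100 retries.
def sidekiq_healthy?(busy:, enqueued:, retries:)
  retries < 100 && (busy + enqueued + retries) < 10_000
end

# At least 99% synced and exactly zero failures (zero, not 0%).
# An empty total is treated as "not OK" so that it gets a human look.
def geo_sync_ok?(synced:, failed:, total:)
  return false if total.zero?

  failed.zero? && synced * 100 >= total * 99
end

# All event log ids must be within +tolerance+ of each other.
# A nil id (no events yet) counts as "not caught up".
def event_log_caught_up?(*ids, tolerance: 10_000)
  return false if ids.empty? || ids.any?(&:nil?)

  ids.max - ids.min <= tolerance
end

# Made-up sample numbers for illustration.
puts sidekiq_healthy?(busy: 120, enqueued: 4_300, retries: 12)
puts geo_sync_ok?(synced: 9_950, failed: 0, total: 10_000)
puts event_log_caught_up?(1_250_000, 1_250_000, 1_243_500)
```

On staging, where some failures and unsynced repositories are expected, `geo_sync_ok?` would legitimately return `false`.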

## Verify the integrity of replicated repositories and wikis

1. [ ] 🐺 {+Coordinator+}: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
    * Staging: https://gstg.gitlab.com/admin/geo_nodes
    * Production: https://gprd.gitlab.com/admin/geo_nodes
    * Review the numbers under the `Verification Information` tab for the
      **secondary** node
    * If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
1. No need to verify the integrity of anything in object storage

## Perform an automated QA run against the current infrastructure

1. [ ] 🏆 {+ Quality +}: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
1. [ ] 🏆 {+ Quality +}: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.

## Schedule the failover

1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "test plan" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=test_plan)
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failback" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failback)
1. [ ] 🐺 {+Coordinator+}: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues

/label ~"Failover Execution"