# 2018-07-31 STAGING failover attempt: preflight checks

## Pre-flight checks
### Dashboards and Alerts
- [ ] 🐺 Coordinator: Ensure that there are no active alerts on the Azure or GCP environments.
  - Staging
    - GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
    - Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
  - Production
    - GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
    - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
- [ ] 🐺 Coordinator: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
  - Staging
  - Production
### GitLab Version and CDN Checks
- [ ] 🐺 Coordinator: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier == db`), as there have been problems in the past with auto-restarting the databases; they now only get updated in a controlled way.
  - Versions can be confirmed using the Omnibus version tracker dashboards:
    - Staging
    - Production
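The version rule can be illustrated as follows, with made-up version strings standing in for what the version tracker dashboards report:

```ruby
# Hypothetical versions reported for one Azure node and one GCP node.
# (db-tier nodes are exempt from this check, per the step above.)
azure_version = "11.1.4-ee"
gcp_version   = "11.1.4-ee"

# Compare the major.minor components only.
same_minor = azure_version.split(".")[0, 2] == gcp_version.split(".")[0, 2]
puts same_minor ? "versions OK" : "version mismatch"
```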
- [ ] 🐺 Coordinator: Ensure that the Fastly CDN IP ranges are up to date.
  - Check the following Chef roles against the official IP list: https://api.fastly.com/public-ip-list
### Object storage
- [ ] 🐺 Coordinator: Ensure primary and secondary share the same object storage configuration. For each line below, execute the line first on the primary console and copy the result to the clipboard; then execute the same line on the secondary console, appending `==` and pasting in the result from the primary console. You should get a `true` or `false` value.
  - `Gitlab.config.uploads`
  - `Gitlab.config.lfs`
  - `Gitlab.config.artifacts`
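The console comparison can be sketched like this, with illustrative hashes standing in for the real `Gitlab.config` values:

```ruby
# Illustrative stand-ins for the output of Gitlab.config.uploads on each console;
# the real values come from the primary and secondary rails consoles.
primary_uploads   = { "object_store" => { "enabled" => true, "remote_directory" => "uploads" } }
secondary_uploads = { "object_store" => { "enabled" => true, "remote_directory" => "uploads" } }

# On the secondary, appending `== <pasted primary value>` evaluates to true/false:
puts secondary_uploads == primary_uploads
```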
- [ ] 🐺 Coordinator: Ensure all artifacts and LFS objects are in object storage
  - If direct upload isn't enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
  - On staging, these numbers are non-zero. Just mark as checked.
  - `Upload.with_files_stored_locally.count` # => 0
  - `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
  - `Ci::JobArtifact.with_files_stored_locally.count` # => 0
### Pre-migrated services
- [ ] 🐺 Coordinator: Check that the container registry has been pre-migrated to GCP
### Configuration checks
- [ ] 🐺 Coordinator: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
  - A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in infrastructure#4280.
- [ ] Compare some files on a representative node (a web worker) between primary and secondary:
  - Manually compare the diff of `/etc/gitlab/gitlab.rb`
  - Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
- [ ] 🐺 Coordinator: Check SSH host keys match
  - Staging:
    - `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
    - `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
  - Production:
    - `bin/compare-host-keys gitlab.com gprd.gitlab.com`
    - `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
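Conceptually, the comparison checks that both hostnames present the same set of host keys. A minimal sketch of that check, with made-up key strings standing in for real `ssh-keyscan` output:

```ruby
# Made-up host keys standing in for scanned keys from each hostname.
primary_keys   = ["ssh-ed25519 AAAAC3...", "ssh-rsa AAAAB3..."]
secondary_keys = ["ssh-rsa AAAAB3...", "ssh-ed25519 AAAAC3..."]

# Order doesn't matter; the key sets must be identical.
puts primary_keys.sort == secondary_keys.sort ? "host keys match" : "MISMATCH"
```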
- [ ] 🐺 Coordinator: Ensure the repository and wiki verification feature flag shows as enabled on both primary and secondary: `Feature.enabled?(:geo_repository_verification)`
- [ ] 🐺 Coordinator: Ensure the TTL for affected DNS records is low
  - 300 seconds is fine
  - Staging:
    - `staging.gitlab.com`
    - `altssh.staging.gitlab.com`
    - `gitlab-org.staging.gitlab.io`
  - Production:
    - `gitlab.com`
    - `altssh.gitlab.com`
    - `gitlab-org.gitlab.io`
- [ ] 🐺 Coordinator: Ensure SSL configuration on the secondary is valid for primary domain names too
  - Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
  - Staging:
    - `bin/check-ssl gstg.gitlab.com:443`
    - `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
  - Production:
    - `bin/check-ssl gprd.gitlab.com:443`
    - `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
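The property being checked is that the certificate served on the secondary also covers the primary's hostnames (via its subject alternative names). A sketch of that verification, using a throwaway self-signed certificate in place of a live connection:

```ruby
require "openssl"

# Build a throwaway cert for the secondary that also lists the primary hostname
# in its SAN extension (hostnames are the real ones; the cert itself is fake).
key  = OpenSSL::PKey::RSA.new(2048)
cert = OpenSSL::X509::Certificate.new
cert.version = 2
cert.serial  = 1
cert.subject = OpenSSL::X509::Name.parse("/CN=gstg.gitlab.com")
cert.issuer  = cert.subject
cert.public_key = key.public_key
cert.not_before = Time.now
cert.not_after  = Time.now + 3600

ef = OpenSSL::X509::ExtensionFactory.new
ef.subject_certificate = cert
ef.issuer_certificate  = cert
cert.add_extension(ef.create_extension("subjectAltName", "DNS:gstg.gitlab.com,DNS:staging.gitlab.com"))
cert.sign(key, OpenSSL::Digest.new("SHA256"))

# The secondary's certificate must also be valid for the primary's hostname:
puts OpenSSL::SSL.verify_certificate_identity(cert, "staging.gitlab.com")
```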
- [ ] 🔪 Chef-Runner: Ensure SSH connectivity to all hosts, including host key verification: `bundle exec knife ssh "role:gitlab-base" pwd`
- [ ] 🔪 Chef-Runner: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
  - `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
- [ ] 🔪 Chef-Runner: Ensure that mailroom nodes have been configured with the right roles:
  - Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
  - Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
- [ ] 🔪 Chef-Runner: Ensure all hot-patches are applied to the target environment:
  - Fetch the latest version of post-deployment-patches
  - Check the omnibus version running in the target environment
    - Staging: `knife role show gstg-omnibus-version | grep version:`
    - Production: `knife role show gprd-omnibus-version | grep version:`
  - In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
    - E.g. in `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
  - Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
  - The command can fail because the patches may have already been applied; that's OK.
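The manifest check above can be sketched programmatically. The YAML structure here is assumed from the `versions.11.1.0-rc10-ee.environments.staging` path given in the example, not taken from a real manifest:

```ruby
require "yaml"

# Illustrative MANIFEST.yml fragment (structure assumed from the step above).
manifest = YAML.safe_load(<<~YML)
  versions:
    11.1.0-rc10-ee:
      environments:
        staging:
          - staging-base-fe-api
          - gstg-base-fe-api
YML

# The GCP (gstg-) role must be present alongside the Azure (staging-) role.
roles = manifest.dig("versions", "11.1.0-rc10-ee", "environments", "staging")
puts roles.any? { |r| r.start_with?("gstg-") } ? "GCP role present" : "missing GCP role"
```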
- [ ] 🔪 Chef-Runner: Outstanding merge requests are up to date vs. `master`:
  - Staging:
  - Production:
- [ ] 🐘 Database-Wrangler: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
### Ensure Geo replication is up to date
- [ ] 🐺 Coordinator: Ensure database replication is healthy and up to date
  - Create a test issue on the primary and wait for it to appear on the secondary
  - This should take at most 5 minutes
- [ ] 🐺 Coordinator: Ensure Sidekiq is healthy
  - `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
  - `Scheduled` jobs should not be present, or should all be scheduled to run before the failover starts
  - Staging: https://staging.gitlab.com/admin/background_jobs
  - Production: https://gitlab.com/admin/background_jobs
  - From a rails console: `Sidekiq::Stats.new`
  - "Dead" jobs will be lost on failover, but can be ignored as we routinely ignore them
  - "Failed" is just a counter that includes dead jobs for the last 5 years, so it can be ignored
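The thresholds above can be sketched with sample figures standing in for the live Busy/Enqueued/Retries numbers from the admin page or `Sidekiq::Stats.new`:

```ruby
# Sample values standing in for the live Busy, Enqueued and Retries figures.
busy, enqueued, retries = 1_200, 3_400, 42

# Healthy: combined total under 10,000 AND fewer than 100 retries.
healthy = (busy + enqueued + retries) < 10_000 && retries < 100
puts healthy ? "sidekiq healthy" : "investigate before failover"
```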
- [ ] 🐺 Coordinator: Ensure repositories and wikis are at least 99% complete, 0 failed (that's zero, not 0%):
  - Staging: https://staging.gitlab.com/admin/geo_nodes
  - Production: https://gitlab.com/admin/geo_nodes
  - Observe the "Sync Information" tab for the secondary
  - See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
  - Staging: some failures and unsynced repositories are expected
- [ ] 🐺 Coordinator: Local CI artifacts, LFS objects and Uploads should have 0 in all columns
  - Staging: some failures and unsynced files are expected
  - Production: this may fluctuate around 0 due to background upload. This is OK.
- [ ] 🐺 Coordinator: Ensure the Geo event log is being processed
  - In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
    - This may be `nil`. If so, perform a `git push` to a random project to generate a new event
  - In a rails console for the secondary: `Geo::EventLogState.last_processed`
  - All numbers should be within 10,000 of each other.
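As an illustration of the "within 10,000" rule, with sample numbers standing in for the console output on each node:

```ruby
# Sample numbers standing in for Geo::EventLog.maximum(:id) on the primary
# and the id reported by Geo::EventLogState.last_processed on the secondary.
primary_max_id = 1_204_500
last_processed = 1_198_700

lag = (primary_max_id - last_processed).abs
puts lag < 10_000 ? "event log caught up (lag #{lag})" : "event log lagging (lag #{lag})"
```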
### Verify the integrity of replicated repositories and wikis
- [ ] 🐺 Coordinator: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that's zero, not 0%):
  - Staging: https://gstg.gitlab.com/admin/geo_nodes
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Review the numbers under the "Verification Information" tab for the secondary node
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  - No need to verify the integrity of anything in object storage
### Perform an automated QA run against the current infrastructure
- [ ] 🏆 Quality: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
- [ ] 🏆 Quality: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
### Schedule the failover
- [ ] 🐺 Coordinator: Ask the 🔪 Chef-Runner and 🐘 Database-Wrangler to perform their preflight tasks
- [ ] 🐺 Coordinator: Pick a date and time for the failover itself that won't interfere with the release team's work
- [ ] 🐺 Coordinator: Verify with the RMs for the next release that the chosen date is OK
- [ ] 🐺 Coordinator: Create a new issue in the tracker using the "failover" template
- [ ] 🐺 Coordinator: Create a new issue in the tracker using the "test plan" template
- [ ] 🐺 Coordinator: Create a new issue in the tracker using the "failback" template
- [ ] 🐺 Coordinator: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues