# 2018-07-31 STAGING failover attempt: preflight checks

## Pre-flight checks
### Dashboards and Alerts
- [ ] 🐺 Coordinator: Ensure that there are no active alerts on the Azure or GCP environments.
  - Staging
    - GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
    - Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
  - Production
    - GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
    - Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
- [ ] 🐺 Coordinator: Review the failover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)
  - Staging
  - Production
### GitLab Version and CDN Checks
- [ ] 🐺 Coordinator: Ensure that both sides are running the same minor version. It's OK if the minor version differs for `db` nodes (`tier == db`), as there have been problems in the past with auto-restarting the databases; they now only get updated in a controlled way.
  - Versions can be confirmed using the Omnibus version tracker dashboards:
    - Staging
    - Production
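The version rule can be illustrated as follows, with made-up version strings standing in for what the version tracker dashboards report:

```ruby
# Hypothetical versions reported for one Azure node and one GCP node.
# (db-tier nodes are exempt from this check, per the step above.)
azure_version = "11.1.4-ee"
gcp_version   = "11.1.4-ee"

# Compare the major.minor components only.
same_minor = azure_version.split(".")[0, 2] == gcp_version.split(".")[0, 2]
puts same_minor ? "versions OK" : "version mismatch"
```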
- [ ] 🐺 Coordinator: Ensure that the Fastly CDN IP ranges are up to date.
  - Check the following Chef roles against the official IP list: https://api.fastly.com/public-ip-list
### Object storage
- [ ] 🐺 Coordinator: Ensure primary and secondary share the same object storage configuration. For each line below, execute the line first on the primary console and copy the result to the clipboard; then execute the same line on the secondary console, appending `==` and pasting in the result from the primary console. You should get a `true` or `false` value.
  - `Gitlab.config.uploads`
  - `Gitlab.config.lfs`
  - `Gitlab.config.artifacts`
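The console comparison can be sketched like this, with illustrative hashes standing in for the real `Gitlab.config` values:

```ruby
# Illustrative stand-ins for the output of Gitlab.config.uploads on each console;
# the real values come from the primary and secondary rails consoles.
primary_uploads   = { "object_store" => { "enabled" => true, "remote_directory" => "uploads" } }
secondary_uploads = { "object_store" => { "enabled" => true, "remote_directory" => "uploads" } }

# On the secondary, appending `== <pasted primary value>` evaluates to true/false:
puts secondary_uploads == primary_uploads
```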
- [ ] 🐺 Coordinator: Ensure all artifacts and LFS objects are in object storage
  - If direct upload isn't enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
  - On staging, these numbers are non-zero. Just mark as checked.
  - `Upload.with_files_stored_locally.count` # => 0
  - `LfsObject.with_files_stored_locally.count` # => 13 (there are a small number of known-lost LFS objects)
  - `Ci::JobArtifact.with_files_stored_locally.count` # => 0
### Pre-migrated services
- [ ] 🐺 Coordinator: Check that the container registry has been pre-migrated to GCP
### Configuration checks
- [ ] 🐺 Coordinator: Ensure `gitlab-rake gitlab:geo:check` reports no errors on the primary or secondary
  - A warning may be output regarding `AuthorizedKeysCommand`. This is OK, and tracked in infrastructure#4280.
- [ ] Compare some files on a representative node (a web worker) between primary and secondary:
  - Manually compare the diff of `/etc/gitlab/gitlab.rb`
  - Manually compare the diff of `/etc/gitlab/gitlab-secrets.json`
- [ ] 🐺 Coordinator: Check SSH host keys match
  - Staging:
    - `bin/compare-host-keys staging.gitlab.com gstg.gitlab.com`
    - `SSH_PORT=443 bin/compare-host-keys altssh.staging.gitlab.com altssh.gstg.gitlab.com`
  - Production:
    - `bin/compare-host-keys gitlab.com gprd.gitlab.com`
    - `SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com`
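Conceptually, the comparison checks that both hostnames present the same set of host keys. A minimal sketch of that check, with made-up key strings standing in for real `ssh-keyscan` output:

```ruby
# Made-up host keys standing in for scanned keys from each hostname.
primary_keys   = ["ssh-ed25519 AAAAC3...", "ssh-rsa AAAAB3..."]
secondary_keys = ["ssh-rsa AAAAB3...", "ssh-ed25519 AAAAC3..."]

# Order doesn't matter; the key sets must be identical.
puts primary_keys.sort == secondary_keys.sort ? "host keys match" : "MISMATCH"
```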
- [ ] 🐺 Coordinator: Ensure the repository and wiki verification feature flag shows as enabled on both primary and secondary: `Feature.enabled?(:geo_repository_verification)`
- [ ] 🐺 Coordinator: Ensure the TTL for affected DNS records is low
  - 300 seconds is fine
  - Staging:
    - `staging.gitlab.com`
    - `altssh.staging.gitlab.com`
    - `gitlab-org.staging.gitlab.io`
  - Production:
    - `gitlab.com`
    - `altssh.gitlab.com`
    - `gitlab-org.gitlab.io`
- [ ] 🐺 Coordinator: Ensure SSL configuration on the secondary is valid for primary domain names too
  - Handy script in the migration repository: `bin/check-ssl <hostname>:<port>`
  - Staging:
    - `bin/check-ssl gstg.gitlab.com:443`
    - `bin/check-ssl gitlab-org.gstg.gitlab.io:443`
  - Production:
    - `bin/check-ssl gprd.gitlab.com:443`
    - `bin/check-ssl gitlab-org.gprd.gitlab.io:443`
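The property being checked is that the certificate served on the secondary also covers the primary's hostnames (via its subject alternative names). A sketch of that verification, using a throwaway self-signed certificate in place of a live connection:

```ruby
require "openssl"

# Build a throwaway cert for the secondary that also lists the primary hostname
# in its SAN extension (hostnames are the real ones; the cert itself is fake).
key  = OpenSSL::PKey::RSA.new(2048)
cert = OpenSSL::X509::Certificate.new
cert.version = 2
cert.serial  = 1
cert.subject = OpenSSL::X509::Name.parse("/CN=gstg.gitlab.com")
cert.issuer  = cert.subject
cert.public_key = key.public_key
cert.not_before = Time.now
cert.not_after  = Time.now + 3600

ef = OpenSSL::X509::ExtensionFactory.new
ef.subject_certificate = cert
ef.issuer_certificate  = cert
cert.add_extension(ef.create_extension("subjectAltName", "DNS:gstg.gitlab.com,DNS:staging.gitlab.com"))
cert.sign(key, OpenSSL::Digest.new("SHA256"))

# The secondary's certificate must also be valid for the primary's hostname:
puts OpenSSL::SSL.verify_certificate_identity(cert, "staging.gitlab.com")
```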
- [ ] 🔪 Chef-Runner: Ensure SSH connectivity to all hosts, including host key verification: `bundle exec knife ssh "role:gitlab-base" pwd`
- [ ] 🔪 Chef-Runner: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
  - `bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'`
- [ ] 🔪 Chef-Runner: Ensure that mailroom nodes have been configured with the right roles:
  - Staging: `bundle exec knife ssh "role:gstg-base-be-mailroom" hostname`
  - Production: `bundle exec knife ssh "role:gprd-base-be-mailroom" hostname`
- [ ] 🔪 Chef-Runner: Ensure all hot-patches are applied to the target environment:
  - Fetch the latest version of post-deployment-patches
  - Check the omnibus version running in the target environment
    - Staging: `knife role show gstg-omnibus-version | grep version:`
    - Production: `knife role show gprd-omnibus-version | grep version:`
  - In `post-deployment-patches`, ensure that the version manifest has a corresponding GCP Chef role under the target environment
    - E.g. in `11.1/MANIFEST.yml`, `versions.11.1.0-rc10-ee.environments.staging` should have `gstg-base-fe-api` along with `staging-base-fe-api`
  - Run `gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod`
  - The command can fail because the patches may have already been applied; that's OK.
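The manifest check above can be sketched programmatically. The YAML structure here is assumed from the `versions.11.1.0-rc10-ee.environments.staging` path given in the example, not taken from a real manifest:

```ruby
require "yaml"

# Illustrative MANIFEST.yml fragment (structure assumed from the step above).
manifest = YAML.safe_load(<<~YML)
  versions:
    11.1.0-rc10-ee:
      environments:
        staging:
          - staging-base-fe-api
          - gstg-base-fe-api
YML

# The GCP (gstg-) role must be present alongside the Azure (staging-) role.
roles = manifest.dig("versions", "11.1.0-rc10-ee", "environments", "staging")
puts roles.any? { |r| r.start_with?("gstg-") } ? "GCP role present" : "missing GCP role"
```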
- [ ] 🔪 Chef-Runner: Outstanding merge requests are up to date vs. `master`:
  - Staging:
  - Production:
- [ ] 🐘 Database-Wrangler: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
### Ensure Geo replication is up to date
- [ ] 🐺 Coordinator: Ensure database replication is healthy and up to date
  - Create a test issue on the primary and wait for it to appear on the secondary
  - This should take at most 5 minutes
- [ ] 🐺 Coordinator: Ensure Sidekiq is healthy
  - `Busy` + `Enqueued` + `Retries` should total less than 10,000, with fewer than 100 retries
  - `Scheduled` jobs should not be present, or should all be scheduled to run before the failover starts
  - Staging: https://staging.gitlab.com/admin/background_jobs
  - Production: https://gitlab.com/admin/background_jobs
  - From a rails console: `Sidekiq::Stats.new`
  - "Dead" jobs will be lost on failover, but can be ignored as we routinely ignore them
  - "Failed" is just a counter that includes dead jobs for the last 5 years, so it can be ignored
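The thresholds above can be sketched with sample figures standing in for the live Busy/Enqueued/Retries numbers from the admin page or `Sidekiq::Stats.new`:

```ruby
# Sample values standing in for the live Busy, Enqueued and Retries figures.
busy, enqueued, retries = 1_200, 3_400, 42

# Healthy: combined total under 10,000 AND fewer than 100 retries.
healthy = (busy + enqueued + retries) < 10_000 && retries < 100
puts healthy ? "sidekiq healthy" : "investigate before failover"
```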
- [ ] 🐺 Coordinator: Ensure repositories and wikis are at least 99% complete, 0 failed (that's zero, not 0%):
  - Staging: https://staging.gitlab.com/admin/geo_nodes
  - Production: https://gitlab.com/admin/geo_nodes
  - Observe the "Sync Information" tab for the secondary
  - See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
  - Staging: some failures and unsynced repositories are expected
- [ ] 🐺 Coordinator: Local CI artifacts, LFS objects and Uploads should have 0 in all columns
  - Staging: some failures and unsynced files are expected
  - Production: this may fluctuate around 0 due to background upload. This is OK.
- [ ] 🐺 Coordinator: Ensure the Geo event log is being processed
  - In a rails console for both primary and secondary: `Geo::EventLog.maximum(:id)`
    - This may be `nil`. If so, perform a `git push` to a random project to generate a new event
  - In a rails console for the secondary: `Geo::EventLogState.last_processed`
  - All numbers should be within 10,000 of each other.
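As an illustration of the "within 10,000" rule, with sample numbers standing in for the console output on each node:

```ruby
# Sample numbers standing in for Geo::EventLog.maximum(:id) on the primary
# and the id reported by Geo::EventLogState.last_processed on the secondary.
primary_max_id = 1_204_500
last_processed = 1_198_700

lag = (primary_max_id - last_processed).abs
puts lag < 10_000 ? "event log caught up (lag #{lag})" : "event log lagging (lag #{lag})"
```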
### Verify the integrity of replicated repositories and wikis
- [ ] 🐺 Coordinator: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that's zero, not 0%):
  - Staging: https://gstg.gitlab.com/admin/geo_nodes
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Review the numbers under the "Verification Information" tab for the secondary node
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  - No need to verify the integrity of anything in object storage
### Perform an automated QA run against the current infrastructure
- [ ] 🏆 Quality: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
- [ ] 🏆 Quality: Post the result in the test plan issue. This will be used as the yardstick to compare the "During failover" automated QA run against.
### Schedule the failover
- [ ] 🐺 Coordinator: Ask the 🔪 Chef-Runner and 🐘 Database-Wrangler to perform their preflight tasks
- [ ] 🐺 Coordinator: Pick a date and time for the failover itself that won't interfere with the release team's work
- [ ] 🐺 Coordinator: Verify with the RMs for the next release that the chosen date is OK
- [ ] 🐺 Coordinator: Create a new issue in the tracker using the "failover" template
- [ ] 🐺 Coordinator: Create a new issue in the tracker using the "test plan" template
- [ ] 🐺 Coordinator: Create a new issue in the tracker using the "failback" template
- [ ] 🐺 Coordinator: Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues