
2018-08-11 PRODUCTION switchover attempt: preflight checks

Pre-flight checks

Dashboards and Alerts

  1. 🐺 Coordinator: Ensure that there are no active alerts in the Azure or GCP environments.
  2. 🐺 Coordinator: Review the switchover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)

GitLab Version and CDN Checks

  1. 🐺 Coordinator: Ensure that both sides are running the same minor version. It's OK if the minor version differs for db nodes (tier == db), as there have been problems in the past with auto-restarting the databases; they are now only updated in a controlled way.

  2. 🐺 Coordinator: Ensure that the Fastly CDN IP ranges are up to date.

Object storage

  1. 🐺 Coordinator: Ensure primary and secondary share the same object storage configuration. For each line below, execute it first on the primary console and copy the result, then execute the same line on the secondary console, appending == and the pasted result from the primary. Each comparison should evaluate to true; false means the configuration differs (see the sketch after this list).
    1. Gitlab.config.uploads
    2. Gitlab.config.lfs
    3. Gitlab.config.artifacts
  2. 🐺 Coordinator: Ensure all artifacts and LFS objects are in object storage
    • If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
    1. Upload.with_files_stored_locally.count # => 0
    2. LfsObject.with_files_stored_locally.count # => 13 (there are a small number of known-lost LFS objects)
    3. Ci::JobArtifact.with_files_stored_locally.count # => 0
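
For reference, a minimal console sketch of the comparison described in step 1 and the counts in step 2; the hash literal below is an illustrative placeholder, paste whatever the primary actually printed:

    # Primary console: evaluate each setting and copy the printed value.
    Gitlab.config.uploads
    # => {"enabled"=>true, "object_store"=>{"enabled"=>true, "remote_directory"=>"uploads"}}

    # Secondary console: re-run the same expression, append ==, and paste the
    # value copied from the primary. Each comparison should return true.
    Gitlab.config.uploads == {"enabled"=>true, "object_store"=>{"enabled"=>true, "remote_directory"=>"uploads"}}
    # => true

    # Step 2: confirm nothing is left on local disk (expected values from the checklist).
    Upload.with_files_stored_locally.count          # => 0
    LfsObject.with_files_stored_locally.count       # => 13 (known-lost LFS objects)
    Ci::JobArtifact.with_files_stored_locally.count # => 0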

Pre-migrated services

  1. 🐺 Coordinator: Check that the container registry has been pre-migrated to GCP

Configuration checks

  1. 🐺 Coordinator: Ensure gitlab-rake gitlab:geo:check reports no errors on the primary or secondary

    • A warning may be output regarding AuthorizedKeysCommand. This is OK, and tracked in infrastructure#4280.
  2. Compare some files on a representative node (a web worker) between primary and secondary:

    1. Manually compare the diff of /etc/gitlab/gitlab.rb
    2. Manually compare the diff of /etc/gitlab/gitlab-secrets.json
  3. 🐺 Coordinator: Check SSH host keys match

    • Production:
      • bin/compare-host-keys gitlab.com gprd.gitlab.com
      • SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com
  4. 🐺 Coordinator: Ensure the repository and wiki verification feature flag shows as enabled on both primary and secondary

    • Feature.enabled?(:geo_repository_verification)
  5. 🐺 Coordinator: Ensure the TTL for affected DNS records is low

    • A TTL of 300 seconds is fine (a quick console check is sketched after this list)
    • Production:
      • gitlab.com
      • altssh.gitlab.com
      • gitlab-org.gitlab.io
  6. 🐺 Coordinator: Ensure SSL configuration on the secondary is valid for primary domain names too

    • Handy script in the migration repository: bin/check-ssl <hostname>:<port>
    • Production:
      • bin/check-ssl gprd.gitlab.com:443
      • bin/check-ssl gitlab-org.gprd.gitlab.io:443
  7. 🔪 Chef-Runner: Ensure SSH connectivity to all hosts, including host key verification

    • chef-client role:gitlab-base pwd
  8. 🔪 Chef-Runner: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:

    1. bundle exec knife ssh "roles:gprd-base-be* OR roles:gprd-base-fe* OR roles:gprd-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'
  9. 🔪 Chef-Runner: Ensure that mailroom nodes have been configured with the right roles:

    • Production: bundle exec knife ssh "role:gprd-base-be-mailroom" hostname
  10. 🔪 Chef-Runner: Ensure all hot-patches are applied to the target environment:

    1. Fetch the latest version of post-deployment-patches
    2. Check the omnibus version running in the target environment
      • Production: knife role show gprd-omnibus-version | grep version:
    3. In post-deployment-patches, ensure that the version manifest has a corresponding GCP Chef role under the target environment
      • E.g. In 11.1/MANIFEST.yml, versions.11.1.0-rc10-ee.environments.staging should have gstg-base-fe-api along with staging-base-fe-api
    4. Run gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod
      • The command can fail because the patches may already have been applied; that's OK.
  11. 🔪 Chef-Runner: Ensure outstanding merge requests are up to date with master

  12. 🐘 Database-Wrangler: Ensure gitlab-ctl repmgr cluster show works on all database nodes
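
For step 5 above (DNS TTLs), one quick way to check from a console is a short Ruby sketch using the standard-library Resolv module; the record names come from the checklist, and querying a public resolver such as 8.8.8.8 is an assumption here (a cached resolver reports the remaining TTL, which is at most the configured value):

    require 'resolv'

    # Print the TTL of the A records for each production name and flag
    # anything above the 300-second target.
    %w[gitlab.com altssh.gitlab.com gitlab-org.gitlab.io].each do |name|
      Resolv::DNS.open(nameserver: ['8.8.8.8']) do |dns|
        dns.getresources(name, Resolv::DNS::Resource::IN::A).each do |record|
          status = record.ttl <= 300 ? 'OK' : 'too high'
          puts format('%-25s TTL %5d  %s', name, record.ttl, status)
        end
      end
    end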

Ensure Geo replication is up to date

  1. 🐺 Coordinator: Ensure database replication is healthy and up to date
    • Create a test issue on the primary and wait for it to appear on the secondary
    • This should take no more than 5 minutes
  2. 🐺 Coordinator: Ensure Sidekiq is healthy (see the console sketch after this list)
    • Busy + Enqueued + Retries should total less than 10,000, with fewer than 100 retries
    • Scheduled jobs should not be present, or should all be scheduled to be run before the switchover starts
    • Production: https://gitlab.com/admin/background_jobs
    • From a rails console: Sidekiq::Stats.new
    • "Dead" jobs will be lost on switchover but can be ignored as we routinely ignore them
    • "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
  3. 🐺 Coordinator: Ensure repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%):
  4. 🐺 Coordinator: Local CI artifacts, LFS objects and Uploads should have 0 in all columns
    • Production: this may fluctuate around 0 due to background upload. This is OK.
  5. 🐺 Coordinator: Ensure the Geo event log is being processed (see the sketch after this list)
    • In a rails console for both primary and secondary: Geo::EventLog.maximum(:id)
      • This may be nil. If so, perform a git push to a random project to generate a new event
    • In a rails console for the secondary: Geo::EventLogState.last_processed
    • All numbers should be within 10,000 of each other.
  6. 🐺 Coordinator: Reconcile negative registry entries
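
A console sketch pulling together the numbers from steps 2 and 5, using the thresholds above. The Sidekiq check runs on the primary; for the event log, the maximum ID from the primary has to be compared by hand against the secondary (the 42_000_000 figure is an illustrative placeholder):

    # Primary console: Sidekiq health. Busy + enqueued + retries should total
    # under 10,000, with fewer than 100 retries.
    stats = Sidekiq::Stats.new
    total = stats.workers_size + stats.enqueued + stats.retry_size
    ok = total < 10_000 && stats.retry_size < 100
    puts "busy=#{stats.workers_size} enqueued=#{stats.enqueued} retries=#{stats.retry_size}"
    puts "total=#{total} (#{ok ? 'OK' : 'investigate'})"

    # Primary console: latest Geo event ID (push to a random project first if
    # this returns nil).
    Geo::EventLog.maximum(:id)        # => 42_000_000 (placeholder)

    # Secondary console: the last processed event; all numbers should be
    # within 10,000 of the primary's figure.
    Geo::EventLogState.last_processed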

Verify the integrity of replicated repositories and wikis

  1. 🐺 Coordinator: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
  2. No need to verify the integrity of anything in object storage

Perform an automated QA run against the current infrastructure

  1. 🏆 Quality: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
  2. 🏆 Quality: Post the result in the test plan issue. This will be used as the yardstick against which to compare the "During switchover" automated QA run.

Schedule the switchover

  1. 🐺 Coordinator: Ask the 🔪 Chef-Runner, 🏆 Quality, and 🐘 Database-Wrangler to perform their preflight tasks
  2. 🐺 Coordinator: Pick a date and time for the switchover itself that won't interfere with the release team's work.
  3. 🐺 Coordinator: Create a new issue in the tracker using the "failover" template
  4. 🐺 Coordinator: Create a new issue in the tracker using the "test plan" template
  5. 🐺 Coordinator: Create a new issue in the tracker using the "failback" template