2018-06-05 staging failover preflight checks
Pre-flight checks
GitLab Version Checks
- Ensure that both sides are running the same minor version.
  - Versions can be confirmed using the Omnibus version tracker dashboards:
    - Staging
    - Production
Object storage
- Ensure the primary and secondary share the same object storage configuration for the following keys in config/gitlab.yml:
  - uploads
  - lfs
  - artifacts
- Ensure the container registry has the same object storage configuration on the primary and secondary
- Ensure all artifacts and LFS objects are in object storage
  - If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
  - On staging, these numbers are non-zero. Just mark as checked.
  - LfsObject.with_files_stored_locally.count # => 0
  - Ci::JobArtifact.with_files_stored_locally.count # => 0
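The intent of the two console checks above can be sketched as a small helper. `local_files_report` is a hypothetical name; in a real Rails console the counts would come from the `with_files_stored_locally` scopes shown above:

```ruby
# Hypothetical helper: given per-store counts of files still held on local
# disk, return only the stores that are not yet fully in object storage.
def local_files_report(counts)
  counts.select { |_store, count| count.positive? }
end

# In a GitLab Rails console the real counts would be gathered like:
#   counts = {
#     lfs:       LfsObject.with_files_stored_locally.count,
#     artifacts: Ci::JobArtifact.with_files_stored_locally.count,
#   }
# All values should be zero before the failover (on staging they are not;
# just mark the check as done).
local_files_report(lfs: 0, artifacts: 0)   # => {} — ready to fail over
local_files_report(lfs: 0, artifacts: 12)  # artifacts still on local disk
```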
Pre-migrated services
- Check that the container registry has been migrated to GCP
Configuration checks
- Ensure gitlab-rake gitlab:geo:check reports no errors on the primary or secondary
  - A warning may be output regarding AuthorizedKeysCommand. This is OK, and tracked in infrastructure#4280.
- Compare some files on a representative node (a web worker) between primary and secondary:
  - Manually compare the diff of /etc/gitlab/gitlab.rb
  - Manually compare the diff of /etc/gitlab/gitlab-secrets.json
  - Ensure /etc/gitlab/gitlab-registry.key is identical
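For the file that must be byte-identical, comparing checksums is less error-prone than eyeballing a diff. A minimal sketch, where `same_content?` is a hypothetical helper and the strings stand in for file contents read on each node:

```ruby
require "digest"

# Hypothetical helper: two files are identical iff their SHA-256 digests
# match; run the digest on each node and compare the hex strings.
def same_content?(primary_contents, secondary_contents)
  Digest::SHA256.hexdigest(primary_contents) ==
    Digest::SHA256.hexdigest(secondary_contents)
end

# e.g. contents = File.read("/etc/gitlab/gitlab-registry.key") on each node
same_content?("key-material\n", "key-material\n")  # => true
same_content?("key-material\n", "other-key\n")     # => false
```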
- Check that SSH host keys match
  - Staging: bin/compare-host-keys staging.gitlab.com gstg.gitlab.com
  - Production: bin/compare-host-keys gitlab.com gprd.gitlab.com
- PRODUCTION ONLY UNTESTED Ensure SSH host keys match for the altssh alias: SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com
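What bin/compare-host-keys verifies can be approximated as follows. `mismatched_host_keys` is a hypothetical helper; in practice the key maps would be built from ssh-keyscan output for each host. Any mismatch would invalidate users' known_hosts entries after the failover:

```ruby
# Hypothetical helper: compare host keys (key type => public key) scanned
# from the old and new hosts, returning the key types that differ.
def mismatched_host_keys(old_keys, new_keys)
  (old_keys.keys | new_keys.keys).reject { |type| old_keys[type] == new_keys[type] }
end

# Key maps would be built from e.g. `ssh-keyscan staging.gitlab.com`
# (illustrative key material, not real values):
staging = { "ssh-rsa" => "AAAAB3...x", "ssh-ed25519" => "AAAAC3...y" }
gstg    = { "ssh-rsa" => "AAAAB3...x", "ssh-ed25519" => "AAAAC3...z" }

mismatched_host_keys(staging, gstg)  # => ["ssh-ed25519"]
```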
- Ensure the repository and wiki verification feature flag shows as enabled on both primary and secondary: Feature.enabled?(:geo_repository_verification)
- Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
- PRODUCTION ONLY UNTESTED Ensure the secondary can send emails
  - Run the following in a Rails console (changing you to yourself): Notify.test_email("you+test@gitlab.com", "Test email", "test")
  - Ensure you receive the email
- Ensure the SSL configuration on the secondary is valid for the primary domain names too
  - Handy script in the migration repository: bin/check-ssl <hostname>:<port>
  - Staging: [registry.]gstg.gitlab.com -> [registry.]staging.gitlab.com
  - Production: [registry.]gprd.gitlab.com -> [registry.]gitlab.com
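The core idea bin/check-ssl tests — that the certificate served on the secondary must also cover the primary's names — can be sketched like this. `cert_covers?` is a hypothetical helper; a real check would pull the Subject Alternative Name list from the live certificate (e.g. via openssl s_client):

```ruby
# Hypothetical helper: does a certificate's Subject Alternative Name list
# cover the given hostname? Handles single-level wildcards, as TLS does.
def cert_covers?(sans, hostname)
  sans.any? do |san|
    if san.start_with?("*.")
      # A wildcard matches exactly one label: *.gitlab.com matches
      # registry.gitlab.com but not a.b.gitlab.com or gitlab.com itself.
      first, rest = hostname.split(".", 2)
      !first.nil? && rest == san[2..-1]
    else
      san == hostname
    end
  end
end

# After failover, gstg's certificate must also be valid for the old names
# (illustrative SAN list, not the real certificate):
sans = ["gstg.gitlab.com", "*.gitlab.com", "staging.gitlab.com"]
cert_covers?(sans, "staging.gitlab.com")           # => true
cert_covers?(sans, "registry.staging.gitlab.com")  # => false (needs its own SAN)
```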
- Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
  - bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'
- Ensure that mailroom nodes have been configured with the right roles:
  - Staging: bundle exec knife ssh "role:gstg-base-be-mailroom" hostname
  - Production: bundle exec knife ssh "role:gprd-base-be-mailroom" hostname
Ensure Geo replication is up to date
- Ensure database replication is healthy and up to date
  - Create a test issue on the primary and wait for it to appear on the secondary
  - This should take less than 5 minutes at most
- Ensure Sidekiq is healthy: busy + enqueued + retries + scheduled jobs should total less than 10,000, with fewer than 100 retries
  - Staging: https://staging.gitlab.com/admin/background_jobs
  - Production: https://gitlab.com/admin/background_jobs
  - From a Rails console: Sidekiq::Stats.new
  - "Dead" jobs will be lost on failover but can be ignored, as we routinely ignore them
  - "Failed" is just a counter that includes dead jobs for the last 5 years, so it can be ignored
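The thresholds above can be written down directly. `sidekiq_healthy?` is a hypothetical helper; in a Rails console the inputs would come from Sidekiq::Stats.new (which exposes counts such as enqueued, retry_size and scheduled_size):

```ruby
# Hypothetical helper encoding the thresholds above: busy + enqueued +
# retries + scheduled must total under 10,000, with fewer than 100 retries.
# Dead and failed counts are deliberately ignored, per the notes above.
def sidekiq_healthy?(busy:, enqueued:, retries:, scheduled:)
  total = busy + enqueued + retries + scheduled
  total < 10_000 && retries < 100
end

sidekiq_healthy?(busy: 250, enqueued: 1_200, retries: 40, scheduled: 300)  # => true
sidekiq_healthy?(busy: 250, enqueued: 9_800, retries: 40, scheduled: 300)  # => false (total too high)
sidekiq_healthy?(busy: 250, enqueued: 1_200, retries: 150, scheduled: 0)   # => false (too many retries)
```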
- Ensure attachments, repositories and wikis are at least 99% complete, with 0 failed (that’s zero, not 0%):
  - Staging: https://staging.gitlab.com/admin/geo_nodes
  - Production: https://gitlab.com/admin/geo_nodes
  - Observe the "Sync Information" tab for the secondary
  - See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
  - Staging: some failures and unsynced repositories are expected
- Local CI artifacts and LFS objects should have 0 in all columns
  - Staging: some failures and unsynced files are expected
  - Production: this may fluctuate around 0 due to background upload. This is OK.
  - Artifacts are not migrated to object storage, so these need to be 100% complete
- Ensure the Geo event log is being processed
  - In a Rails console for both primary and secondary: Geo::EventLog.maximum(:id)
  - In a Rails console for the secondary: Geo::EventLogState.last_processed
  - All numbers should be within 10,000 of each other.
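The comparison between the two console queries can be made explicit. `event_log_caught_up?` is a hypothetical helper around the numbers returned by Geo::EventLog.maximum(:id) and Geo::EventLogState.last_processed:

```ruby
# Hypothetical helper: the secondary's processing cursor should trail the
# primary's newest event by fewer than 10,000 events, per the check above.
def event_log_caught_up?(primary_max_id, last_processed_id, tolerance: 10_000)
  (primary_max_id - last_processed_id).abs < tolerance
end

event_log_caught_up?(1_500_000, 1_499_990)  # => true  (lag of 10)
event_log_caught_up?(1_500_000, 1_450_000)  # => false (lag of 50,000)
```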
Verify the integrity of replicated repositories and wikis
- Ensure that repository and wiki verification is at least 99% complete, with 0 failed (that’s zero, not 0%):
  - Staging: https://gstg.gitlab.com/admin/geo_nodes
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Review the numbers under the Verification Information tab for the secondary node
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  - No need to verify the integrity of anything in object storage
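The acceptance criterion — at least 99% verified and exactly zero failures — can be encoded as a small predicate. `verification_ok?` is a hypothetical name; the real numbers come from the admin Geo nodes page:

```ruby
# Hypothetical helper: verification passes only when at least 99% of items
# are verified AND no item has failed ("0 failed" is an absolute count of
# zero, not a percentage rounded down to 0%).
def verification_ok?(total:, verified:, failed:)
  return false unless failed.zero?
  return false if total.zero?

  verified * 100.0 / total >= 99.0
end

verification_ok?(total: 10_000, verified: 9_950, failed: 0)  # => true  (99.5%)
verification_ok?(total: 10_000, verified: 9_950, failed: 1)  # => false (one failure)
verification_ok?(total: 10_000, verified: 9_800, failed: 0)  # => false (only 98%)
```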
Pages
- Verify that the Pages Azure-to-GCP proxy is working correctly (see https://gitlab.com/gitlab-com/migration/issues/159)
- Perform GitLab Pages data verification (see https://gitlab.com/gitlab-com/migration/issues/388)
Schedule the failover
- Pick a date and time for the failover itself that won't interfere with the release team's work
- Verify with the RMs for the next release that the chosen date is OK
- Create a new issue in the tracker using the "failover" template
- Create a new issue in the tracker using the "test plan" template
- Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited by Nick Thomas