2018-06-07 staging failover preflight checks

Pre-flight checks

GitLab Version Checks

  1. Ensure that both sides are running the same minor version.
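
The version check above could be sketched as follows. This is a minimal illustration, not part of the failover tooling: the helper name `same_minor_version?` is hypothetical, and the version strings are assumed to be copied by hand from the primary and the secondary.

```ruby
require "rubygems"

# Hypothetical helper: true when both version strings share the same
# major and minor components (the patch level is ignored).
def same_minor_version?(primary_version, secondary_version)
  primary = Gem::Version.new(primary_version).segments.first(2)
  secondary = Gem::Version.new(secondary_version).segments.first(2)
  primary == secondary
end

# Example version strings, made up for illustration.
puts same_minor_version?("10.8.3", "10.8.4")
puts same_minor_version?("10.8.3", "10.7.5")
```

Only the first two segments are compared, so differing patch releases still pass the check.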

Object storage

  1. Ensure primary and secondary share the same object storage configuration. In config/gitlab.yml, compare the following keys:
    1. uploads
    2. lfs
    3. artifacts
  2. Ensure the container registry has the same object storage configuration on primary and secondary
  3. Ensure all artifacts and LFS objects are in object storage
    • If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
    • On staging, these numbers are non-zero. Just mark as checked.
    1. LfsObject.with_files_stored_locally.count # => 0
    2. Ci::JobArtifact.with_files_stored_locally.count # => 0
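
One way to compare the object storage settings listed above is to load both configurations and diff only the relevant keys. The sketch below is illustrative: the helper `object_store_mismatches` is hypothetical, the YAML layout is simplified (a real config/gitlab.yml nests these keys under an environment name), and the bucket names are made up.

```ruby
require "yaml"

# Keys named in the checklist above.
OBJECT_STORE_KEYS = %w[uploads lfs artifacts].freeze

# Hypothetical helper: returns the keys whose "object_store" settings differ
# between the two parsed configurations.
def object_store_mismatches(primary_config, secondary_config, keys = OBJECT_STORE_KEYS)
  keys.reject do |key|
    primary_config.dig(key, "object_store") == secondary_config.dig(key, "object_store")
  end
end

# Simplified, made-up configuration for illustration only.
primary = YAML.safe_load(<<~CONFIG)
  uploads:
    object_store:
      enabled: true
      remote_directory: uploads-bucket
  lfs:
    object_store:
      enabled: true
      remote_directory: lfs-bucket
  artifacts:
    object_store:
      enabled: true
      remote_directory: artifacts-bucket
CONFIG

# Deep-copy the primary config, then introduce one deliberate difference.
secondary = Marshal.load(Marshal.dump(primary))
secondary["lfs"]["object_store"]["remote_directory"] = "other-bucket"

p object_store_mismatches(primary, secondary)
```

An empty result means the compared sections match. Settings outside the `object_store` sections are not inspected.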

Pre-migrated services

  1. Check that the container registry has been migrated to GCP

Configuration checks

  1. Ensure gitlab-rake gitlab:geo:check reports no errors on the primary or secondary
    • A warning may be output regarding AuthorizedKeysCommand. This is OK, and tracked in infrastructure#4280.
  2. Compare some files on a representative node (a web worker) between primary and secondary:
    1. Manually compare the diff of /etc/gitlab/gitlab.rb
    2. Manually compare the diff of /etc/gitlab/gitlab-secrets.json
    3. Ensure /etc/gitlab/gitlab-registry.key is identical
  3. Check SSH host keys match
    • Staging: bin/compare-host-keys staging.gitlab.com gstg.gitlab.com
    • Production: bin/compare-host-keys gitlab.com gprd.gitlab.com
  4. PRODUCTION ONLY, UNTESTED: Ensure SSH host keys match for the altssh alias:
    • SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com
  5. Ensure repository and wiki verification feature flag shows as enabled on both primary and secondary
    • Feature.enabled?(:geo_repository_verification)
  6. Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
  7. PRODUCTION ONLY, UNTESTED: Ensure the secondary can send emails
    1. Run the following in a Rails console (replacing you with your own username): Notify.test_email("you+test@gitlab.com", "Test email", "test")
    2. Ensure you receive the email
  8. Ensure SSL configuration on the secondary is valid for primary domain names too
    • Handy script in the migration repository: bin/check-ssl <hostname>:<port>
    • Staging: [registry.]gstg.gitlab.com -> [registry.]staging.gitlab.com
    • Production: [registry.]gprd.gitlab.com -> [registry.]gitlab.com
  9. Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
    1. bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'
  10. Ensure that mailroom nodes have been configured with the right roles:
    • Staging: bundle exec knife ssh "role:gstg-base-be-mailroom" hostname
    • Production: bundle exec knife ssh "role:gprd-base-be-mailroom" hostname
  11. Ensure outstanding Chef merge requests are up to date and have no merge conflicts
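
For the "ensure the file is identical" step above (for example /etc/gitlab/gitlab-registry.key), comparing checksums avoids printing secret material to the terminal. This is a minimal sketch: the helper `identical_files?` is hypothetical, and it assumes both copies are readable from the machine running the script. The demonstration uses temporary files rather than real keys.

```ruby
require "digest/sha2"
require "tempfile"

# Hypothetical helper: true when both files have the same SHA-256 digest.
def identical_files?(path_a, path_b)
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end

# Demonstration with throwaway files standing in for the two copies.
Tempfile.create("primary-copy") do |a|
  Tempfile.create("secondary-copy") do |b|
    a.write("example contents")
    a.flush
    b.write("example contents")
    b.flush
    puts identical_files?(a.path, b.path)
  end
end
```

The same helper would also flag differences in gitlab.rb or gitlab-secrets.json. The checklist asks for a manual diff of those files, though, since some differences between primary and secondary are expected.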

Ensure Geo replication is up to date

  1. Ensure database replication is healthy and up to date
    • Create a test issue on the primary and wait for it to appear on the secondary
    • This should take less than 5 minutes
  2. Ensure Sidekiq is healthy: busy + enqueued + retries + scheduled jobs should total fewer than 10,000, with fewer than 100 retries
  3. Ensure attachments, repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%):
  4. Local CI artifacts and LFS objects should have 0 in all columns
    • Staging: some failures and unsynced files are expected
    • Production: this may fluctuate around 0 due to background upload. This is OK.
    • Artifacts are not migrated to object storage, so these need to be 100% complete
  5. Ensure Geo event log is being processed
    • In a Rails console for both primary and secondary: Geo::EventLog.maximum(:id)
    • In a Rails console for the secondary: Geo::EventLogState.last_processed
    • All numbers should be within 10,000 of each other.
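
The numeric thresholds above (Sidekiq totals and event log distance) can be written as two small predicates. This is a sketch under assumptions: the helper names are hypothetical, and all inputs are assumed to be plain integers copied by hand from the Sidekiq dashboard and the Rails consoles.

```ruby
SIDEKIQ_MAX_TOTAL = 10_000    # busy + enqueued + retries + scheduled must stay below this
SIDEKIQ_MAX_RETRIES = 100     # retries must stay below this
EVENT_LOG_MAX_GAP = 10_000    # maximum allowed spread between event log ids

# Hypothetical helper: applies the Sidekiq thresholds from the checklist.
def sidekiq_healthy?(busy:, enqueued:, retries:, scheduled:)
  total = busy + enqueued + retries + scheduled
  total < SIDEKIQ_MAX_TOTAL && retries < SIDEKIQ_MAX_RETRIES
end

# Hypothetical helper: true when all three ids are within
# EVENT_LOG_MAX_GAP of each other.
def event_log_caught_up?(primary_max_id:, secondary_max_id:, last_processed_id:)
  ids = [primary_max_id, secondary_max_id, last_processed_id]
  ids.max - ids.min <= EVENT_LOG_MAX_GAP
end

# Made-up numbers for illustration.
puts sidekiq_healthy?(busy: 20, enqueued: 1_500, retries: 12, scheduled: 300)
puts event_log_caught_up?(primary_max_id: 5_020_000,
                          secondary_max_id: 5_018_500,
                          last_processed_id: 5_017_900)
```

"Within 10,000 of each other" is interpreted here as the difference between the largest and smallest id being at most 10,000.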

Verify the integrity of replicated repositories and wikis

  1. Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
  2. No need to verify the integrity of anything in object storage
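
The "at least 99% complete, 0 failed" rule above distinguishes a failure count of zero from a failure rate that merely rounds to 0%. A minimal sketch of that rule follows; the helper `verification_ok?` is hypothetical, and the counts are assumed to be read by hand from the Geo status output.

```ruby
# Hypothetical helper: passes only when nothing has failed (an absolute
# count of zero) and at least min_percent of items are verified.
# Integer arithmetic avoids floating-point rounding at the boundary.
def verification_ok?(total:, verified:, failed:, min_percent: 99)
  return false unless failed.zero?

  verified * 100 >= total * min_percent
end

# Made-up counts for illustration.
puts verification_ok?(total: 200_000, verified: 199_000, failed: 0)
puts verification_ok?(total: 200_000, verified: 199_999, failed: 1)
```

In the second call a single failure makes the check fail even though more than 99% of items are verified, which is the distinction the checklist draws.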

Pages

  1. Verify that Pages Azure-to-GCP Proxy is correctly working (see https://gitlab.com/gitlab-com/migration/issues/159)
  2. Perform GitLab Pages data verification (see https://gitlab.com/gitlab-com/migration/issues/388)

Schedule the failover

  1. Pick a date and time for the failover itself that won't interfere with the release team's work.
  2. Verify with the release managers (RMs) for the next release that the chosen date is OK
  3. Create a new issue in the tracker using the "failover" template
  4. Create a new issue in the tracker using the "test plan" template
  5. Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited by Brett Walker