2018-06-05 staging failover preflight checks

Pre-flight checks

GitLab Version Checks

  1. Ensure that both sides are running the same minor version.
    • Versions can be confirmed using the Omnibus version tracker dashboards:
      • Staging
        • GCP gstg: https://performance.gstg.gitlab.net/d/TvELheimz/gitlab-omnibus-versions
        • Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
      • Production
        • GCP gprd: https://performance.gprd.gitlab.net/d/TvELheimz/gitlab-omnibus-versions
        • Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
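The "same minor version" comparison can be sketched in Ruby as below. This is a minimal illustration, not part of the migration tooling: the helper names `major_minor` and `same_minor_version?` are hypothetical, and the version strings are assumed to be copied by hand from the dashboards above.

```ruby
require "rubygems" # Gem::Version is part of RubyGems, bundled with Ruby

# Returns the [major, minor] pair of a version string such as "10.8.3-ee".
# (Hypothetical helper; the suffix handling is an assumption about how
# Omnibus versions are displayed.)
def major_minor(version)
  # Keep only the leading dotted-number part, dropping suffixes like "-ee".
  core = version[/\A\d+(\.\d+)*/]
  raise ArgumentError, "unparseable version: #{version.inspect}" if core.nil?

  segments = Gem::Version.new(core).segments
  [segments[0], segments[1] || 0]
end

# True when both sides run the same major.minor release; the patch level
# is allowed to differ.
def same_minor_version?(primary, secondary)
  major_minor(primary) == major_minor(secondary)
end
```

For example, `same_minor_version?("10.8.3-ee", "10.8.4-ee")` passes the check, while a 10.8 primary against a 10.7 secondary does not.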

Object storage

  1. Ensure primary and secondary share the same object storage configuration. In config/gitlab.yml, compare the following keys:
    1. uploads
    2. lfs
    3. artifacts
  2. Ensure the container registry has the same object storage configuration on primary and secondary
  3. Ensure all artifacts and LFS objects are in object storage
    • If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
    • On staging, these numbers are non-zero. Just mark as checked.
    1. LfsObject.with_files_stored_locally.count # => 0
    2. Ci::JobArtifact.with_files_stored_locally.count # => 0

Pre-migrated services

  1. Check that the container registry has been migrated to GCP

Configuration checks

  1. Ensure gitlab-rake gitlab:geo:check reports no errors on the primary or secondary
    • A warning may be output regarding AuthorizedKeysCommand. This is OK, and tracked in infrastructure#4280.
  2. Compare some files on a representative node (a web worker) between primary and secondary:
    1. Manually compare the diff of /etc/gitlab/gitlab.rb
    2. Manually compare the diff of /etc/gitlab/gitlab-secrets.json
    3. Ensure /etc/gitlab/gitlab-registry.key is identical
  3. Check SSH host keys match
    • Staging: bin/compare-host-keys staging.gitlab.com gstg.gitlab.com
    • Production: bin/compare-host-keys gitlab.com gprd.gitlab.com
  4. PRODUCTION ONLY, UNTESTED: Ensure SSH host keys match for the altssh alias:
    • SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com
  5. Ensure repository and wiki verification feature flag shows as enabled on both primary and secondary
    • Feature.enabled?(:geo_repository_verification)
  6. Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
  7. PRODUCTION ONLY, UNTESTED: Ensure the secondary can send emails
    1. Run the following in a Rails console (replacing you with your own username): Notify.test_email("you+test@gitlab.com", "Test email", "test")
    2. Ensure you receive the email
  8. Ensure SSL configuration on the secondary is valid for primary domain names too
    • Handy script in the migration repository: bin/check-ssl <hostname>:<port>
    • Staging: [registry.]gstg.gitlab.com -> [registry.]staging.gitlab.com
    • Production: [registry.]gprd.gitlab.com -> [registry.]gitlab.com
  9. Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
    1. bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'
  10. Ensure that mailroom nodes have been configured with the right roles:
    • Staging: bundle exec knife ssh "role:gstg-base-be-mailroom" hostname
    • Production: bundle exec knife ssh "role:gprd-base-be-mailroom" hostname
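For the "ensure the file is identical" comparison in step 2, one option is to compare digests rather than eyeballing the contents. The sketch below assumes copies of the file from the primary and secondary have already been fetched to local paths; `identical_files?` is a hypothetical helper, not a script from the migration repository.

```ruby
require "digest"

# True when both files have the same SHA-256 digest, i.e. the same content.
# The paths passed in are assumed to be local copies of the same file taken
# from the primary and the secondary.
def identical_files?(path_a, path_b)
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end
```

Comparing digests also avoids printing secrets such as the registry key to the terminal.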

Ensure Geo replication is up to date

  1. Ensure database replication is healthy and up to date
    • Create a test issue on the primary and wait for it to appear on the secondary
    • This should take less than 5 minutes
  2. Ensure sidekiq is healthy: busy+enqueued+retries+scheduled jobs should total less than 10,000, with fewer than 100 retries
    • Staging: https://staging.gitlab.com/admin/background_jobs
    • Production: https://gitlab.com/admin/background_jobs
    • From a rails console: Sidekiq::Stats.new
    • "Dead" jobs will be lost on failover, but this is acceptable because we routinely ignore them
    • "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
  3. Ensure attachments, repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%):
    • Staging: https://staging.gitlab.com/admin/geo_nodes
    • Production: https://gitlab.com/admin/geo_nodes
    • Observe the "Sync Information" tab for the secondary
    • See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
    • Staging: some failures and unsynced repositories are expected
  4. Local CI artifacts and LFS objects should have 0 in all columns
    • Staging: some failures and unsynced files are expected
    • Production: this may fluctuate around 0 due to background upload. This is OK.
    • Artifacts are not migrated to object storage, so these need to be 100% complete
  5. Ensure Geo event log is being processed
    • In a rails console for both primary and secondary: Geo::EventLog.maximum(:id)
    • In a rails console for the secondary: Geo::EventLogState.last_processed
    • All numbers should be within 10,000 of each other.
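The numeric thresholds in steps 2 and 5 can be written down as small predicates. This is a sketch under assumptions: the method and parameter names are hypothetical, and the inputs are plain integers read off the admin pages or a Rails console (for instance from Sidekiq::Stats.new and Geo::EventLog.maximum(:id)).

```ruby
# Step 2: busy + enqueued + retries + scheduled must total fewer than
# 10,000 jobs, with fewer than 100 retries. Dead and failed counts are
# deliberately left out, as the checklist says they can be ignored.
def sidekiq_healthy?(busy:, enqueued:, retries:, scheduled:)
  total = busy + enqueued + retries + scheduled
  total < 10_000 && retries < 100
end

# Step 5: the maximum event log ID on the primary, the maximum on the
# secondary, and the last processed ID on the secondary should all be
# within 10,000 of each other.
def event_log_caught_up?(primary_max_id:, secondary_max_id:, last_processed_id:, tolerance: 10_000)
  ids = [primary_max_id, secondary_max_id, last_processed_id]
  ids.max - ids.min <= tolerance
end
```

Checking the spread between the largest and smallest ID covers all three pairwise comparisons at once.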

Verify the integrity of replicated repositories and wikis

  1. Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
    • Staging: https://gstg.gitlab.com/admin/geo_nodes
    • Production: https://gprd.gitlab.com/admin/geo_nodes
    • Review the numbers under the Verification Information tab for the secondary node
    • If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  2. No need to verify the integrity of anything in object storage
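The "at least 99% complete, 0 failed" rule used for both sync and verification can be expressed as one predicate. A minimal sketch, with a hypothetical method name and counts assumed to be copied from the geo_nodes admin page:

```ruby
# True when there are zero failures (an absolute count, not a percentage)
# and at least min_percent of the items are done. Integer arithmetic
# avoids floating-point rounding at the 99% boundary.
def sync_acceptable?(done:, total:, failed:, min_percent: 99)
  return false unless failed.zero?
  # Treating an empty set as complete is an assumption of this sketch.
  return true if total.zero?

  done * 100 >= total * min_percent
end
```

A single failure fails the check even at 100% completion, which matches the "that's zero, not 0%" note in the checklist.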

Pages

  1. Verify that Pages Azure-to-GCP Proxy is correctly working (see https://gitlab.com/gitlab-com/migration/issues/159)
  2. Perform GitLab Pages data verification (see https://gitlab.com/gitlab-com/migration/issues/388)

Schedule the failover

  1. Pick a date and time for the failover itself that won't interfere with the release team's work.
  2. Verify with RMs for the next release that the chosen date is OK
  3. Create a new issue in the tracker using the "failover" template
  4. Create a new issue in the tracker using the "test plan" template
  5. Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited Jun 05, 2018 by Nick Thomas