
2018-08-11 PRODUCTION switchover attempt: preflight checks

Pre-flight checks

Dashboards and Alerts

  1. 🐺 Coordinator: Ensure that there are no active alerts in the Azure or GCP environments.
  2. 🐺 Coordinator: Review the switchover dashboards for GCP and Azure (https://gitlab.com/gitlab-com/migration/issues/485)

GitLab Version and CDN Checks

  1. 🐺 Coordinator: Ensure that both sides are running the same minor version. It's OK if the minor version differs for db nodes (tier == db), as there have been problems in the past with auto-restarting the databases; they are now only updated in a controlled way.

  2. 🐺 Coordinator: Ensure that the Fastly CDN IP ranges are up to date.

Object storage

  1. 🐺 Coordinator: Ensure primary and secondary share the same object storage configuration. For each line below, execute it first on the primary console and copy the result, then execute the same line on the secondary console, appending == and the pasted result from the primary. Each comparison should evaluate to true; false means the configuration differs (see the sketch after this list).
    1. Gitlab.config.uploads
    2. Gitlab.config.lfs
    3. Gitlab.config.artifacts
  2. 🐺 Coordinator: Ensure all artifacts and LFS objects are in object storage
    • If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
    1. Upload.with_files_stored_locally.count # => 0
    2. LfsObject.with_files_stored_locally.count # => 13 (there are a small number of known-lost LFS objects)
    3. Ci::JobArtifact.with_files_stored_locally.count # => 0
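
For reference, a minimal console sketch of the comparison described in step 1 and the counts in step 2; the hash literal below is an illustrative placeholder, paste whatever the primary actually printed:

    # Primary console: evaluate each setting and copy the printed value.
    Gitlab.config.uploads
    # => {"enabled"=>true, "object_store"=>{"enabled"=>true, "remote_directory"=>"uploads"}}

    # Secondary console: re-run the same expression, append ==, and paste the
    # value copied from the primary. Each comparison should return true.
    Gitlab.config.uploads == {"enabled"=>true, "object_store"=>{"enabled"=>true, "remote_directory"=>"uploads"}}
    # => true

    # Step 2: confirm nothing is left on local disk (expected values from the checklist).
    Upload.with_files_stored_locally.count          # => 0
    LfsObject.with_files_stored_locally.count       # => 13 (known-lost LFS objects)
    Ci::JobArtifact.with_files_stored_locally.count # => 0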

Pre-migrated services

  1. 🐺 Coordinator: Check that the container registry has been pre-migrated to GCP

Configuration checks

  1. 🐺 Coordinator: Ensure gitlab-rake gitlab:geo:check reports no errors on the primary or secondary

    • A warning may be output regarding AuthorizedKeysCommand. This is OK, and tracked in infrastructure#4280.
  2. Compare some files on a representative node (a web worker) between primary and secondary:

    1. Manually compare the diff of /etc/gitlab/gitlab.rb
    2. Manually compare the diff of /etc/gitlab/gitlab-secrets.json
  3. 🐺 Coordinator: Check SSH host keys match

    • Production:
      • bin/compare-host-keys gitlab.com gprd.gitlab.com
      • SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com
  4. 🐺 Coordinator: Ensure the repository and wiki verification feature flag shows as enabled on both primary and secondary

    • Feature.enabled?(:geo_repository_verification)
  5. 🐺 Coordinator: Ensure the TTL for affected DNS records is low

    • A TTL of 300 seconds is fine (a quick console check is sketched after this list)
    • Production:
      • gitlab.com
      • altssh.gitlab.com
      • gitlab-org.gitlab.io
  6. 🐺 Coordinator: Ensure SSL configuration on the secondary is valid for primary domain names too

    • Handy script in the migration repository: bin/check-ssl <hostname>:<port>
    • Production:
      • bin/check-ssl gprd.gitlab.com:443
      • bin/check-ssl gitlab-org.gprd.gitlab.io:443
  7. 🔪 Chef-Runner: Ensure SSH connectivity to all hosts, including host key verification

    • chef-client role:gitlab-base pwd
  8. 🔪 Chef-Runner: Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:

    1. bundle exec knife ssh "roles:gprd-base-be* OR roles:gprd-base-fe* OR roles:gprd-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'
  9. 🔪 Chef-Runner: Ensure that mailroom nodes have been configured with the right roles:

    • Production: bundle exec knife ssh "role:gprd-base-be-mailroom" hostname
  10. 🔪 Chef-Runner: Ensure all hot-patches are applied to the target environment:

    1. Fetch the latest version of post-deployment-patches
    2. Check the omnibus version running in the target environment
      • Production: knife role show gprd-omnibus-version | grep version:
    3. In post-deployment-patches, ensure that the version manifest has a corresponding GCP Chef role under the target environment
      • E.g. In 11.1/MANIFEST.yml, versions.11.1.0-rc10-ee.environments.staging should have gstg-base-fe-api along with staging-base-fe-api
    4. Run gitlab-patcher -mode patch -workdir /path/to/post-deployment-patches/version -chef-repo /path/to/chef-repo target-version staging-or-prod
      • The command can fail because the patches may already have been applied; that's OK.
  11. 🔪 Chef-Runner: Ensure outstanding merge requests are up to date with master

  12. 🐘 Database-Wrangler: Ensure gitlab-ctl repmgr cluster show works on all database nodes
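
For step 5 above (DNS TTLs), one quick way to check from a console is a short Ruby sketch using the standard-library Resolv module; the record names come from the checklist, and querying a public resolver such as 8.8.8.8 is an assumption here (a cached resolver reports the remaining TTL, which is at most the configured value):

    require 'resolv'

    # Print the TTL of the A records for each production name and flag
    # anything above the 300-second target.
    %w[gitlab.com altssh.gitlab.com gitlab-org.gitlab.io].each do |name|
      Resolv::DNS.open(nameserver: ['8.8.8.8']) do |dns|
        dns.getresources(name, Resolv::DNS::Resource::IN::A).each do |record|
          status = record.ttl <= 300 ? 'OK' : 'too high'
          puts format('%-25s TTL %5d  %s', name, record.ttl, status)
        end
      end
    end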

Ensure Geo replication is up to date

  1. 🐺 Coordinator: Ensure database replication is healthy and up to date
    • Create a test issue on the primary and wait for it to appear on the secondary
    • This should take no more than 5 minutes
  2. 🐺 Coordinator: Ensure Sidekiq is healthy (see the console sketch after this list)
    • Busy + Enqueued + Retries should total less than 10,000, with fewer than 100 retries
    • Scheduled jobs should not be present, or should all be scheduled to be run before the switchover starts
    • Production: https://gitlab.com/admin/background_jobs
    • From a rails console: Sidekiq::Stats.new
    • "Dead" jobs will be lost on switchover but can be ignored as we routinely ignore them
    • "Failed" is just a counter that includes dead jobs for the last 5 years, so can be ignored
  3. 🐺 Coordinator: Ensure repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%):
  4. 🐺 Coordinator: Local CI artifacts, LFS objects and Uploads should have 0 in all columns
    • Production: this may fluctuate around 0 due to background upload. This is OK.
  5. 🐺 Coordinator: Ensure the Geo event log is being processed (see the sketch after this list)
    • In a rails console for both primary and secondary: Geo::EventLog.maximum(:id)
      • This may be nil. If so, perform a git push to a random project to generate a new event
    • In a rails console for the secondary: Geo::EventLogState.last_processed
    • All numbers should be within 10,000 of each other.
  6. 🐺 Coordinator: Reconcile negative registry entries
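
A console sketch pulling together the numbers from steps 2 and 5, using the thresholds above. The Sidekiq check runs on the primary; for the event log, the maximum ID from the primary has to be compared by hand against the secondary (the 42_000_000 figure is an illustrative placeholder):

    # Primary console: Sidekiq health. Busy + enqueued + retries should total
    # under 10,000, with fewer than 100 retries.
    stats = Sidekiq::Stats.new
    total = stats.workers_size + stats.enqueued + stats.retry_size
    ok = total < 10_000 && stats.retry_size < 100
    puts "busy=#{stats.workers_size} enqueued=#{stats.enqueued} retries=#{stats.retry_size}"
    puts "total=#{total} (#{ok ? 'OK' : 'investigate'})"

    # Primary console: latest Geo event ID (push to a random project first if
    # this returns nil).
    Geo::EventLog.maximum(:id)        # => 42_000_000 (placeholder)

    # Secondary console: the last processed event; all numbers should be
    # within 10,000 of the primary's figure.
    Geo::EventLogState.last_processed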

Verify the integrity of replicated repositories and wikis

  1. 🐺 Coordinator: Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
  2. No need to verify the integrity of anything in object storage

Perform an automated QA run against the current infrastructure

  1. 🏆 Quality: Perform an automated QA run against the current infrastructure, using the same command as in the test plan issue
  2. 🏆 Quality: Post the result in the test plan issue. This will be used as the yardstick against which to compare the "During switchover" automated QA run.

Schedule the switchover

  1. 🐺 Coordinator: Ask the 🔪 Chef-Runner, 🏆 Quality, and 🐘 Database-Wrangler to perform their preflight tasks
  2. 🐺 Coordinator: Pick a date and time for the switchover itself that won't interfere with the release team's work.
  3. 🐺 Coordinator: Create a new issue in the tracker using the "failover" template
  4. 🐺 Coordinator: Create a new issue in the tracker using the "test plan" template
  5. 🐺 Coordinator: Create a new issue in the tracker using the "failback" template