2018-06-05 staging failover preflight checks
Pre-flight checks
GitLab Version Checks
- Ensure that both sides are running the same minor version.
  - Versions can be confirmed using the Omnibus version tracker dashboards:
    - Staging
    - Production
Object storage
- Ensure the primary and secondary share the same object storage configuration for the following keys in config/gitlab.yml:
  - uploads
  - lfs
  - artifacts
- Ensure the container registry has the same object storage configuration on the primary and secondary
- Ensure all artifacts and LFS objects are in object storage
  - If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
  - On staging, these numbers are non-zero. Just mark as checked.
  - LfsObject.with_files_stored_locally.count # => 0
  - Ci::JobArtifact.with_files_stored_locally.count # => 0
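The intent of the two console checks above can be sketched as a small helper. `local_files_report` is a hypothetical name; in a real Rails console the counts would come from the `with_files_stored_locally` scopes shown above:

```ruby
# Hypothetical helper: given per-store counts of files still held on local
# disk, return only the stores that are not yet fully in object storage.
def local_files_report(counts)
  counts.select { |_store, count| count.positive? }
end

# In a GitLab Rails console the real counts would be gathered like:
#   counts = {
#     lfs:       LfsObject.with_files_stored_locally.count,
#     artifacts: Ci::JobArtifact.with_files_stored_locally.count,
#   }
# All values should be zero before the failover (on staging they are not;
# just mark the check as done).
local_files_report(lfs: 0, artifacts: 0)   # => {} — ready to fail over
local_files_report(lfs: 0, artifacts: 12)  # artifacts still on local disk
```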
Pre-migrated services
- Check that the container registry has been migrated to GCP
Configuration checks
- Ensure gitlab-rake gitlab:geo:check reports no errors on the primary or secondary
  - A warning may be output regarding AuthorizedKeysCommand. This is OK, and tracked in infrastructure#4280.
- Compare some files on a representative node (a web worker) between primary and secondary:
  - Manually compare the diff of /etc/gitlab/gitlab.rb
  - Manually compare the diff of /etc/gitlab/gitlab-secrets.json
  - Ensure /etc/gitlab/gitlab-registry.key is identical
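For the file that must be byte-identical, comparing checksums is less error-prone than eyeballing a diff. A minimal sketch, where `same_content?` is a hypothetical helper and the strings stand in for file contents read on each node:

```ruby
require "digest"

# Hypothetical helper: two files are identical iff their SHA-256 digests
# match; run the digest on each node and compare the hex strings.
def same_content?(primary_contents, secondary_contents)
  Digest::SHA256.hexdigest(primary_contents) ==
    Digest::SHA256.hexdigest(secondary_contents)
end

# e.g. contents = File.read("/etc/gitlab/gitlab-registry.key") on each node
same_content?("key-material\n", "key-material\n")  # => true
same_content?("key-material\n", "other-key\n")     # => false
```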
- Check that SSH host keys match
  - Staging: bin/compare-host-keys staging.gitlab.com gstg.gitlab.com
  - Production: bin/compare-host-keys gitlab.com gprd.gitlab.com
- PRODUCTION ONLY UNTESTED Ensure SSH host keys match for the altssh alias: SSH_PORT=443 bin/compare-host-keys altssh.gitlab.com altssh.gprd.gitlab.com
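What bin/compare-host-keys verifies can be approximated as follows. `mismatched_host_keys` is a hypothetical helper; in practice the key maps would be built from ssh-keyscan output for each host. Any mismatch would invalidate users' known_hosts entries after the failover:

```ruby
# Hypothetical helper: compare host keys (key type => public key) scanned
# from the old and new hosts, returning the key types that differ.
def mismatched_host_keys(old_keys, new_keys)
  (old_keys.keys | new_keys.keys).reject { |type| old_keys[type] == new_keys[type] }
end

# Key maps would be built from e.g. `ssh-keyscan staging.gitlab.com`
# (illustrative key material, not real values):
staging = { "ssh-rsa" => "AAAAB3...x", "ssh-ed25519" => "AAAAC3...y" }
gstg    = { "ssh-rsa" => "AAAAB3...x", "ssh-ed25519" => "AAAAC3...z" }

mismatched_host_keys(staging, gstg)  # => ["ssh-ed25519"]
```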
- Ensure the repository and wiki verification feature flag shows as enabled on both primary and secondary: Feature.enabled?(:geo_repository_verification)
- Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
- PRODUCTION ONLY UNTESTED Ensure the secondary can send emails
  - Run the following in a Rails console (changing you to yourself): Notify.test_email("you+test@gitlab.com", "Test email", "test")
  - Ensure you receive the email
- Ensure the SSL configuration on the secondary is valid for the primary domain names too
  - Handy script in the migration repository: bin/check-ssl <hostname>:<port>
  - Staging: [registry.]gstg.gitlab.com -> [registry.]staging.gitlab.com
  - Production: [registry.]gprd.gitlab.com -> [registry.]gitlab.com
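The core idea bin/check-ssl tests — that the certificate served on the secondary must also cover the primary's names — can be sketched like this. `cert_covers?` is a hypothetical helper; a real check would pull the Subject Alternative Name list from the live certificate (e.g. via openssl s_client):

```ruby
# Hypothetical helper: does a certificate's Subject Alternative Name list
# cover the given hostname? Handles single-level wildcards, as TLS does.
def cert_covers?(sans, hostname)
  sans.any? do |san|
    if san.start_with?("*.")
      # A wildcard matches exactly one label: *.gitlab.com matches
      # registry.gitlab.com but not a.b.gitlab.com or gitlab.com itself.
      first, rest = hostname.split(".", 2)
      !first.nil? && rest == san[2..-1]
    else
      san == hostname
    end
  end
end

# After failover, gstg's certificate must also be valid for the old names
# (illustrative SAN list, not the real certificate):
sans = ["gstg.gitlab.com", "*.gitlab.com", "staging.gitlab.com"]
cert_covers?(sans, "staging.gitlab.com")           # => true
cert_covers?(sans, "registry.staging.gitlab.com")  # => false (needs its own SAN)
```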
- Ensure that all nodes can talk to the internal API. You can ignore container registry and mailroom nodes:
  - bundle exec knife ssh "roles:gstg-base-be* OR roles:gstg-base-fe* OR roles:gstg-base-stor-nfs" 'sudo -u git /opt/gitlab/embedded/service/gitlab-shell/bin/check'
- Ensure that mailroom nodes have been configured with the right roles:
  - Staging: bundle exec knife ssh "role:gstg-base-be-mailroom" hostname
  - Production: bundle exec knife ssh "role:gprd-base-be-mailroom" hostname
Ensure Geo replication is up to date
- Ensure database replication is healthy and up to date
  - Create a test issue on the primary and wait for it to appear on the secondary
  - This should take less than 5 minutes at most
- Ensure Sidekiq is healthy: busy + enqueued + retries + scheduled jobs should total less than 10,000, with fewer than 100 retries
  - Staging: https://staging.gitlab.com/admin/background_jobs
  - Production: https://gitlab.com/admin/background_jobs
  - From a Rails console: Sidekiq::Stats.new
  - "Dead" jobs will be lost on failover but can be ignored, as we routinely ignore them
  - "Failed" is just a counter that includes dead jobs for the last 5 years, so it can be ignored
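The thresholds above can be written down directly. `sidekiq_healthy?` is a hypothetical helper; in a Rails console the inputs would come from Sidekiq::Stats.new (which exposes counts such as enqueued, retry_size and scheduled_size):

```ruby
# Hypothetical helper encoding the thresholds above: busy + enqueued +
# retries + scheduled must total under 10,000, with fewer than 100 retries.
# Dead and failed counts are deliberately ignored, per the notes above.
def sidekiq_healthy?(busy:, enqueued:, retries:, scheduled:)
  total = busy + enqueued + retries + scheduled
  total < 10_000 && retries < 100
end

sidekiq_healthy?(busy: 250, enqueued: 1_200, retries: 40, scheduled: 300)  # => true
sidekiq_healthy?(busy: 250, enqueued: 9_800, retries: 40, scheduled: 300)  # => false (total too high)
sidekiq_healthy?(busy: 250, enqueued: 1_200, retries: 150, scheduled: 0)   # => false (too many retries)
```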
- Ensure attachments, repositories and wikis are at least 99% complete, with 0 failed (that’s zero, not 0%):
  - Staging: https://staging.gitlab.com/admin/geo_nodes
  - Production: https://gitlab.com/admin/geo_nodes
  - Observe the "Sync Information" tab for the secondary
  - See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
  - Staging: some failures and unsynced repositories are expected
- Local CI artifacts and LFS objects should have 0 in all columns
  - Staging: some failures and unsynced files are expected
  - Production: this may fluctuate around 0 due to background upload. This is OK.
  - Artifacts are not migrated to object storage, so these need to be 100% complete
- Ensure the Geo event log is being processed
  - In a Rails console for both primary and secondary: Geo::EventLog.maximum(:id)
  - In a Rails console for the secondary: Geo::EventLogState.last_processed
  - All numbers should be within 10,000 of each other.
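The comparison between the two console queries can be made explicit. `event_log_caught_up?` is a hypothetical helper around the numbers returned by Geo::EventLog.maximum(:id) and Geo::EventLogState.last_processed:

```ruby
# Hypothetical helper: the secondary's processing cursor should trail the
# primary's newest event by fewer than 10,000 events, per the check above.
def event_log_caught_up?(primary_max_id, last_processed_id, tolerance: 10_000)
  (primary_max_id - last_processed_id).abs < tolerance
end

event_log_caught_up?(1_500_000, 1_499_990)  # => true  (lag of 10)
event_log_caught_up?(1_500_000, 1_450_000)  # => false (lag of 50,000)
```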
Verify the integrity of replicated repositories and wikis
- Ensure that repository and wiki verification is at least 99% complete, with 0 failed (that’s zero, not 0%):
  - Staging: https://gstg.gitlab.com/admin/geo_nodes
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Review the numbers under the Verification Information tab for the secondary node
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  - No need to verify the integrity of anything in object storage
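The acceptance criterion — at least 99% verified and exactly zero failures — can be encoded as a small predicate. `verification_ok?` is a hypothetical name; the real numbers come from the admin Geo nodes page:

```ruby
# Hypothetical helper: verification passes only when at least 99% of items
# are verified AND no item has failed ("0 failed" is an absolute count of
# zero, not a percentage rounded down to 0%).
def verification_ok?(total:, verified:, failed:)
  return false unless failed.zero?
  return false if total.zero?

  verified * 100.0 / total >= 99.0
end

verification_ok?(total: 10_000, verified: 9_950, failed: 0)  # => true  (99.5%)
verification_ok?(total: 10_000, verified: 9_950, failed: 1)  # => false (one failure)
verification_ok?(total: 10_000, verified: 9_800, failed: 0)  # => false (only 98%)
```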
Pages
- Verify that the Pages Azure-to-GCP proxy is working correctly (see https://gitlab.com/gitlab-com/migration/issues/159)
- Perform GitLab Pages data verification (see https://gitlab.com/gitlab-com/migration/issues/388)
Schedule the failover
- Pick a date and time for the failover itself that won't interfere with the release team's work
- Verify with the RMs for the next release that the chosen date is OK
- Create a new issue in the tracker using the "failover" template
- Create a new issue in the tracker using the "test plan" template
- Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited by Nick Thomas