From 20d9d91e983759241581a36e34b038b7ccdd3a60 Mon Sep 17 00:00:00 2001
From: Nick Thomas
Date: Thu, 9 Aug 2018 17:33:28 +0100
Subject: [PATCH] Address feedback from 2018-08-09 failover attempt

---
 .gitlab/issue_templates/failover.md           | 21 ++++++++++++---------
 .gitlab/issue_templates/preflight_checks.md   |  3 ++-
 .../060_go/p04/shutdown-azure-primary.sh      |  3 +++
 .../060_go/p04/shutdown-azure-secondaries.sh  |  3 +++
 4 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/.gitlab/issue_templates/failover.md b/.gitlab/issue_templates/failover.md
index f4946ef..f201ec3 100644
--- a/.gitlab/issue_templates/failover.md
+++ b/.gitlab/issue_templates/failover.md
@@ -327,6 +327,8 @@ state of the secondary to converge.
 1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
     * In a Rails console in Azure:
     * `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
+1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
+    * Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
 1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
     * Staging: https://gstg.gitlab.com/admin/geo_nodes
     * Production: https://gprd.gitlab.com/admin/geo_nodes
@@ -342,8 +344,11 @@ state of the secondary to converge.
     * You can also use `sudo gitlab-rake geo:status`
     * If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
     * On staging, verification may not complete
-1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
-1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
+1. [ ] 🐺 {+ Coordinator +}: Ensure the whole event log has been processed
+    * In Azure: `Geo::EventLog.maximum(:id)`
+    * In GCP: `Geo::EventLogState.last_processed.id`
+    * The two numbers should be the same
+1. [ ] 🐺 {+ Coordinator +}: Ensure the prospective failover target in GCP is up to date
     * `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/check-wal-secondary-sync.sh`
     * Assuming the clocks are in sync, this value should be close to 0
     * If this is a large number, GCP may not have some data that is in Azure
@@ -448,12 +453,9 @@ of errors while it is being promoted.
 1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
     * Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
     * Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
-1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
+1. [ ] 🔪 {+ Chef-Runner +}: Ensure that Unicorn processes have been restarted on all hosts
     * Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
     * Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
-    * [ ] Unicorn
-    * [ ] Sidekiq
-    * [ ] Gitlab Pages
 1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
     * This ensures we continue to generate Geo event logs for a time, maybe useful for last-gasp failback
     * In a Rails console in GCP:
@@ -548,6 +550,9 @@ unexpected ways.
 #### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
 
+1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
+    * In a Rails console, run:
+    * `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
 1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
     - [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN`
     - This will take a long time
 1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
@@ -567,9 +572,7 @@ unexpected ways.
 1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
     * Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
     * Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
-1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
-    * In a Rails console, run:
-    * `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
+
 
 #### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
 
diff --git a/.gitlab/issue_templates/preflight_checks.md b/.gitlab/issue_templates/preflight_checks.md
index 147b068..ad72318 100644
--- a/.gitlab/issue_templates/preflight_checks.md
+++ b/.gitlab/issue_templates/preflight_checks.md
@@ -154,7 +154,8 @@
     * This may be `nil`. If so, perform a `git push` to a random project to generate a new event
     * In a rails console for the secondary: `Geo::EventLogState.last_processed`
     * All numbers should be within 10,000 of each other.
-
+1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
+    * Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
 
 ## Verify the integrity of replicated repositories and wikis
 
diff --git a/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh b/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
index 9a03565..ac3d77a 100755
--- a/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
+++ b/bin/scripts/02_failover/060_go/p04/shutdown-azure-primary.sh
@@ -18,6 +18,9 @@ read -r
 
 ssh_host "$POSTGRESQL_AZURE_PRIMARY" "sudo gitlab-ctl stop postgresql"
 
+echo "Waiting an appropriate time..."
+sleep 5
+
 echo "Getting status:"
 p_status=$(ssh_host "$POSTGRESQL_AZURE_PRIMARY" "sudo gitlab-ctl status postgresql")
 echo "$POSTGRESQL_AZURE_PRIMARY: $p_status"
diff --git a/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh b/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
index 4ab373a..986e240 100755
--- a/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
+++ b/bin/scripts/02_failover/060_go/p04/shutdown-azure-secondaries.sh
@@ -25,6 +25,9 @@ do
   ssh_host "$secondary" "sudo gitlab-ctl stop postgresql"
 done
 
+echo "Waiting an appropriate time..."
+sleep 5
+
 echo "Getting status:"
 for secondary in ${POSTGRESQL_AZURE_SECONDARIES[*]}
 do
-- 
GitLab
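
Reviewer note: the new "Ensure the whole event log has been processed" step asks the coordinator to compare two IDs read from two different Rails consoles. Below is a minimal sketch of that comparison, assuming both values are readable from a single Rails console on the GCP secondary (the template itself reads the maximum in Azure); the `event_log_drained?` helper name is illustrative only and is not part of this patch.

```ruby
# Sketch only, not part of this patch: compare the newest Geo event ID with the
# ID last processed by the Geo log cursor. Assumes a Rails console on the GCP
# secondary; the template reads Geo::EventLog.maximum(:id) in Azure instead.
def event_log_drained?
  last_event_id     = Geo::EventLog.maximum(:id)             # newest event emitted by the primary
  last_processed_id = Geo::EventLogState.last_processed&.id  # cursor position on the secondary

  return false if last_event_id.nil? || last_processed_id.nil?

  last_event_id == last_processed_id
end

puts(event_log_drained? ? "Event log fully processed" : "Cursor still catching up")
```

If the two IDs stay apart, the checklist's existing guidance applies: the cursor has not caught up, so keep waiting for the secondary to converge before proceeding.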