Address feedback from 2018-08-09 failover attempt

parent edca1216
Pipeline #89011 passed in 18 seconds
@@ -327,6 +327,8 @@ state of the secondary to converge.
1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
* Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
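Not a runbook step: a minimal convenience sketch for watching sync progress from a shell on the secondary instead of refreshing the admin page, assuming `sudo gitlab-rake geo:status` is available on the secondary's Rails node (the task is referenced later in this checklist).

```bash
# Hedged convenience sketch: poll Geo sync status every 60 seconds so the
# Coordinator can watch repositories and wikis converge from a terminal.
while true; do
  date
  sudo gitlab-rake geo:status
  sleep 60
done
```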
@@ -342,8 +344,11 @@ state of the secondary to converge.
* You can also use `sudo gitlab-rake geo:status`
* If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
* On staging, verification may not complete
1. [ ] 🐺 {+ Coordinator +}: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure the prospective failover target in GCP is up to date
1. [ ] 🐺 {+ Coordinator +}: Ensure the whole event log has been processed
* In Azure: `Geo::EventLog.maximum(:id)`
* In GCP: `Geo::EventLogState.last_processed.id`
* The two numbers should be the same
1. [ ] 🐺 {+ Coordinator +}: Ensure the prospective failover target in GCP is up to date
* `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p03/check-wal-secondary-sync.sh`
* Assuming the clocks are in sync, this value should be close to 0
* If this is a large number, GCP may not have some data that is in Azure
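The contents of `check-wal-secondary-sync.sh` are not shown here; as a hedged sketch only, the kind of number it reports can be approximated on the GCP standby with a replay-lag query like the one below (assumes `gitlab-psql` and the `gitlabhq_production` database).

```bash
# Hedged sketch, not the runbook script itself: approximate replication replay
# lag in seconds on the GCP standby. Should be close to 0 when caught up.
sudo gitlab-psql -d gitlabhq_production -c \
  "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS replay_lag_seconds;"
```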
@@ -448,12 +453,9 @@ of errors while it is being promoted.
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that important processes have been restarted on all hosts
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that Unicorn processes have been restarted on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* Production: `knife ssh roles:gprd-base 'sudo gitlab-ctl status 2>/dev/null' | sort -k 3`
* [ ] Unicorn
* [ ] Sidekiq
* [ ] Gitlab Pages
1. [ ] 🔪 {+ Chef-Runner +}: Fix the Geo node hostname for the old secondary
* This ensures we continue to generate Geo event logs for a while, which may be useful for a last-gasp failback
* In a Rails console in GCP:
@@ -548,6 +550,9 @@ unexpected ways.
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
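Not a runbook step: a hedged follow-up check after the standby conversion, to confirm the former WAL-E node now reports as a standby in the repmgr cluster.

```bash
# Hedged verification sketch: list the repmgr cluster members and confirm the
# converted node appears with the standby role.
sudo gitlab-ctl repmgr cluster show
```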
@@ -567,9 +572,7 @@ unexpected ways.
1. [ ] 🔪 {+ Chef-Runner +}: Make the GCP environment accessible to the outside world
* Staging: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112
* Production: Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.where("id NOT IN (?)", Ci::Runner.instance_type.joins(:taggings).joins(:tags).where("tags.name = ?", "gitlab-org").pluck(:id)).update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
@@ -154,7 +154,8 @@
* This may be `nil`. If so, perform a `git push` to a random project to generate a new event
* In a rails console for the secondary: `Geo::EventLogState.last_processed`
* All numbers should be within 10,000 of each other (see the sketch after this list)
1. [ ] 🐺 {+ Coordinator +}: Reconcile negative registry entries
* Follow the instructions in https://dev.gitlab.org/gitlab-com/migration/blob/master/runbooks/geo/negative-out-of-sync-metrics.md
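As referenced above, a hedged sketch of running the event-log catch-up check non-interactively with `gitlab-rails runner` instead of an interactive console; which host is primary and which is secondary is left to the operator.

```bash
# Hedged sketch (not from the runbook): print both sides of the event-log
# catch-up check without opening a Rails console.
# On the current primary's Rails node:
sudo gitlab-rails runner 'puts "max event id: #{Geo::EventLog.maximum(:id)}"'
# On the secondary's Rails node:
sudo gitlab-rails runner 'puts "last processed by cursor: #{Geo::EventLogState.last_processed&.id}"'
```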
## Verify the integrity of replicated repositories and wikis
@@ -18,6 +18,9 @@ read -r
ssh_host "$POSTGRESQL_AZURE_PRIMARY" "sudo gitlab-ctl stop postgresql"
echo "Waiting an appropriate time..."
sleep 5
echo "Getting status:"
p_status=$(ssh_host "$POSTGRESQL_AZURE_PRIMARY" "sudo gitlab-ctl status postgresql")
echo "$POSTGRESQL_AZURE_PRIMARY: $p_status"
@@ -25,6 +25,9 @@ do
ssh_host "$secondary" "sudo gitlab-ctl stop postgresql"
done
echo "Waiting an appropriate time..."
sleep 5
echo "Getting status:"
for secondary in ${POSTGRESQL_AZURE_SECONDARIES[*]}
do
......