Feedback from the 2018-08-07 failover attempt

@@ -177,7 +177,7 @@ an hour before the scheduled maintenance window.
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
    * `sudo crontab -e` to get an editor window, comment out the line involving {- rsync -}{+ a pages-sync script +}
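A minimal non-interactive sketch of the same comment-out step; it assumes the cron entry can be matched on the string `pages-sync`, which this runbook does not confirm:

```bash
# Comment out (rather than delete) any uncommented crontab line mentioning
# pages-sync, so it can be restored after the failover.
sudo crontab -l | sed '/pages-sync/ s/^\([^#]\)/# \1/' | sudo crontab -
sudo crontab -l | grep pages-sync   # verify the entry is now commented out
```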
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
    * Expected to take ~30 minutes; run it in screen/tmux on the **Azure** pages NFS server! (An illustrative sketch follows this step.)
* Updates to pages after the transfer starts will be lost.
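The exact sync command is not reproduced in this excerpt. Purely to illustrate the shape of a parallelized, incremental sync, the sketch below fans rsync out over the top-level pages directories; the paths, the destination host `gcp-pages-nfs`, and the parallelism of 8 are all assumptions:

```bash
# Run inside screen/tmux on the Azure pages NFS server.
cd /var/opt/gitlab/gitlab-rails/shared/pages
ls -d */ | xargs -P 8 -I {} \
  rsync -a --delete "{}" "gcp-pages-nfs:/var/opt/gitlab/gitlab-rails/shared/pages/{}"
```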
@@ -289,6 +289,7 @@ Running CI jobs will no longer be able to push updates. Jobs that complete now m
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
* Wait for `--> Status: PROCEED`
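The contents of `030-await-sidekiq-drain.sh` are not shown in this excerpt. As a rough stand-in, a loop of the following shape would print `PROCEED` once nothing is busy, enqueued, scheduled, or retrying; the 10-second poll interval and the use of `gitlab-rails runner` on a Rails node are assumptions:

```bash
while true; do
  remaining=$(sudo gitlab-rails runner '
    require "sidekiq/api"
    stats = Sidekiq::Stats.new
    puts stats.enqueued + stats.scheduled_size + stats.retry_size + Sidekiq::Workers.new.size
  ')
  if [ "$remaining" -eq 0 ]; then
    echo "--> Status: PROCEED"
    break
  fi
  echo "--> Status: WAIT ($remaining jobs outstanding)"
  sleep 10
done
```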
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
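Verification progress can also be watched from a terminal via the Geo nodes status API; the node ID of `2`, the `ADMIN_TOKEN` variable, and the exact counter names below are assumptions, not taken from this runbook:

```bash
# Poll the Geo status endpoint and pull out the repository verification counters.
curl --silent --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/geo_nodes/2/status" |
  jq '{repositories_count, repositories_verified_count, repositories_verification_failed_count}'
```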
@@ -325,9 +326,9 @@ state of the secondary to converge.
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis
* Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
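If the console step needs to be scripted or double-checked, something along these lines should work from a shell on a Rails node; the assumption that `::Ci::BuildTraceChunk.redis` scopes to chunks still held in Redis (so its count should fall to zero afterwards) is inferred from the command above, not from separate documentation:

```bash
# Flush Redis-held trace chunks to the database, then check how many remain.
sudo gitlab-rails runner '::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)'
sudo gitlab-rails runner 'puts ::Ci::BuildTraceChunk.redis.count'
```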
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
@@ -355,10 +356,10 @@ state of the secondary to converge.
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
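As a rough illustration of "run it in a loop" (not the actual failover script), the following keeps re-disabling every non-Geo Sidekiq cron until interrupted; the name filter and the poll interval are assumptions:

```bash
while true; do
  sudo gitlab-rails runner '
    Sidekiq::Cron::Job.all.reject { |job| job.name.include?("geo") }.each(&:disable!)
  '
  sleep 10
done
```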
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
* Need more details?
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
* If a `geo_metrics_update` job is running, that can be ignored
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
    * This ensures the PostgreSQL promotion can happen and gives a better guarantee of Sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
@@ -13,7 +13,7 @@ if ENV["FAILOVER_ENVIRONMENT"] == "stg" || `hostname -f` == "deploy.stg.gitlab.c
$purge_allowed << "background_migration"
end
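# (Presumably) when dry_run is false the script really purges the allowed queues instead of only reporting what it would do.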
$dry_run = {- true -}{+ false +}
def queue_can_be_purged(queue_name)
# Make sure that the geo crons are not included in this list...