Commit 6597b28a authored by Nick Thomas

Merge branch '2018-08-08-feedback' into 'master'

Feedback from the 2018-08-07 failover attempt

See merge request gitlab-com/migration!191
parents 255c9186 25750b32
@@ -177,7 +177,7 @@ an hour before the scheduled maintenance window.
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
* `sudo crontab -e` to get an editor window, comment out the line involving a pages-sync script
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after the transfer starts will be lost.
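As a sketch of the cron change in the step above (an assumption that the crontab entry can be identified by the string `pages-sync`), the line can also be commented out non-interactively instead of via `sudo crontab -e`:

```bash
# Hypothetical non-interactive equivalent of editing the crontab by hand:
# comment out any root crontab line mentioning pages-sync on the Azure
# pages NFS server. Review `sudo crontab -l` before and after.
sudo crontab -l | sed '/pages-sync/ s/^[^#]/# &/' | sudo crontab -
```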
@@ -243,12 +243,11 @@ you see something happening that shouldn't be public, mention it.
1. [ ] 🐺 {+ Coordinator +}: Ensure that there are no active alerts on the Azure or GCP environments.
* Staging
* Staging
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
* GCP `gstg`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gstg&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gstg&var-prometheus_app=prometheus-app-01-inf-gstg
* Azure Staging: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22stg%22%7D
* Production
* GCP `gprd`: https://dashboards.gitlab.net/d/SOn6MeNmk/alerts?orgId=1&var-interval=1m&var-environment=gprd&var-alertname=All&var-alertstate=All&var-prometheus=prometheus-01-inf-gprd&var-prometheus_app=prometheus-app-01-inf-gprd
* Azure Production: https://alerts.gitlab.com/#/alerts?silenced=false&inhibited=false&filter=%7Benvironment%3D%22prd%22%7D
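The dashboards above are the canonical check; purely as a rough CLI spot-check (assuming the Alertmanager behind alerts.gitlab.com exposes its v1 API to you and you are authorized to query it), something like the following lists firing, unsilenced alerts for Azure production:

```bash
# Sketch only: list firing, unsilenced alert names for environment="prd".
curl -s 'https://alerts.gitlab.com/api/v1/alerts?silenced=false&inhibited=false' \
  | jq -r '.data[] | select(.labels.environment == "prd") | .labels.alertname'
```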
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
@@ -289,6 +288,7 @@ Running CI jobs will no longer be able to push updates. Jobs that complete now m
* In a separate terminal on the deploy host: `/opt/gitlab-migration/migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop should be stopped once sidekiq is shut down
* Wait for `--> Status: PROCEED`
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
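The drain step above relies on `030-await-sidekiq-drain.sh`; purely as an illustration of what "wait for `--> Status: PROCEED`" means (this is not the actual script), a polling loop over Sidekiq's own stats could look like:

```bash
# Illustrative only: poll Sidekiq until nothing is enqueued or busy.
# The real check is done by 030-await-sidekiq-drain.sh on the deploy host.
while true; do
  counts=$(sudo gitlab-rails runner \
    'stats = Sidekiq::Stats.new; puts "#{stats.enqueued} #{stats.workers_size}"')
  enqueued=${counts% *}
  busy=${counts#* }
  echo "enqueued=${enqueued} busy=${busy}"
  if [ "${enqueued}" = "0" ] && [ "${busy}" = "0" ]; then
    echo "--> Status: PROCEED"
    break
  fi
  sleep 30
done
```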
@@ -325,9 +325,9 @@ state of the secondary to converge.
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis
* Run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Flush CI traces in Redis to the database
* In a Rails console in Azure:
* `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
1. [ ] 🐺 {+ Coordinator +}: Wait for all repositories and wikis to become synchronized
* Staging: https://gstg.gitlab.com/admin/geo_nodes
* Production: https://gprd.gitlab.com/admin/geo_nodes
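Besides the admin pages above, the same synchronization picture is available from a shell on the GCP (secondary) side via the Geo rake task; a minimal check, assuming it is run on a Rails node of the secondary:

```bash
# Prints repository/wiki sync and verification counters for this Geo secondary.
sudo gitlab-rake geo:status
```

Repositories and wikis should be fully synchronized before moving on to the next step.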
@@ -355,10 +355,10 @@ state of the secondary to converge.
* The loop should be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
* Need more details?
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
* Production: Navigate to [https://gprd.gitlab.com/admin/background_jobs](https://gprd.gitlab.com/admin/background_jobs)
* `Busy`, `Enqueued`, `Scheduled`, and `Retry` should all be 0
* If a `geo_metrics_update` job is running, that can be ignored
1. [ ] 🔪 {+ Chef-Runner +}: Stop sidekiq in GCP
* This ensures the postgresql promotion can happen and gives a better guarantee of sidekiq consistency
* Staging: `knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl stop sidekiq-cluster"`
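After the stop command above, it may be worth confirming the service is actually down across the fleet; a sketch for staging (the production command would use the `gprd` analogue of the role, which is not shown in this hunk):

```bash
# Staging check; expect "down: sidekiq-cluster" (or no process) on every node.
knife ssh roles:gstg-base-be-sidekiq "sudo gitlab-ctl status sidekiq-cluster"
```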
@@ -13,7 +13,7 @@ if ENV["FAILOVER_ENVIRONMENT"] == "stg" || `hostname -f` == "deploy.stg.gitlab.c
$purge_allowed << "background_migration"
end
$dry_run = true
$dry_run = false
def queue_can_be_purged(queue_name)
# Make sure that the geo crons are not included in this list...
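Before flipping `$dry_run` to `false` in the script above, the set of Sidekiq queues that would be affected can be reviewed from a Rails node; a sketch (not part of the script itself):

```bash
# List the Sidekiq queues currently present so the purge allow-list can be
# sanity-checked against the geo crons that must not be purged.
sudo gitlab-rails runner 'puts Sidekiq::Queue.all.map(&:name)'
```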