Commit b5732e28 authored by Nick Thomas

Merge branch '2018-08-02-feedback' into 'master'

Address (some) feedback from 2018-08-02 failover attempt

Closes #753

See merge request gitlab-com/migration!176
parents 718eb91c 1a2975e7
Pipeline #88662 passed with stage in 12 seconds
@@ -77,16 +77,25 @@ first.
 * GCP: `web-01-sv-gprd.c.gitlab-production.internal`
-# Grafana dashboards
-These dashboards might be useful during the failover:
-* Staging:
-    * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
-    * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
-* Production:
-    * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
-    * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
+# Dashboards and debugging
+* These dashboards might be useful during the failover:
+    * Staging:
+        * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
+        * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
+    * Production:
+        * Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
+        * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
+* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
+    * Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
+    * Production:
+        * Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
+        * Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
+        * Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
+        * Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
+        * Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
+* The logs can be used to inspect any area of the stack in more detail
+    * https://log.gitlab.net/
 # **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
@@ -144,15 +153,16 @@ an hour before the scheduled maintenance window.
 1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
 1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
     - `environment`: `prd`
+1. [ ] **PRODUCTION ONLY** 🔪 {+ Chef-Runner +}: Silence production alerts
+    * [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
+        * `environment`: `prd`
+        * `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
 1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
     * Block `POST /api/v4/jobs/request`
     * Staging
         * https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
         * `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
     * Production
-        * [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
-            - `environment`: `prd`
-            - `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
         * https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
         * `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
 - [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
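The silence moved in this hunk relies on a regex matcher for `alertname`. As a rough sanity check of the pattern, the sketch below emulates a fully anchored regex match in Ruby. It assumes the alert silencing UI anchors the pattern at both ends, as Alertmanager regex matchers generally do, and it uses Ruby's regex engine rather than the one the alerting stack uses, so treat it as an approximation. The helper name `silenced?` and the third alert name are hypothetical.

```ruby
# Hypothetical helper: emulate a fully anchored regex matcher on `alertname`.
# Assumption: the real matcher anchors the pattern at both ends.
def silenced?(pattern, alertname)
  Regexp.new("\\A(?:#{pattern})\\z").match?(alertname)
end

PATTERN = "High4xxApiRateLimit|High4xxRateLimit"

# The first two names come from the checklist; the third is made up to show
# an alert that the silence would not cover.
%w[High4xxApiRateLimit High4xxRateLimit High5xxErrorRate].each do |name|
  puts "#{name}: #{silenced?(PATTERN, name) ? 'silenced' : 'not silenced'}"
end
```

Under that anchoring assumption, only the two rate-limit alerts are covered, and any other production alert would still fire during the window.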
@@ -166,6 +176,7 @@ an hour before the scheduled maintenance window.
     * Production: `bin/snapshot-dbs production`
 1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
     * Disable the cronjob on the **Azure** pages NFS server
+    * This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
     * `sudo crontab -e` to get an editor window, comment out the line involving rsync
 1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
     * Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
@@ -275,6 +286,7 @@ Running CI jobs will no longer be able to push updates. Jobs that complete now m
 1. [ ] 🐺 {+ Coordinator +}: Sidekiq monitor: start purge of non-mandatory jobs, disable Sidekiq crons and allow Sidekiq to wind-down:
     * In a separate terminal on the deploy host: `/opt/gitlab-migration/bin/scripts/02_failover/060_go/p02/030-await-sidekiq-drain.sh`
     * The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
+    * The loop may be stopped once sidekiq is shut down
 1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
     * Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
     * Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
@@ -337,6 +349,7 @@ state of the secondary to converge.
 1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
     * In a dedicated rails console on the **secondary**:
         * `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
+        * The loop may be stopped once sidekiq is shut down
         * The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
 1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
     * Review status of the running Sidekiq monitor script started in [phase 2, above](#phase-2-commence-shutdown-in-azure-), wait for `--> Status: PROCEED`
@@ -401,19 +414,21 @@ of errors while it is being promoted.
 1. [ ] 🐘 {+ Database-Wrangler +}: After timeout of 30 seconds, repmgr should failover primary to the chosen node in GCP, and other nodes should automatically follow.
     - [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
     - [ ] Confirm pgbouncer node in GCP (Password is in 1password)
-    ```shell
-    $ gitlab-ctl pgb-console
-    ...
-    pgbouncer# SHOW DATABASES;
-    # You want to see lines like
-    gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
-    gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
-    ...
-    pgbouncer# SHOW SERVERS;
-    # You want to see lines like
-    S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
-    ```
+        * Staging: `pgbouncer-01-db-gstg`
+        * Production: `pgbouncer-01-db-gprd`
+        ```shell
+        $ gitlab-ctl pgb-console
+        ...
+        pgbouncer# SHOW DATABASES;
+        # You want to see lines like
+        gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
+        gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
+        ...
+        pgbouncer# SHOW SERVERS;
+        # You want to see lines like
+        S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
+        ```
 1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
     - [ ] Promote the desired primary
......
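The pgbouncer confirmation step in the hunk above asks the operator to eyeball `SHOW DATABASES;` output for the primary's IP. A minimal sketch of the same check in Ruby follows, assuming the pipe-separated layout shown in the checklist, where the first column is the database name and the second is the host. The helper name and the sample IP addresses are hypothetical and not taken from the runbook.

```ruby
# Hypothetical helper: list gitlabhq_production* databases whose host column
# does not match the expected primary IP. Assumes pipe-separated rows as in
# the checklist sample (name | host | port | ...).
def databases_not_on_primary(show_databases_output, primary_ip)
  show_databases_output.each_line.map do |line|
    columns = line.split("|").map(&:strip)
    name = columns[0]
    host = columns[1]
    next unless name&.start_with?("gitlabhq_production")

    name unless host == primary_ip
  end.compact
end

# Illustrative output only; the IP addresses are made up.
SAMPLE_OUTPUT = <<~OUT
  ...
  gitlabhq_production | 10.0.0.1 | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
  gitlabhq_production_sidekiq | 10.0.0.2 | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
  ...
OUT

stale = databases_not_on_primary(SAMPLE_OUTPUT, "10.0.0.1")
puts stale.empty? ? "all databases point at the primary" : "stale: #{stale.join(', ')}"
```

An empty result corresponds to the "lines like" condition in the checklist. Any name that is returned still points at a different host and deserves a closer look before proceeding.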
@@ -174,7 +174,7 @@
 ## Schedule the failover
-1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +} and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
+1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
 1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
 1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
 1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
......
@@ -76,14 +76,7 @@ class Check
   end
   def check_ssh
-    result = popen(%W[
-      ssh -o UserKnownHostsFile=/dev/null \
-          -o StrictHostKeyChecking=no \
-          -oConnectTimeout=5 \
-          -p #{ssh_port} \
-          -q -T \
-          git@#{hostname}
-    ])
+    result = popen(%W[ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -oConnectTimeout=5 -p #{ssh_port} -q -T git@#{hostname}])
     if result =~ /Welcome to GitLab/
       "Yes"
@@ -119,7 +112,9 @@ class Check
     return nil unless result
     orgname = result.lines.find { |l| l =~ /OrgName:/ }
-    orgname&.split(":", 2)[1].strip
+    return nil unless orgname
+    orgname.split(":", 2)[1].strip
   end
   def curl(url)
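The `check_ssh` change above collapses a multi-line `%W[...]` literal with trailing backslashes into a single line. The commit does not say why. One alternative that keeps one option per line without depending on how `%W` treats backslashes and newlines is an explicit array. The sketch below is only an illustration: `ssh_check_command` is a hypothetical helper, not part of the `Check` class, and the hostname and port in the usage line are placeholders.

```ruby
# Hypothetical helper: build the ssh argv as an explicit array, one option
# per line, with the same arguments as the single-line %W literal above.
def ssh_check_command(hostname, ssh_port)
  [
    "ssh",
    "-o", "UserKnownHostsFile=/dev/null",
    "-o", "StrictHostKeyChecking=no",
    "-oConnectTimeout=5",
    "-p", ssh_port.to_s,
    "-q", "-T",
    "git@#{hostname}"
  ]
end

# Placeholder values for illustration only.
puts ssh_check_command("gitlab.example.com", 22).join(" ")
```

Each element reaches the process as exactly one argument, so no stray whitespace or continuation characters can end up in the command line.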
......
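The `orgname` change in the last hunk replaces `orgname&.split(":", 2)[1].strip` with an explicit guard. In Ruby, `&.` only skips the single call it is attached to, so when `orgname` is `nil` the following `[1]` is still sent to `nil` and raises `NoMethodError`. The sketch below reproduces both forms side by side. The method names and the sample whois text are hypothetical stand-ins, since the enclosing method is not shown in the diff.

```ruby
# Hypothetical stand-in for the old method body: `&.` guards only `split`,
# so `[1]` is still called on nil when no OrgName line exists.
def orgname_old(result)
  return nil unless result

  orgname = result.lines.find { |l| l =~ /OrgName:/ }
  orgname&.split(":", 2)[1].strip
end

# Hypothetical stand-in for the new method body with the explicit guard.
def orgname_new(result)
  return nil unless result

  orgname = result.lines.find { |l| l =~ /OrgName:/ }
  return nil unless orgname

  orgname.split(":", 2)[1].strip
end

# Made-up whois-style text for illustration.
with_org    = "NetRange: 192.0.2.0 - 192.0.2.255\nOrgName:        Example Org\n"
without_org = "NetRange: 192.0.2.0 - 192.0.2.255\n"

puts orgname_new(with_org).inspect
puts orgname_new(without_org).inspect

begin
  orgname_old(without_org)
rescue NoMethodError => e
  puts "old form raised #{e.class}"
end
```

The old form would only have been safe with a second safe-navigation call on the index, for example `orgname&.split(":", 2)&.[](1)&.strip`. The early return in the commit is easier to read.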