Address (some) feedback from 2018-08-02 failover attempt

parent 31e02d3a
Pipeline #88500 passed with stage in 16 seconds
......@@ -77,16 +77,25 @@ first.
* GCP: `web-01-sv-gprd.c.gitlab-production.internal`
# Grafana dashboards
# Dashboards and debugging
These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* These dashboards might be useful during the failover:
* Staging:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=stg
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gstg
* Production:
* Azure: https://performance.gitlab.net/dashboard/db/gcp-failover-azure?orgId=1&var-environment=prd
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
* Sentry includes application errors. At present, Azure and GCP log to the same Sentry instance
* Staging: https://sentry.gitlap.com/gitlab/staginggitlabcom/
* Production:
* Workhorse: https://sentry.gitlap.com/gitlab/gitlab-workhorse-gitlabcom/
* Rails (backend): https://sentry.gitlap.com/gitlab/gitlabcom/
* Rails (frontend): https://sentry.gitlap.com/gitlab/gitlabcom-clientside/
* Gitaly (golang): https://sentry.gitlap.com/gitlab/gitaly-production/
* Gitaly (ruby): https://sentry.gitlap.com/gitlab/gitlabcom-gitaly-ruby/
* The logs can be used to inspect any area of the stack in more detail
* https://log.gitlab.net/
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
......@@ -144,15 +153,16 @@ an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
1. [ ] **PRODUCTION ONLY** 🔪 {+ Chef-Runner +}: Silence production alerts
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
* `environment`: `prd`
* `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
- `environment`: `prd`
- `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
......@@ -166,6 +176,7 @@ an hour before the scheduled maintenance window.
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* This cronjob is found on the Pages Azure NFS server. The IPs are shown in the next step
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
......@@ -277,6 +288,7 @@ Running CI jobs will no longer be able to push updates. Jobs that complete now m
* In a separate rails console on the **primary**:
* `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
* The loop may be stopped once sidekiq is shut down
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
......@@ -339,6 +351,7 @@ state of the secondary to converge.
1. [ ] 🐺 {+ Coordinator +}: Now disable all sidekiq-cron jobs on the secondary
* In a dedicated rails console on the **secondary**:
* `loop { Sidekiq::Cron::Job.all.map(&:disable!); sleep 1 }`
* The loop may be stopped once sidekiq is shut down
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the secondary
* Staging: Navigate to [https://gstg.gitlab.com/admin/background_jobs](https://gstg.gitlab.com/admin/background_jobs)
......@@ -405,19 +418,21 @@ of errors while it is being promoted.
1. [ ] 🐘 {+ Database-Wrangler +}: After a timeout of 30 seconds, repmgr should fail over the primary to the chosen node in GCP, and the other nodes should automatically follow.
- [ ] Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
- [ ] Confirm pgbouncer node in GCP (Password is in 1password)
```shell
$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
```
* Staging: `pgbouncer-01-db-gstg`
* Production: `pgbouncer-01-db-gprd`
```shell
$ gitlab-ctl pgb-console
...
pgbouncer# SHOW DATABASES;
# You want to see lines like
gitlabhq_production | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
...
pgbouncer# SHOW SERVERS;
# You want to see lines like
S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
```
1. [ ] 🐘 {+ Database-Wrangler +}: In case automated failover does not occur, perform a manual failover
- [ ] Promote the desired primary
......
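The pgbouncer confirmation step in this hunk relies on reading `SHOW DATABASES` output by eye. As a rough sketch only (not part of this commit), the same check could be scripted in Ruby. The helper name `databases_not_on_primary` is hypothetical, the IP addresses are made up, and the code assumes the pipe-separated layout of the sample lines in the checklist: database name in the first column, backend host in the second, port in the third.

```ruby
# Hypothetical helper, not part of the migration repository.
# Assumes rows of `SHOW DATABASES` output are pipe-separated, with the
# database name in column 1, the backend host in column 2, and the port
# in column 3, as in the sample output in the checklist.
def databases_not_on_primary(show_databases_output, primary_ip)
  show_databases_output.lines.each_with_object([]) do |line, mismatched|
    columns = line.split("|").map(&:strip)
    # Skip prompts, comments, header rows, and "..." lines: only rows whose
    # third column is a numeric port are treated as database entries.
    next unless columns.size >= 3 && columns[2].match?(/\A\d+\z/)

    name, host = columns[0], columns[1]
    mismatched << name unless host == primary_ip
  end
end

# 10.0.0.1 and 10.0.0.2 are made-up addresses for illustration.
sample = <<~OUTPUT
  pgbouncer# SHOW DATABASES;
  gitlabhq_production | 10.0.0.1 | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
  gitlabhq_production_sidekiq | 10.0.0.2 | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
OUTPUT

puts databases_not_on_primary(sample, "10.0.0.1").inspect
```

An empty array would mean every listed database points at the expected primary; any names returned are the entries still pointing elsewhere.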
......@@ -174,7 +174,7 @@
## Schedule the failover
1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +} and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Ask the 🔪 {+ Chef-Runner +}, 🏆 {+ Quality +}, and 🐘 {+ Database-Wrangler +} to perform their preflight tasks
1. [ ] 🐺 {+Coordinator+}: Pick a date and time for the failover itself that won't interfere with the release team's work.
1. [ ] 🐺 {+Coordinator+}: Verify with RMs for the next release that the chosen date is OK
1. [ ] 🐺 {+Coordinator+}: [Create a new issue in the tracker using the "failover" template](https://dev.gitlab.org/gitlab-com/migration/issues/new?issuable_template=failover)
......
......@@ -76,14 +76,7 @@ class Check
end
def check_ssh
result = popen(%W[
ssh -o UserKnownHostsFile=/dev/null \
-o StrictHostKeyChecking=no \
-oConnectTimeout=5 \
-p #{ssh_port} \
-q -T \
git@#{hostname}
])
result = popen(%W[ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -oConnectTimeout=5 -p #{ssh_port} -q -T git@#{hostname}])
if result =~ /Welcome to GitLab/
"Yes"
......@@ -119,7 +112,9 @@ class Check
return nil unless result
orgname = result.lines.find { |l| l =~ /OrgName:/ }
orgname&.split(":", 2)[1].strip
return nil unless orgname
orgname.split(":", 2)[1].strip
end
def curl(url)
......
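The final hunk replaces `orgname&.split(":", 2)[1].strip` with an explicit guard. A minimal sketch of why that matters, assuming the lookup output contains no `OrgName:` line: Ruby's safe-navigation operator only protects the `split` call, so the following `[1]` is still sent to `nil`. The method names `org_name_unguarded` and `org_name_guarded` and the sample strings are invented for this illustration.

```ruby
# Illustrative only; these method names are not from the repository.
def org_name_unguarded(result)
  orgname = result.lines.find { |l| l =~ /OrgName:/ }
  # `&.` guards only `split`; `[1]` is still called on nil when no line matches.
  orgname&.split(":", 2)[1].strip
end

def org_name_guarded(result)
  orgname = result.lines.find { |l| l =~ /OrgName:/ }
  return nil unless orgname

  orgname.split(":", 2)[1].strip
end

# Made-up lookup output using documentation address ranges.
with_org = "NetRange: 192.0.2.0 - 192.0.2.255\nOrgName:        Example Org\n"
without_org = "NetRange: 192.0.2.0 - 192.0.2.255\n"

puts org_name_guarded(with_org)
puts org_name_guarded(without_org).inspect

begin
  org_name_unguarded(without_org)
rescue NoMethodError => e
  puts "unguarded version raised #{e.class}"
end
```

The guarded form returns `nil` for output without an `OrgName:` line, which matches the existing `return nil unless result` convention earlier in the method.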