Verified Commit 4d1ae181 authored by Nick Thomas

Address some feedback from 2018-07-28 failover attempt

parent be2203b4
@@ -114,15 +114,21 @@ These dashboards might be useful during the failover:
# T minus 1 day (Date TBD)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`
- `Reminder: GitLab.com will be undergoing 2 hours maintenance tomorrow, from START_TIME - END_TIME UTC. Follow @gitlabstatus for more details. LINK_TO_BLOG_POST`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- `Reminder: GitLab.com will be undergoing 2 hours maintenance tomorrow. We'll be live on YouTube. Working doc: GOOGLE_DOC_LINK, Blog: LINK_TO_BLOG_POST`
# T minus 3 hours (Date TBD)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
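The step above caps every shared runner's job timeout at one hour. As a hedged, unofficial sketch for the same Rails console, the existing values can be recorded first so they can be put back after the failover window; it assumes only the `Ci::Runner.instance_type` scope already used above:

```ruby
# Capture the current shared-runner timeouts before overwriting them,
# so they can be restored once the failover window is over.
previous_timeouts = Ci::Runner.instance_type.distinct.pluck(:maximum_timeout)
puts "Existing maximum_timeout values: #{previous_timeouts.inspect}"

# The checklist step itself: cap all shared-runner jobs at 3600 seconds.
Ci::Runner.instance_type.update_all(maximum_timeout: 3600)
```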
# T minus 1 hour (Date TBD)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
@@ -223,7 +229,7 @@ you see something happening that shouldn't be public, mention it.
- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking etc, run the below command on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..16}.lb.gitlab.com altssh{01..02}.lb.gitlab.com`
### Health check
@@ -255,21 +261,29 @@ you see something happening that shouldn't be public, mention it.
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
* Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `knife ssh roles:gitlab-base-be-sidekiq-pullmirror "sudo gitlab-ctl stop sidekiq-cluster"`
1. [ ] 🐺 {+ Coordinator +}: Disable Sidekiq crons that may cause updates on the primary
* In a separate rails console on the **primary**:
* `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop (a console check for stray crons is sketched after this list)
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary (a console check is sketched after this list)
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
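To confirm that the cron-disabling loop above is keeping the non-Geo crons off, the following can be run in yet another Rails console on the **primary**. This is a sketch only, assuming the `sidekiq-cron` API (`Sidekiq::Cron::Job#enabled?`) that the loop above already relies on:

```ruby
# List any Sidekiq cron jobs that are still enabled. Apart from the Geo jobs
# deliberately excluded by the disable loop, this should come back empty.
geo_jobs      = ::Gitlab::Geo::CronManager::GEO_JOBS
still_enabled = Sidekiq::Cron::Job.all.select(&:enabled?).map(&:name) - geo_jobs

puts still_enabled.empty? ? "All non-Geo crons disabled" : "Still enabled: #{still_enabled.join(', ')}"
```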
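The admin page shows the same numbers, but the drain can also be watched from a Rails console on the **primary**. A minimal sketch using the standard Sidekiq API (`Sidekiq::Stats`, `Sidekiq::Queue`, `Sidekiq::Workers`):

```ruby
# Overall enqueued and busy counts; both should fall to zero before proceeding.
stats = Sidekiq::Stats.new
puts "enqueued: #{stats.enqueued}, busy: #{Sidekiq::Workers.new.size}"

# Per-queue breakdown, largest first, to spot anything that is stuck.
Sidekiq::Queue.all.sort_by(&:size).reverse.each do |queue|
  puts "#{queue.name}: #{queue.size}" if queue.size > 0
end
```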
...
@@ -33,7 +33,7 @@ function rev_name() {
}
function ssh_port_open() {
  result=$(ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -oConnectTimeout=5 -p "$2" "git@$1" 2>/dev/null)
  if [[ "$result" =~ "Welcome to GitLab" ]]; then
    echo Yes
...