Commit 48caeb10 authored by Andrew Newdigate

Structure only

parent 7fb32ab3
...@@ -9,7 +9,7 @@ shellcheck: ...@@ -9,7 +9,7 @@ shellcheck:
before_script: before_script:
- wget https://storage.googleapis.com/shellcheck/shellcheck-stable.linux.x86_64.tar.xz -O - | xzcat | tar -xv - wget https://storage.googleapis.com/shellcheck/shellcheck-stable.linux.x86_64.tar.xz -O - | xzcat | tar -xv
script: script:
- find ./bin/azure ./bin/gcp -name '*.sh' | xargs ./shellcheck-stable/shellcheck -x - find ./bin/scripts/ -name '*.sh' | xargs ./shellcheck-stable/shellcheck -x
- ./shellcheck-stable/shellcheck -x ./bin/check-script-references ./bin/workflow-script-commons.sh ./bin/source_vars_template.sh ./bin/start-failover-procedure.sh - ./shellcheck-stable/shellcheck -x ./bin/check-script-references ./bin/workflow-script-commons.sh ./bin/source_vars_template.sh ./bin/start-failover-procedure.sh
references: references:
......
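The `shellcheck` job above pipes `find` output into `xargs`. A minimal sketch of the same pattern for local use follows, assuming a hypothetical helper named `lint_shell_scripts` (not part of the repository) and using NUL-delimited file names so that paths containing spaces are passed through intact:

```shell
#!/usr/bin/env bash
# Sketch: run a command over every *.sh file under a directory.
# lint_shell_scripts is a hypothetical helper, not part of the repository.
lint_shell_scripts() {
  local dir="$1"
  shift
  # -print0 / -0 pass file names NUL-delimited, so names with spaces survive.
  find "$dir" -name '*.sh' -print0 | xargs -0 "$@"
}

# Example (assumes shellcheck is on PATH):
#   lint_shell_scripts ./bin/scripts shellcheck -x
```

The command to run is passed as trailing arguments, so the helper can be tried with any tool before pointing it at `shellcheck`.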
...@@ -89,13 +89,13 @@ These dashboards might be useful during the failover: ...@@ -89,13 +89,13 @@ These dashboards might be useful during the failover:
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd * GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) # **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523 1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!! 1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# ** PRODUCTION ONLY** T minus 1 week (Date TBD) # ** PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286 1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google 1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google
...@@ -111,15 +111,15 @@ These dashboards might be useful during the failover: ...@@ -111,15 +111,15 @@ These dashboards might be useful during the failover:
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world 1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD) # T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist 1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`. 1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `./bin/azure/02_failover/t-1d/010_gitlab_twitter_announcement.sh` - Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details 1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `./bin/azure/02_failover/t-1d/020_gitlabstatus_twitter_announcement.sh` - Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (Date TBD) # T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover **STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
...@@ -128,7 +128,7 @@ These dashboards might be useful during the failover: ...@@ -128,7 +128,7 @@ These dashboards might be useful during the failover:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)` * `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (Date TBD) # T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover **STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
...@@ -140,7 +140,7 @@ an hour before the scheduled maintenance window. ...@@ -140,7 +140,7 @@ an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus` 1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__` - `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack: 1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `./bin/azure/02_failover/t-1h/020_slack_announcement.sh` - `/opt/gitlab-migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now. 1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s): 1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd` - `environment`: `prd`
...@@ -188,7 +188,7 @@ an hour before the scheduled maintenance window. ...@@ -188,7 +188,7 @@ an hour before the scheduled maintenance window.
``` ```
# T minus zero (failover day) (Date TBD) # T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now. We expect the maintenance window to last for up to 2 hours, starting from now.
...@@ -245,7 +245,7 @@ you see something happening that shouldn't be public, mention it. ...@@ -245,7 +245,7 @@ you see something happening that shouldn't be public, mention it.
### Prevent updates to the primary ### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary #### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else 1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging * Staging
...@@ -266,7 +266,7 @@ you see something happening that shouldn't be public, mention it. ...@@ -266,7 +266,7 @@ you see something happening that shouldn't be public, mention it.
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost. Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure #### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes 1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'` * Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
...@@ -310,7 +310,7 @@ state of the secondary to converge. ...@@ -310,7 +310,7 @@ state of the secondary to converge.
## Finish replicating and verifying all data ## Finish replicating and verifying all data
#### Phase 3: Draining #### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data): 1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis * [ ] CI traces in Redis
...@@ -365,7 +365,7 @@ of errors while it is being promoted. ...@@ -365,7 +365,7 @@ of errors while it is being promoted.
## Promote the secondary ## Promote the secondary
#### Phase 4: Reconfiguration, Part 1 #### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP 1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging` * Staging: `bin/snapshot-dbs staging`
...@@ -476,8 +476,7 @@ of errors while it is being promoted. ...@@ -476,8 +476,7 @@ of errors while it is being promoted.
## During-Blackout QA ## During-Blackout QA
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
#### Phase 5: Verification, Part 1
The details of the QA tasks are listed in the test plan document. The details of the QA tasks are listed in the test plan document.
...@@ -488,7 +487,7 @@ The details of the QA tasks are listed in the test plan document. ...@@ -488,7 +487,7 @@ The details of the QA tasks are listed in the test plan document.
## Evaluation of QA results - **Decision Point** ## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment #### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
...@@ -528,7 +527,7 @@ unexpected ways. ...@@ -528,7 +527,7 @@ unexpected ways.
## Complete the Migration (T plus 2 hours) ## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing #### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922)) 1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue 1. [ ] `emails_on_push` queue
...@@ -542,7 +541,7 @@ unexpected ways. ...@@ -542,7 +541,7 @@ unexpected ways.
1. [ ] Ensure you receive the email 1. [ ] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2 #### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr 1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time - [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
...@@ -567,14 +566,14 @@ unexpected ways. ...@@ -567,14 +566,14 @@ unexpected ways.
* In a Rails console, run: * In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)` * `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate #### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically) 1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus` 1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube` - `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2 #### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA** This is the second half of the test plan. 1. **Start After-Blackout QA** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded 1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
......
...@@ -177,8 +177,14 @@ Each [team](https://about.gitlab.com/team/chart/) involved in the effort has a l ...@@ -177,8 +177,14 @@ Each [team](https://about.gitlab.com/team/chart/) involved in the effort has a l
## Preparing for a Failover Run ## Preparing for a Failover Run
- 1. **Setup `bin/source_vars`**: `cp ./bin/source_vars_template.sh ./bin/source_vars`
- 1. **Configure `bin/source_vars`**: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's `.gitignore`'d)
- 1. **Setup the workflow issues**": Run `bin/start-failover-procedure.sh`. This will setup several issues in the issue tracker for performing the checks, failover, tests, etc.
-     * Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
+ Before a failover, the coordinator needs to login to the deploy host:
+ * `deploy-01-sv-gprd.c.gitlab-production.internal` for production
+ * `deploy-01-sv-gstg.c.gitlab-staging-1.internal` for staging
+ Then carry out the following steps:
+ 1. **Setup `bin/source_vars`**: `test -f /opt/gitlab-migration/bin/source_vars || sudo cp /opt/gitlab-migration/bin/source_vars_template.sh /opt/gitlab-migration/bin/source_vars`
+ 1. **Configure `vi /opt/gitlab-migration/bin/source_vars`**: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's `.gitignore`'d)
+ 1. **Verify `/opt/gitlab-migration/bin/verify-failover-config`**: You should receive a message indicating success
+ 1. **Setup the workflow issues**": Run `/opt/gitlab-migration/bin/start-failover-procedure.sh`. This will setup several issues in the issue tracker for performing the checks, failover, tests, etc.
+     * Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
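The `__VARIABLE__` substitution described in the last step can be illustrated with a small sketch. This is not the implementation in `bin/start-failover-procedure.sh`, which is not shown in this diff; `render_template` is a hypothetical function, and it assumes the variables have already been set in the current shell (for example by sourcing `bin/source_vars`):

```shell
#!/usr/bin/env bash
# Sketch: replace __NAME__ placeholders in a string with the values of
# same-named shell variables. render_template is a hypothetical helper.
render_template() {
  local template="$1"
  shift
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    # Caveat: on bash 5.2+ a literal '&' in the value may need extra care.
    template=${template//__${name}__/${!name}}
  done
  printf '%s\n' "$template"
}

# Example, with a made-up value:
#   FAILOVER_DATE="some date"
#   render_template '# T minus 1 hour (__FAILOVER_DATE__)' FAILOVER_DATE
```

Placeholders whose names are not passed as arguments are left untouched, which makes unfilled values easy to spot in the generated issue text.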
...@@ -6,7 +6,8 @@ ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )" ...@@ -6,7 +6,8 @@ ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
ISSUE_TEMPLATES_DIR=${ROOT_DIR}/.gitlab/issue_templates ISSUE_TEMPLATES_DIR=${ROOT_DIR}/.gitlab/issue_templates
function find_script_ref() { function find_script_ref() {
- grep -Eho "\`./bin.*?\`" "${ISSUE_TEMPLATES_DIR}"/*.md|cut -d\` -f2|cut -d" " -f1|uniq
+ # shellcheck disable=SC2016
+ grep -Eho "\`/opt/gitlab-migration/bin.*?\`" .gitlab/issue_templates/*.md|sed -E 's#`/opt/gitlab-migration/|`##g'|sort -u
} }
find_script_ref | while IFS='' read -r file; do find_script_ref | while IFS='' read -r file; do
......
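The updated `find_script_ref` pulls every backtick-quoted `/opt/gitlab-migration/bin...` reference out of the issue templates and strips the deploy-host prefix so the result can be checked against the repository tree. The sketch below is a variant of that idea, not the committed code: `extract_script_refs` is a hypothetical name, and it uses a bounded character class rather than `.*?`, since POSIX ERE has no lazy quantifiers and `.*?` under `grep -E` is likely to match greedily. It also stops at the first space, so arguments after a script name are dropped.

```shell
#!/usr/bin/env bash
# Sketch: list repository-relative script paths referenced in Markdown files.
# extract_script_refs is a hypothetical variant of find_script_ref.
extract_script_refs() {
  # shellcheck disable=SC2016
  grep -Eho '`/opt/gitlab-migration/bin[^` ]*' "$@" \
    | sed -E 's#`/opt/gitlab-migration/##' \
    | sort -u
}

# Example:
#   extract_script_refs .gitlab/issue_templates/*.md
```

Each output line is a path relative to the repository root, suitable for an existence check inside a `while read` loop like the one above.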