Commit 3b2e3e0b authored by Andrew Newdigate

Merge branch 'structure-only' into 'master'

Structure only

See merge request gitlab-com/migration!171
parents 7fb32ab3 48caeb10
Pipeline #88479 passed with stage in 26 seconds
......@@ -9,7 +9,7 @@ shellcheck:
before_script:
- wget https://storage.googleapis.com/shellcheck/shellcheck-stable.linux.x86_64.tar.xz -O - | xzcat | tar -xv
script:
- find ./bin/azure ./bin/gcp -name '*.sh' | xargs ./shellcheck-stable/shellcheck -x
- find ./bin/scripts/ -name '*.sh' | xargs ./shellcheck-stable/shellcheck -x
- ./shellcheck-stable/shellcheck -x ./bin/check-script-references ./bin/workflow-script-commons.sh ./bin/source_vars_template.sh ./bin/start-failover-procedure.sh
references:
......
......@@ -89,13 +89,13 @@ These dashboards might be useful during the failover:
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD)
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# **PRODUCTION ONLY** T minus 1 week (Date TBD)
# **PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google
......@@ -111,15 +111,15 @@ These dashboards might be useful during the failover:
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD)
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `./bin/azure/02_failover/t-1d/010_gitlab_twitter_announcement.sh`
- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `./bin/azure/02_failover/t-1d/020_gitlabstatus_twitter_announcement.sh`
- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (Date TBD)
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
......@@ -128,7 +128,7 @@ These dashboards might be useful during the failover:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (Date TBD)
# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
......@@ -140,7 +140,7 @@ an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `./bin/azure/02_failover/t-1h/020_slack_announcement.sh`
- `/opt/gitlab-migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
......@@ -188,7 +188,7 @@ an hour before the scheduled maintenance window.
```
# T minus zero (failover day) (Date TBD)
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
......@@ -245,7 +245,7 @@ you see something happening that shouldn't be public, mention it.
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
......@@ -266,7 +266,7 @@ you see something happening that shouldn't be public, mention it.
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
......@@ -310,7 +310,7 @@ state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis
......@@ -365,7 +365,7 @@ of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
......@@ -476,8 +476,7 @@ of errors while it is being promoted.
## During-Blackout QA
#### Phase 5: Verification, Part 1
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
......@@ -488,7 +487,7 @@ The details of the QA tasks are listed in the test plan document.
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
......@@ -528,7 +527,7 @@ unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
......@@ -542,7 +541,7 @@ unexpected ways.
1. [ ] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Convert the WAL-E node to a standby node in repmgr
- [ ] Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN` - This will take a long time
......@@ -567,14 +566,14 @@ unexpected ways.
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
......
......@@ -177,8 +177,14 @@ Each [team](https://about.gitlab.com/team/chart/) involved in the effort has a l
## Preparing for a Failover Run
1. **Setup `bin/source_vars`**: `cp ./bin/source_vars_template.sh ./bin/source_vars`
1. **Configure `bin/source_vars`**: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's `.gitignore`'d)
1. **Setup the workflow issues**: Run `bin/start-failover-procedure.sh`. This will set up several issues in the issue tracker for performing the checks, failover, tests, etc.
* Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
Before a failover, the coordinator needs to log in to the deploy host:
* `deploy-01-sv-gprd.c.gitlab-production.internal` for production
* `deploy-01-sv-gstg.c.gitlab-staging-1.internal` for staging
Then carry out the following steps:
1. **Setup `bin/source_vars`**: `test -f /opt/gitlab-migration/bin/source_vars || sudo cp /opt/gitlab-migration/bin/source_vars_template.sh /opt/gitlab-migration/bin/source_vars`
1. **Configure `vi /opt/gitlab-migration/bin/source_vars`**: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's `.gitignore`'d)
1. **Verify `/opt/gitlab-migration/bin/verify-failover-config`**: You should receive a message indicating success
1. **Setup the workflow issues**: Run `/opt/gitlab-migration/bin/start-failover-procedure.sh`. This will set up several issues in the issue tracker for performing the checks, failover, tests, etc.
* Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
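The `__VARIABLE__` substitution described above can be sketched roughly as follows. This is a minimal illustration, not the actual logic of `start-failover-procedure.sh`: the `render_template` helper and the sample values are hypothetical, and the real values would come from `bin/source_vars`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: replace each __NAME__ token in a template string with
# the value of the shell variable NAME. Tokens whose names are not listed
# are left untouched.
render_template() {
  local text="$1" name
  for name in "${@:2}"; do
    # ${!name} is indirect expansion: the value of the variable named by $name.
    text="${text//__${name}__/${!name}}"
  done
  printf '%s\n' "$text"
}

# Illustrative values only (assumptions, not taken from the real source_vars).
FAILOVER_DATE="2030-01-01"
MAINTENANCE_END_TIME="12:00"

render_template '# T minus 3 hours (__FAILOVER_DATE__), ends __MAINTENANCE_END_TIME__ UTC' \
  FAILOVER_DATE MAINTENANCE_END_TIME
```

With the sample values, the heading is printed with both placeholders filled in, while any other `__TOKEN__` would pass through unchanged so that a missing variable stays visible in the generated issue.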
......@@ -6,7 +6,8 @@ ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
ISSUE_TEMPLATES_DIR=${ROOT_DIR}/.gitlab/issue_templates
function find_script_ref() {
grep -Eho "\`./bin.*?\`" "${ISSUE_TEMPLATES_DIR}"/*.md|cut -d\` -f2|cut -d" " -f1|uniq
# shellcheck disable=SC2016
grep -Eho "\`/opt/gitlab-migration/bin.*?\`" .gitlab/issue_templates/*.md|sed -E 's#`/opt/gitlab-migration/|`##g'|sort -u
}
find_script_ref | while IFS='' read -r file; do
......
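One thing to watch in `find_script_ref`: POSIX extended regular expressions (`grep -E`) have no non-greedy quantifier, so `.*?` may not stop at the first closing backtick when a line holds several backtick-quoted paths. A hedged alternative is to match up to the next backtick or space explicitly. The sketch below is a variation for illustration, not the committed implementation; the `extract_script_refs` name and the sample input are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical variant of find_script_ref: read markdown on stdin and print
# each backtick-quoted /opt/gitlab-migration/bin path, relative to the
# checkout, one per line and de-duplicated.
extract_script_refs() {
  # [^` ]* stops at the closing backtick or at the first space, so any
  # arguments following the script path are dropped.
  { grep -Eo '`/opt/gitlab-migration/bin[^` ]*' || true; } |
    sed -E 's#^`/opt/gitlab-migration/##' |
    sort -u
}

# Sample input modelled on the issue template lines in this commit.
sample='- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
- Run `/opt/gitlab-migration/bin/start-failover-procedure.sh` and then `/opt/gitlab-migration/bin/verify-failover-config`
- Again: `/opt/gitlab-migration/bin/verify-failover-config`'

printf '%s\n' "$sample" | extract_script_refs
```

The two paths on the same line are reported separately, and the repeated reference to `bin/verify-failover-config` appears only once.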
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# shellcheck disable=SC1091,SC1090
source "${SCRIPT_DIR}/workflow-script-commons.sh"
# --------------------------------------------------------------
echo "Success, you're ready to failover."
......@@ -38,33 +38,43 @@ function gnu_readlink() {
}
function ensure_valid() {
source_vars_file=$(gnu_readlink -f "${SOURCE_VARS_DIR}/source_vars")
grep -Eho '(\w+)="__REQUIRED__"' ./bin/source_vars_template.sh |cut -d= -f1 | while read -r i; do
if [[ ${!i:=__REQUIRED__} = "__REQUIRED__" ]]; then
die "Variable ${i} has not been configured. You may need to update your 'source_vars'"
die "Variable ${i} has not been configured. You may need to update ${source_vars_file}"
fi
done
case ${FAILOVER_ENVIRONMENT} in
"prd") ;;
"stg") ;;
*) die "Unknown environment. Must be 'prd' or 'stg': update ${source_vars_file}";;
esac
FAILOVER_DATE=$(gnu_date --date="$FAILOVER_DATE" "+%Y-%m-%d")
TODAY=$(gnu_date "+%Y-%m-%d")
if [[ "${FAILOVER_DATE}" < "${TODAY}" ]]; then
die "Failover date is in the past ${FAILOVER_DATE}. Have you updated 'source_vars'?"
die "Failover date is in the past ${FAILOVER_DATE}. Have you updated ${source_vars_file}?"
fi
case $(hostname -f) in
"deploy.gitlab.com")
"deploy-01-sv-gprd.c.gitlab-production.internal")
if [[ ${FAILOVER_ENVIRONMENT} != "prd" ]]; then
die "FAILOVER_ENVIRONMENT is ${FAILOVER_ENVIRONMENT}, but environment is detected as production. Have you updated 'source_vars'?"
die "FAILOVER_ENVIRONMENT is ${FAILOVER_ENVIRONMENT}, but environment is detected as production. Have you updated ${source_vars_file}?"
fi
;;
"deploy.stg.gitlab.com")
"deploy-01-sv-gstg.c.gitlab-staging-1.internal")
if [[ ${FAILOVER_ENVIRONMENT} != "stg" ]]; then
die "FAILOVER_ENVIRONMENT is ${FAILOVER_ENVIRONMENT}, but environment is detected as staging. Have you updated 'source_vars'?"
die "FAILOVER_ENVIRONMENT is ${FAILOVER_ENVIRONMENT}, but environment is detected as staging. Have you updated ${source_vars_file}?"
fi
;;
*)
if [[ ${SKIP_HOST_CHECK:=} != "true" ]]; then
die "Unknown host: please run this from a deploy host "
case ${FAILOVER_ENVIRONMENT} in
"prd") die "Unrecognised $(hostname -f): please run this from the GCP deploy host: deploy-01-sv-gprd.c.gitlab-production.internal. Set SKIP_HOST_CHECK=true if you know what you're doing." ;;
"stg") die "Unrecognised $(hostname -f): please run this from the GCP deploy host: deploy-01-sv-gstg.c.gitlab-staging-1.internal. Set SKIP_HOST_CHECK=true if you know what you're doing." ;;
*) die "Unrecognised $(hostname -f). Please review ${source_vars_file}" ;;
esac
fi
;;
esac
......
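The required-variable check in `ensure_valid` relies on bash indirect expansion. A self-contained sketch of that idea is shown below, assuming an inline stand-in for `source_vars_template.sh`; the `unconfigured_vars` helper and the `OPTIONAL_NOTE` variable are hypothetical, and the sketch reports names rather than calling `die`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the template file: variables that must be
# configured are marked with the __REQUIRED__ placeholder.
template='FAILOVER_ENVIRONMENT="__REQUIRED__"
FAILOVER_DATE="__REQUIRED__"
OPTIONAL_NOTE="none"'

# Print the names of required variables that are unset, empty, or still
# set to the placeholder.
unconfigured_vars() {
  local name
  { grep -Eo '^[A-Za-z_][A-Za-z0-9_]*="__REQUIRED__"' <<<"$1" || true; } |
    cut -d= -f1 |
    while read -r name; do
      # ${!name:-__REQUIRED__} reads the variable named by $name and falls
      # back to the placeholder when it is unset or empty.
      if [[ "${!name:-__REQUIRED__}" == "__REQUIRED__" ]]; then
        printf '%s\n' "$name"
      fi
    done
}

FAILOVER_ENVIRONMENT="stg"   # configured; FAILOVER_DATE is deliberately left unset
unconfigured_vars "$template"
```

Here only `FAILOVER_DATE` is reported, since `FAILOVER_ENVIRONMENT` has a real value and `OPTIONAL_NOTE` is not marked as required in the template.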