Commit 4d374cdf authored by Matteo Melli

Merge remote-tracking branch 'origin/master' into database_wrangler_master

Conflicts:
	.gitlab/issue_templates/failover.md
parents c53f5c19 31e02d3a
......@@ -9,7 +9,7 @@ shellcheck:
before_script:
- wget https://storage.googleapis.com/shellcheck/shellcheck-stable.linux.x86_64.tar.xz -O - | xzcat | tar -xv
script:
- find ./bin/azure ./bin/gcp -name '*.sh' | xargs ./shellcheck-stable/shellcheck -x
- find ./bin/scripts/ -name '*.sh' | xargs ./shellcheck-stable/shellcheck -x
- ./shellcheck-stable/shellcheck -x ./bin/check-script-references ./bin/workflow-script-commons.sh ./bin/source_vars_template.sh ./bin/start-failover-procedure.sh
references:
......@@ -18,3 +18,4 @@ references:
- apk add --no-cache bash
script:
- bash -x ./bin/check-script-references
- ./bin/sanity-check-scripts
......@@ -89,13 +89,13 @@ These dashboards might be useful during the failover:
* GCP: https://dashboards.gitlab.net/d/YoKVGxSmk/gcp-failover-gcp?orgId=1&var-environment=gprd
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD)
# **PRODUCTION ONLY** T minus 3 weeks (Date TBD) [📁](bin/scripts/02_failover/010_t-3w)
1. [x] Notify content team of upcoming announcements to give them time to prepare blog post, email content. https://gitlab.com/gitlab-com/blog-posts/issues/523
1. [ ] Ensure this issue has been created on `dev.gitlab.org`, since `gitlab.com` will be unavailable during the real failover!!!
# ** PRODUCTION ONLY** T minus 1 week (Date TBD)
# ** PRODUCTION ONLY** T minus 1 week (Date TBD) [📁](bin/scripts/02_failover/020_t-1w)
1. [x] 🔪 {+ Chef-Runner +}: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
1. [ ] ☎ {+ Comms-Handler +}: communicate date to Google
......@@ -111,15 +111,15 @@ These dashboards might be useful during the failover:
1. [ ] 🔪 {+ Chef-Runner +}: Ensure the GCP environment is inaccessible to the outside world
# T minus 1 day (Date TBD)
# T minus 1 day (Date TBD) [📁](bin/scripts/02_failover/030_t-1d)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`.
- Tweet content from `./bin/azure/02_failover/t-1d/010_gitlab_twitter_announcement.sh`
- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/010_gitlab_twitter_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
- Tweet content from `./bin/azure/02_failover/t-1d/020_gitlabstatus_twitter_announcement.sh`
- Tweet content from `/opt/gitlab-migration/bin/scripts/02_failover/030_t-1d/020_gitlabstatus_twitter_announcement.sh`
# T minus 3 hours (Date TBD)
# T minus 3 hours (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/040_t-3h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
......@@ -128,7 +128,7 @@ These dashboards might be useful during the failover:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (Date TBD)
# T minus 1 hour (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/050_t-1h)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
......@@ -140,7 +140,7 @@ an hour before the scheduled maintenance window.
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `As part of upcoming GitLab.com maintenance work, CI runners will not be accepting new jobs until __MAINTENANCE_END_TIME__ UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: __GOOGLE_DOC_URL__`
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
- `./bin/azure/02_failover/t-1h/020_slack_announcement.sh`
- `/opt/gitlab-migration/bin/scripts/02_failover/050_t-1h/020_slack_announcement.sh`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours starting in an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours starting in an hour from now with the following matcher(s):
- `environment`: `prd`
......@@ -169,31 +169,26 @@ an hour before the scheduled maintenance window.
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after now will be lost.
* Updates to pages after the transfer starts will be lost.
* The user running the rsync _must_ have full sudo access on both azure and gcp pages.
* Very manual, looks a little like the following at present:
* Before you run the commands below, ensure that the ssh key used to ssh to the pages VMs are in your ssh-agent:
```
ssh-add -l # to list keys
ssh-add path/to/ssh/key # if you do not have the key loaded
```
* Staging:
```
ssh -A 10.124.2.8 # nfs5.staging.gitlab.com
ssh 10.133.2.161 # nfs-pages-staging-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 15 -n 1 sudo SSH_AUTH_SOCK=$SSH_AUTH_SOCK rsync -avh -e "ssh -oCompression=no" --rsync-path="sudo rsync" /var/opt/gitlab/gitlab-rails/shared/pages/{} $USER@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/stg_pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gstg.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
* Production:
```
ssh -A 10.70.2.161 # nfs-pages-01.stor.gitlab.com
ssh 10.70.2.161 # nfs-pages-01.stor.gitlab.com
tmux
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 15 -n 1 sudo SSH_AUTH_SOCK=$SSH_AUTH_SOCK rsync -avh -e "ssh -oCompression=no" --rsync-path="sudo rsync" /var/opt/gitlab/gitlab-rails/shared/pages/{} $USER@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
sudo ls -1 /var/opt/gitlab/gitlab-rails/shared/pages | xargs -I {} -P 25 -n 1 sudo rsync -avh -e "ssh -i /root/.ssh/pages_sync -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o Compression=no" /var/opt/gitlab/gitlab-rails/shared/pages/{} git@pages.stor.gprd.gitlab.net:/var/opt/gitlab/gitlab-rails/shared/pages
```
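The Pages sync commands above fan out one `rsync` per top-level directory using `xargs -P`. A minimal sketch of that pattern follows, assuming a hypothetical helper named `for_each_entry_parallel` (not part of this repository). It uses `find -print0` with `xargs -0` instead of `ls -1`, so entry names containing spaces or quotes are passed through intact, and it takes the command as arguments so it can be dry-run with `echo` first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the repository): run a command once per
# top-level entry of a directory, with up to JOBS copies in parallel.
#   $1    directory whose immediate children are processed
#   $2    maximum number of parallel jobs (xargs -P)
#   $3..  command to run; every "{}" argument is replaced by the entry path
for_each_entry_parallel() {
  local src_dir="$1" jobs="$2"
  shift 2
  # -print0 / -0 delimit entries with NUL bytes, so whitespace in names is safe.
  find "${src_dir}" -mindepth 1 -maxdepth 1 -print0 \
    | xargs -0 -I {} -P "${jobs}" "$@"
}
```

For a dry run, something like `for_each_entry_parallel /var/opt/gitlab/gitlab-rails/shared/pages 25 echo rsync -avh {} DEST` prints the commands that would run, where `DEST` stands for the rsync destination used in the runbook.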
# T minus zero (failover day) (Date TBD)
# T minus zero (failover day) (__FAILOVER_DATE__) [📁](bin/scripts/02_failover/060_go/)
We expect the maintenance window to last for up to 2 hours, starting from now.
......@@ -250,7 +245,7 @@ you see something happening that shouldn't be public, mention it.
### Prevent updates to the primary
#### Phase 1: Block non-essential network access to the primary
#### Phase 1: Block non-essential network access to the primary [📁](bin/scripts/02_failover/060_go/p01)
1. [ ] 🔪 {+ Chef-Runner +}: Update HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
* Staging
......@@ -271,7 +266,7 @@ you see something happening that shouldn't be public, mention it.
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Shutdown in Azure
#### Phase 2: Commence Shutdown in Azure [📁](bin/scripts/02_failover/060_go/p02)
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
......@@ -315,7 +310,7 @@ state of the secondary to converge.
## Finish replicating and verifying all data
#### Phase 3: Draining
#### Phase 3: Draining [📁](bin/scripts/02_failover/060_go/p03)
1. [ ] 🐺 {+ Coordinator +}: Ensure any data not replicated by Geo is replicated manually. We know about [these](https://docs.gitlab.com/ee/administration/geo/replication/index.html#examples-of-unreplicated-data):
* [ ] CI traces in Redis
......@@ -370,7 +365,7 @@ of errors while it is being promoted.
## Promote the secondary
#### Phase 4: Reconfiguration, Part 1
#### Phase 4: Reconfiguration, Part 1 [📁](bin/scripts/02_failover/060_go/p04)
1. [ ] ☁ {+ Cloud-conductor +}: Incremental snapshot of database disks in case of failback in Azure and GCP
* Staging: `bin/snapshot-dbs staging`
......@@ -408,7 +403,6 @@ of errors while it is being promoted.
1. [ ] 🔪 {+ Chef-Runner +}: Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
* **STAGING** `knife ssh roles:gstg-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
* **PRODUCTION** **UNTESTED** `knife ssh roles:gprd-base 'sudo chef-client > /tmp/chef-client-log-$(date +%s).txt 2>&1 || echo FAILED'`
1. [ ] 🔪 {+ Chef-Runner +}: Ensure that `gitlab.rb` has the correct `external_url` on all hosts
* Staging: `knife ssh roles:gstg-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
* Production: `knife ssh roles:gprd-base 'sudo cat /etc/gitlab/gitlab.rb 2>/dev/null | grep external_url' | sort -k 2`
......@@ -445,8 +439,7 @@ of errors while it is being promoted.
## During-Blackout QA
#### Phase 5: Verification, Part 1
#### Phase 5: Verification, Part 1 [📁](bin/scripts/02_failover/060_go/p05)
The details of the QA tasks are listed in the test plan document.
......@@ -457,7 +450,7 @@ The details of the QA tasks are listed in the test plan document.
## Evaluation of QA results - **Decision Point**
#### Phase 6: Commitment
#### Phase 6: Commitment [📁](bin/scripts/02_failover/060_go/p06)
If QA has succeeded, then we can continue to "Complete the Migration". If some
QA has failed, the 🐺 {+ Coordinator +} must decide whether to continue with the
......@@ -497,7 +490,7 @@ unexpected ways.
## Complete the Migration (T plus 2 hours)
#### Phase 7: Restart Mailing
#### Phase 7: Restart Mailing [📁](bin/scripts/02_failover/060_go/p07)
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
......@@ -511,7 +504,7 @@ unexpected ways.
1. [ ] Ensure you receive the email
#### Phase 8: Reconfiguration, Part 2
#### Phase 8: Reconfiguration, Part 2 [📁](bin/scripts/02_failover/060_go/p08)
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Ensure priority is updated in repmgr configuration
- [ ] Update in chef cookbooks by removing the setting entirely
......@@ -534,14 +527,14 @@ unexpected ways.
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
#### Phase 9: Communicate
#### Phase 9: Communicate [📁](bin/scripts/02_failover/060_go/p09)
1. [ ] 🐺 {+ Coordinator +}: Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlabstatus`
- `GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We're live on YouTube`
#### Phase 10: Verification, Part 2
#### Phase 10: Verification, Part 2 [📁](bin/scripts/02_failover/060_go/p10)
1. **Start After-Blackout QA** This is the second half of the test plan.
1. [ ] 🏆 {+ Quality +}: Ensure all "after the blackout" QA automated tests have succeeded
......
......@@ -121,7 +121,7 @@
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [ ] [Make GCP accessible to the outside world - TODO]()
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2322)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
......
......@@ -177,8 +177,14 @@ Each [team](https://about.gitlab.com/team/chart/) involved in the effort has a l
## Preparing for a Failover Run
1. **Setup `bin/source_vars`**: `cp ./bin/source_vars_template.sh ./bin/source_vars`
1. **Configure `bin/source_vars`**: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's `.gitignore`'d)
1. **Setup the workflow issues**": Run `bin/start-failover-procedure.sh`. This will setup several issues in the issue tracker for performing the checks, failover, tests, etc.
* Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
Before a failover, the coordinator needs to login to the deploy host:
* `deploy-01-sv-gprd.c.gitlab-production.internal` for production
* `deploy-01-sv-gstg.c.gitlab-staging-1.internal` for staging
Then carry out the following steps:
1. **Setup `bin/source_vars`**: `test -f /opt/gitlab-migration/bin/source_vars || sudo cp /opt/gitlab-migration/bin/source_vars_template.sh /opt/gitlab-migration/bin/source_vars`
1. **Configure `vi /opt/gitlab-migration/bin/source_vars`**: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's `.gitignore`'d)
1. **Verify `/opt/gitlab-migration/bin/verify-failover-config`**: You should receive a message indicating success
1. **Setup the workflow issues**": Run `/opt/gitlab-migration/bin/start-failover-procedure.sh`. This will setup several issues in the issue tracker for performing the checks, failover, tests, etc.
* Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
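The `__VARIABLE__` substitution described above can be sketched roughly as follows. This is an illustration only, not the implementation in `start-failover-procedure.sh`: the function name `render_template` is hypothetical, and it assumes the values are already set as shell variables (for example after sourcing `bin/source_vars`).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the repository): read a template on stdin and
# replace each __NAME__ token with the value of the shell variable NAME.
#   $@  names of the variables to substitute
# With `set -u`, an unset variable aborts instead of silently leaving a gap.
render_template() {
  local content name value
  content="$(cat)"
  for name in "$@"; do
    value="${!name}"   # indirect expansion: the value of the variable named by $name
    # Quoting the pattern and the replacement keeps both literal.
    content=${content//"__${name}__"/"${value}"}
  done
  printf '%s\n' "${content}"
}
```

With `FAILOVER_DATE` set, piping a heading such as `# T minus 3 hours (__FAILOVER_DATE__)` through `render_template FAILOVER_DATE` would produce the heading with the date filled in.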
......@@ -6,7 +6,8 @@ ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
ISSUE_TEMPLATES_DIR=${ROOT_DIR}/.gitlab/issue_templates
function find_script_ref() {
grep -Eho "\`./bin.*?\`" "${ISSUE_TEMPLATES_DIR}"/*.md|cut -d\` -f2|cut -d" " -f1|uniq
# shellcheck disable=SC2016
grep -Eho "\`/opt/gitlab-migration/bin.*?\`" .gitlab/issue_templates/*.md|sed -E 's#`/opt/gitlab-migration/|`##g'|sort -u
}
find_script_ref | while IFS='' read -r file; do
......
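One caveat about the pattern in `find_script_ref`: POSIX extended regular expressions (`grep -E`) have no lazy quantifiers, so `.*?` is likely to behave greedily and may run to the last backtick on a line that contains several inline code spans. A minimal alternative sketch, using a hypothetical function name `extract_script_refs`, stops at the first space or backtick instead:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical variant (not from the repository): list the unique script paths
# referenced as `/opt/gitlab-migration/bin/...` in the given markdown files.
#   $@  markdown files to scan
# Note: grep exits non-zero when nothing matches, which fails under pipefail.
extract_script_refs() {
  # shellcheck disable=SC2016
  grep -Eho '`/opt/gitlab-migration/bin[^` ]*' "$@" \
    | sed -E 's#^`/opt/gitlab-migration/##' \
    | sort -u
}
```

Because the character class excludes both the backtick and the space, any arguments that follow a script path inside the same code span are dropped, which mirrors the `cut -d" " -f1` step of the older version.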
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
function find_scripts() {
find "${ROOT_DIR}/scripts" -type f -name "*.sh"
}
find_scripts | while IFS='' read -r file; do
# Ensures all the ../.. references are correct....
if [[ -x ${file} ]]; then
echo "${file}"
SANITY_CHECK_ONLY=1 "${file}"
fi
done
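The sanity-check loop above runs every executable script with `SANITY_CHECK_ONLY=1`. How the scripts react to that variable is not shown in this diff; the sketch below is one plausible shape, with hypothetical function names, in which a script finishes its setup (including the relative `source` lines) and then skips its real work.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical guard (the real handling in workflow-script-commons.sh is not
# shown in this diff): succeeds when only a sanity check was requested.
sanity_check_requested() {
  [[ "${SANITY_CHECK_ONLY:-}" == "1" ]]
}

# Hypothetical script body: by the time this runs, the script has already been
# parsed and its sourced files loaded, so a broken ../.. path would have failed.
run_step() {
  if sanity_check_requested; then
    echo "sanity check only: skipping step"
    return 0
  fi
  echo "running step"
}
```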
......@@ -4,11 +4,13 @@ set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
UNSYMLINKED_SCRIPT_DIR="$(greadlink -f "${SCRIPT_DIR}" || readlink "${SCRIPT_DIR}" || echo "${SCRIPT_DIR}")"
# shellcheck disable=SC1091,SC1090
source "${SCRIPT_DIR}/../../../workflow-script-commons.sh"
source "${UNSYMLINKED_SCRIPT_DIR}/../../../workflow-script-commons.sh"
# --------------------------------------------------------------
PRODUCTION_ONLY
cat <<EOD
......
......@@ -4,8 +4,9 @@ set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
UNSYMLINKED_SCRIPT_DIR="$(greadlink -f "${SCRIPT_DIR}" || readlink "${SCRIPT_DIR}" || echo "${SCRIPT_DIR}")"
# shellcheck disable=SC1091,SC1090
source "${SCRIPT_DIR}/../../../workflow-script-commons.sh"
source "${UNSYMLINKED_SCRIPT_DIR}/../../../workflow-script-commons.sh"
# --------------------------------------------------------------
......
......@@ -4,8 +4,9 @@ set -euo pipefail
IFS=$'\n\t'
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
UNSYMLINKED_SCRIPT_DIR="$(greadlink -f "${SCRIPT_DIR}" || readlink "${SCRIPT_DIR}" || echo "${SCRIPT_DIR}")"
# shellcheck disable=SC1091,SC1090
source "${SCRIPT_DIR}/../../../workflow-script-commons.sh"
source "${UNSYMLINKED_SCRIPT_DIR}/../../../workflow-script-commons.sh"
# --------------------------------------------------------------
......
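The `UNSYMLINKED_SCRIPT_DIR` lines above try `greadlink -f`, then `readlink`, then fall back to the unresolved path. Where only directories need resolving, a sketch that avoids `readlink` entirely is shown below; `resolve_dir` is a hypothetical name, and it relies on `cd -P` and `pwd -P`, which are specified by POSIX.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the repository): print the physical path of a
# directory with every symlink component resolved.
#   $1  directory path, possibly a symlink
# The subshell keeps the caller's working directory unchanged.
resolve_dir() {
  ( cd -P -- "$1" && pwd -P )
}
```

A script could then compute its directory with something like `resolve_dir "$(dirname "${BASH_SOURCE[0]}")"`. This resolves symlinked directories only; a script that is itself invoked through a symlinked file would still need a `readlink`-style lookup.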