Commit 95fb3e36 authored by Nikolay

Merge branch 'master' of gitlab.com:gitlab-com/migration into db_steps_automation

parents e41380f5 b2288b4c
......@@ -86,6 +86,7 @@ working order, including database replication between the two sites.
1. [ ] ↩️ {+ Fail-back Handler +}: Set the GitLab shared runner timeout back to 3 hours
1. [ ] ↩️ {+ Fail-back Handler +}: Restart automatic incremental GitLab Pages sync
* Enable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, uncomment the line involving rsync
1. [ ] ↩️ {+ Fail-back Handler +}: Update GitLab shared runners to expire jobs after 3 hours
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 10800)`
......
......@@ -114,15 +114,21 @@ These dashboards might be useful during the failover:
# T minus 1 day (Date TBD)
1. [ ] 🐺 {+ Coordinator +}: Perform (or coordinate) Preflight Checklist
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Tweet from `@gitlab`
    - `Reminder: GitLab.com will be undergoing 2 hours of maintenance tomorrow, from START_TIME to END_TIME UTC. Follow @gitlabstatus for more details. LINK_TO_BLOG_POST`
1. [ ] **PRODUCTION ONLY** ☎ {+ Comms-Handler +}: Retweet `@gitlab` tweet from `@gitlabstatus` with further details
    - `Reminder: GitLab.com will be undergoing 2 hours of maintenance tomorrow. We'll be live on YouTube. Working doc: GOOGLE_DOC_LINK, Blog: LINK_TO_BLOG_POST`
# T minus 3 hours (Date TBD)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 3 hours before failover
1. [ ] 🐺 {+ Coordinator +}: Update GitLab shared runners to expire jobs after 1 hour
* In a Rails console, run:
* `Ci::Runner.instance_type.update_all(maximum_timeout: 3600)`
# T minus 1 hour (Date TBD)
**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover
......@@ -137,12 +143,18 @@ an hour before the scheduled maintenance window.
1. [ ] ☎ {+ Comms-Handler +}: Post to #announcements on Slack:
* Staging: `We're rehearsing the failover of GitLab.com in *1 hour* by migrating staging.gitlab.com to GCP. Come watch us at ZOOM_LINK! Notes in GOOGLE_DOC_LINK!`
* Production: `GitLab.com is being migrated to GCP in *1 hour*. There is a 2-hour downtime window. We'll be live on YouTube. Notes in GOOGLE_DOC_LINK!`
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: Create a maintenance window in PagerDuty for [GitLab Production service](https://gitlab.pagerduty.com/services/PATDFCE) for 2 hours, starting an hour from now.
1. [ ] **PRODUCTION ONLY** ☁ {+ Cloud-conductor +}: [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 2 hours, starting an hour from now, with the following matcher(s):
- `environment`: `prd`
1. [ ] 🔪 {+ Chef-Runner +}: Stop any new GitLab CI jobs from being executed
* Block `POST /api/v4/jobs/request`
* Staging
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094
* `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
* Production
* [Create an alert silence](https://alerts.gitlab.com/#/silences/new) for 3 hours (starting now) with the following matchers:
- `environment`: `prd`
- `alertname`: `High4xxApiRateLimit|High4xxRateLimit`, check "Regex"
* https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243
* `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
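* A possible spot-check once `chef-client` has run on the LBs (a sketch only; the exact response to a blocked request depends on the HAProxy rule in the linked MRs):
```shell
# Hypothetical spot-check: POST to the runner job-request endpoint and print the HTTP status.
# Run once before and once after chef-client; the response should change once the LB block is active.
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://staging.gitlab.com/api/v4/jobs/request
```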
- [ ] ☎ {+ Comms-Handler +}: Create a broadcast message
......@@ -156,6 +168,7 @@ an hour before the scheduled maintenance window.
* Production: `bin/snapshot-dbs production`
1. [ ] 🔪 {+ Chef-Runner +}: Stop automatic incremental GitLab Pages sync
* Disable the cronjob on the **Azure** pages NFS server
* `sudo crontab -e` to get an editor window, comment out the line involving rsync
1. [ ] 🔪 {+ Chef-Runner +}: Start parallelized, incremental GitLab Pages sync
* Expected to take ~30 minutes, run in screen/tmux! On the **Azure** pages NFS server!
* Updates to pages after now will be lost.
......@@ -222,7 +235,7 @@ you see something happening that shouldn't be public, mention it.
- [ ] 🐺 {+ Coordinator +}: To monitor the state of DNS, network blocking, etc., run the command below on two machines - one with VPN access, one without.
* Staging: `watch -n 5 bin/hostinfo staging.gitlab.com registry.staging.gitlab.com altssh.staging.gitlab.com gitlab-org.staging.gitlab.io gstg.gitlab.com altssh.gstg.gitlab.com gitlab-org.gstg.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io`
* Production: `watch -n 5 bin/hostinfo gitlab.com registry.gitlab.com altssh.gitlab.com gitlab-org.gitlab.io gprd.gitlab.com altssh.gprd.gitlab.com gitlab-org.gprd.gitlab.io fe{01..16}.lb.gitlab.com altssh{01..02}.lb.gitlab.com`
### Health check
......@@ -254,21 +267,29 @@ you see something happening that shouldn't be public, mention it.
* This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
* Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
* Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
* Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
1. [ ] 🐺 {+ Coordinator +}: Ensure traffic from a non-VPN IP is blocked
* Check the non-VPN `hostinfo` output and verify that the SSH column reads `No` and the REDIRECT column shows it being redirected to the migration blog post
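* A manual spot-check from the non-VPN machine (a sketch; `bin/hostinfo` performs the same checks):
```shell
# From a machine WITHOUT VPN access, SSH should no longer get through and HTTP should redirect.
ssh -o ConnectTimeout=5 git@gitlab.com              # expect a timeout or a refused connection
curl -sI http://gitlab.com | grep -i '^location:'   # expect the migration blog post URL
```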
Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
#### Phase 2: Commence Sidekiq Shutdown in Azure
#### Phase 2: Commence Shutdown in Azure
1. [ ] 🔪 {+ Chef-Runner +}: Stop mailroom on all the nodes
* Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
* Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
1. [ ] 🔪 {+ Chef-Runner +} **PRODUCTION ONLY**: Stop `sidekiq-pullmirror` in Azure
* `knife ssh roles:gitlab-base-be-sidekiq-pullmirror "sudo gitlab-ctl stop sidekiq-cluster"`
1. [ ] 🐺 {+ Coordinator +}: Disable Sidekiq crons that may cause updates on the primary
* In a separate rails console on the **primary**:
* `loop { Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!); sleep 1 }`
* The `geo_sidekiq_cron_config` job or an RSS kill may re-enable the crons, which is why we run it in a loop
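* A possible check that the crons stayed disabled (a sketch; assumes shell access to the primary and the omnibus `gitlab-rails` wrapper):
```shell
# Count non-Geo cron jobs that are still enabled - this should report 0 while the disabling loop runs.
sudo gitlab-rails runner 'puts Sidekiq::Cron::Job.all.count { |j| j.enabled? && !::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }'
```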
1. [ ] 🐺 {+ Coordinator +}: Wait for repository verification on the **primary** to complete
* Staging: https://staging.gitlab.com/admin/geo_nodes - `staging.gitlab.com` node
* Production: https://gitlab.com/admin/geo_nodes - `gitlab.com` node
* Expand the `Verification Info` tab
* Wait for the number of `unverified` repositories to reach 0
* Resolve any repositories that have `failed` verification
1. [ ] 🐺 {+ Coordinator +}: Wait for all Sidekiq jobs to complete on the primary
* Staging: https://staging.gitlab.com/admin/background_jobs
* Production: https://gitlab.com/admin/background_jobs
......@@ -569,7 +590,7 @@ unexpected ways.
#### Phase 7: Restart Mailing
1. [ ] 🔪 {+ Chef-Runner +}: ***PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] 🔪 {+ Chef-Runner +}: **PRODUCTION ONLY** Re-enable mailing queues on sidekiq-asap (revert [chef-repo!1922](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1922))
1. [ ] `emails_on_push` queue
1. [ ] `mailers` queue
1. [ ] (`admin_emails` queue doesn't exist any more)
......@@ -589,6 +610,9 @@ unexpected ways.
- [ ] Update in chef cookbooks by removing the setting entirely
- [ ] Update in the running database
- [ ] On the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
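   - [ ] A possible follow-up query to verify the change (a sketch; assumes the `repl_nodes` table exposes a `name` column alongside `priority`):
```shell
# All nodes should now report priority=100.
gitlab-psql -d gitlab_repmgr -c 'select name, priority from repmgr_gitlab_cluster.repl_nodes'
```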
1. [ ] 🐘 {+ Database-Wrangler +}: **Production only** Reduce `statement_timeout` to 15s.
- [ ] Merge and chef this change: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334
- [ ] Close https://gitlab.com/gitlab-com/migration/issues/686
1. [ ] 🔪 {+ Chef-Runner +}: Convert Azure Pages IP into a proxy server to the GCP Pages LB
* Staging:
* Apply https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270
......
......@@ -115,12 +115,15 @@
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2270)
* [ ] [Make GCP accessible to the outside world](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2112)
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2333)
* Production:
* [ ] [Azure CI blocking](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2243)
* [ ] [Azure HAProxy update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2254)
* [ ] [GCP configuration update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2218)
* [ ] [Azure Pages LB update](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987)
* [ ] [Make GCP accessible to the outside world - TODO]()
* [ ] [Reduce statement timeout to 15s](https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2334)
1. [ ] 🐘 {+ Database-Wrangler +}: Ensure `gitlab-ctl repmgr cluster show` works on all database nodes
......
......@@ -33,7 +33,7 @@ function rev_name() {
}
function ssh_port_open() {
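# Try SSH as git@$1 on port $2; a healthy GitLab SSH endpoint answers with a "Welcome to GitLab" banner.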
result=$(ssh -oConnectTimeout=5 -p "$2" "git@$1" 2>/dev/null)
result=$(ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -oConnectTimeout=5 -p "$2" "git@$1" 2>/dev/null)
if [[ "$result" =~ "Welcome to GitLab" ]]; then
echo Yes
......@@ -51,7 +51,7 @@ function ssh_port() {
}
function http_status() {
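# Issue a HEAD request and pull the status code out of the first response line; fall back to "Error"/"Invalid".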
local status=$((curl --head --connect-timeout 5 --max-time 5 --silent ${1}|head -1|cut -d\ -f2) || echo "Error")
local status=$((curl --insecure --head --connect-timeout 5 --max-time 5 --silent ${1}|head -1|cut -d\ -f2) || echo "Error")
if [[ -z $status ]]; then
echo "Invalid"
else
......@@ -61,7 +61,7 @@ function http_status() {
function redirect() {
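# Print the first 40 characters of the Location header the URL redirects to, or "-" when there is no redirect.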
if ! curl --head --connect-timeout 5 --max-time 5 --silent "$1"|grep Location|sed -E 's/^.*: +//'|cut -c 1-40; then
if ! curl --insecure --head --connect-timeout 5 --max-time 5 --silent "$1"|grep Location|sed -E 's/^.*: +//'|cut -c 1-40; then
echo "-"
fi
}
......
......@@ -10,8 +10,11 @@ else
AZ_RESOURCE_GROUP="PostgresStaging"
fi
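# Collect the URIs of the GCP Postgres data disks, then snapshot them all in one backgrounded gcloud call.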
declare -r disk_list='gcloud compute --project $GCP_PROJECT disks list --filter="name~'^postgres-.*-data'" --uri'
declare -a GCP_DISKS
readarray -t GCP_DISKS < <(gcloud compute --project $GCP_PROJECT disks list --filter="name~'^postgres-.*-data'" --uri)
GCP_DISKS=($(awk -F= '{print $1}' <(eval $disk_list)))
gcloud compute --project $GCP_PROJECT disks snapshot "${GCP_DISKS[@]}" &
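# Then iterate over the Azure Postgres data disks in the resource group.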
for disk_name in `az disk list -g $AZ_RESOURCE_GROUP -o tsv --query "[*].name"`; do
......