# 2018-06-07 staging failover attempt

## Failover Team

| Role | Assigned To |
|---|---|
| | @digitalmoksha |
| | @ahmadsherif |
| | @andrew |
| | @jarv |
| | @ahmadsherif |
| | @meks |
| | @ahmadsherif |
(try to ensure that
## Support Options

| Provider | Plan | Details | Create Ticket |
|---|---|---|---|
| Microsoft Azure | Professional Direct Support | 24x7, email & phone, 1 hour turnaround on Sev A | Create Azure Support Ticket |
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1 hour response on critical issues | Create GCP Support Ticket |
## **PRODUCTION ONLY** T minus 3 weeks (Date TBD)

- Notify the content team of the upcoming announcements to give them time to prepare the blog post and email content: https://gitlab.com/gitlab-com/blog-posts/issues/523
- Ensure this issue has been created on dev.gitlab.org, since gitlab.com will be unavailable during the real failover!
## **PRODUCTION ONLY** T minus 1 week (Date TBD)

- 🔪 Chef-Runner: Scale up the `gprd` fleet to production capacity: https://gitlab.com/gitlab-com/migration/issues/286
- 🐺 Coordinator: Perform the Preflight Checklist: CREATE PREFLIGHT ISSUE HERE
- ☎️ Comms-Handler: Communicate the date to Google
- ☎️ Comms-Handler: Announce the failover date in the #general Slack channel and on the team call
- ☎️ Comms-Handler: Marketing team publishes the blog post about the upcoming GCP failover
- ☎️ Comms-Handler: Marketing team sends an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should cover:
  - Users should expect to re-authenticate after the outage, as authentication cookies will be invalidated by the failover
  - Details of our backup policies, to assure users that their data is safe
  - Details of very long-running CI jobs, which may lose their artifacts and logs if they don't complete before the maintenance window
- ☎️ Comms-Handler: Ensure that the YouTube stream will be available for the Zoom call
- ☎️ Comms-Handler: Tweet the blog post from `@gitlab` and `@gitlabstatus`:
  > Reminder: GitLab.com will be undergoing 2 hours maintenance on Saturday XX June 2018, from START_TIME - END_TIME UTC. Follow @gitlabstatus for more details. LINK_TO_BLOG_POST
- 🔪 Chef-Runner: Ensure the GCP environment is inaccessible to the outside world
- 🏆 Quality-Manager: Manually verify synced attachments on the secondary (this takes ~2 days)
  - Manually resync attachments with missing files, if any
  - Save the upload IDs of all `missing_on_primary` records to compare against after the failover (see the sketch below)
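For that upload-ID bookkeeping, a minimal sketch of how the IDs could be captured on the secondary, assuming the Geo file registry model and `missing_on_primary` flag of this GitLab version:

```bash
# On the Geo secondary: dump the IDs of uploads Geo reports as missing on the
# primary, so the same query can be re-run and compared after the failover.
sudo gitlab-rails runner \
  'puts Geo::FileRegistry.where(missing_on_primary: true).pluck(:file_id)' \
  > /tmp/missing_on_primary_before_failover.txt
wc -l /tmp/missing_on_primary_before_failover.txt
```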
## T minus 1 day (Date TBD)

- 🏆 Quality-Manager: Create the QA testing issue using the template: #6 (closed)
- 🐺 Coordinator: Perform (or coordinate) the Preflight Checklist: #4 (closed)
- **PRODUCTION ONLY** **UNTESTED** Update GitLab shared runners to expire jobs after 1 hour
- **PRODUCTION ONLY** ☎️ Comms-Handler: Tweet from `@gitlab`:
  > Reminder: GitLab.com will be undergoing 2 hours maintenance tomorrow, from START_TIME - END_TIME UTC. Follow @gitlabstatus for more details. LINK_TO_BLOG_POST
- **PRODUCTION ONLY** ☎️ Comms-Handler: Retweet the `@gitlab` tweet from `@gitlabstatus` with further details:
  > Reminder: GitLab.com will be undergoing 2 hours maintenance tomorrow. We'll be live on YouTube. Working doc: LINK_TO_WORKING_DOC, Blog: LINK_TO_BLOG_POST
## T minus 1 hour (Date TBD)

**STAGING FAILOVER TESTING ONLY**: to speed up testing, this step can be done less than 1 hour before failover.

GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail, and those artifacts may be lost. To avoid this as much as possible, we'll stop any new runner jobs from being picked up, starting an hour before the scheduled maintenance window.

- **PRODUCTION ONLY** ☎️ Comms-Handler: Tweet from `@gitlabstatus`:
  > As part of upcoming GitLab.com maintenance work, CI runners will not be accepting next jobs until END_TIME UTC. GitLab.com will undergo maintenance in 1 hour. Working doc: LINK_TO_WORKING_DOC
- ☎️ Comms-Handler: Post to #announcements on Slack:
  - Staging: We're rehearsing the failover of GitLab.com in *1 hour* by migrating staging.gitlab.com to GCP. Come watch us at ZOOM_LINK! Notes in GOOGLE_DOC_LINK!
  - Production: GitLab.com is being migrated to GCP in *1 hour*. There is a 2-hour downtime window. We'll be live on YouTube. Notes in GOOGLE_DOC_LINK!
- 🔪 Chef-Runner: Stop any new GitLab CI jobs from being executed (see the verification sketch after this list)
  - Block `POST /api/v4/jobs/request`: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2094/diffs
  - Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
  - Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- ☎️ Comms-Handler: Create a broadcast message
  - Staging: https://staging.gitlab.com/admin/broadcast_messages
  - Production: https://gitlab.com/admin/broadcast_messages
  - Text: staging.gitlab.com is moving to a new home! Hold on to your hats, we're going dark for approximately 2 hours from XX:XX on 2018-XX-YY
  - Start date: now
  - End date: now + 2 hours
- ☁️ Cloud-conductor: Initial snapshot of database disks in case of failback
  - In Azure
  - In GCP
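One way to confirm the runner job block took effect (a sketch; the exact status code depends on how the HAProxy rule rejects the request, but it should no longer reach the Rails application):

```bash
# Runners poll this endpoint for new jobs; after the HAProxy change the request
# should be rejected at the load balancer.
curl -si -X POST https://staging.gitlab.com/api/v4/jobs/request | head -n 1
curl -si -X POST https://gitlab.com/api/v4/jobs/request | head -n 1   # production
```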
## T minus zero (failover day) (Date TBD)

We expect the maintenance window to last for up to 2 hours, starting from now.

### Failover Procedure

These steps will be run in a Zoom call. Changes are made one at a time and verified before moving on to the next step. Whoever is performing a change should share their screen and explain their actions as they work through them. Everyone else should watch closely for mistakes or errors! A few things to keep an especially sharp eye out for:

- Exposed credentials (except short-lived items like 2FA codes)
- Running commands against the wrong hosts
- Navigating to the wrong pages in web browsers (gstg vs. gprd, etc.)

Remember that the intention is for the call to be broadcast live on the day. If you see something happening that shouldn't be public, mention it.
### Notify Users of Maintenance Window

- **PRODUCTION ONLY** ☎️ Comms-Handler: Tweet from `@gitlabstatus`:
  > GitLab.com will soon shut down for planned maintenance for migration to @GCPcloud. See you on the other side! We'll be live on YouTube
## Prevent updates to the primary

### Phase 1: Block non-essential network access to the primary

- 🔪 Chef-Runner: Update the HAProxy config to allow Geo and VPN traffic over HTTPS and drop everything else
  - https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2029
  - Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo chef-client'`
  - Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo chef-client'`
- 🔪 Chef-Runner: Restart HAProxy on all LBs to terminate any ongoing connections
  - This terminates ongoing SSH, HTTP and HTTPS connections, including AltSSH
  - Staging: `knife ssh -p 2222 roles:staging-base-lb 'sudo systemctl restart haproxy'`
  - Production: `knife ssh -p 2222 roles:gitlab-base-lb 'sudo systemctl restart haproxy'`
- 🔪 Chef-Runner: Apply the HAProxy redirect changes to the GCP node as well
- 🔪 Chef-Runner: Stop mailroom on all the nodes
  - Staging: `knife ssh "role:staging-base-be-mailroom OR role:gstg-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
  - Production: `knife ssh "role:gitlab-base-be-mailroom OR role:gprd-base-be-mailroom" 'sudo gitlab-ctl stop mailroom'`
- 🐺 Coordinator: Ensure traffic from a non-VPN IP is blocked (see the sketch below)
  - **PRODUCTION ONLY** **UNTESTED** AltSSH: `ssh -p 443 git@altssh.gitlab.com`, you should not see `Welcome to GitLab, <name>`
  - SSH: `ssh git@<domain>`, you should not see `Welcome to GitLab, <name>`
  - HTTP: `curl -L http://<domain>`, you should see a 500 response or the deploy page
  - HTTPS: `curl -L https://<domain>`, you should see a 500 response or the deploy page

Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
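The checks above can be scripted from a machine that is not on the VPN; a rough sketch:

```bash
#!/usr/bin/env bash
# Run from a non-VPN host: every check should show the primary as blocked.
domain="staging.gitlab.com"   # use gitlab.com for production

# SSH / AltSSH should not greet us with "Welcome to GitLab"
ssh -o BatchMode=yes -o ConnectTimeout=5 git@"$domain" 2>&1 | grep -q "Welcome to GitLab" && echo "SSH still open!"
ssh -o BatchMode=yes -o ConnectTimeout=5 -p 443 git@altssh.gitlab.com 2>&1 | grep -q "Welcome to GitLab" && echo "AltSSH still open!"

# HTTP/HTTPS should return a 500 or the deploy page, not the normal application response
curl -sL -o /dev/null -w "http  -> %{http_code}\n" "http://$domain"
curl -sL -o /dev/null -w "https -> %{http_code}\n" "https://$domain"
```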
### Phase 2: Commence Sidekiq Shutdown in Azure

- 🐺 Coordinator: Disable the Sidekiq crons that may cause updates on the primary
  - In a Rails console on the primary: `Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:disable!)`
- 🐺 Coordinator: Wait for all Sidekiq jobs to complete on the primary
  - Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
  - Press `Queues -> Live Poll`
  - Wait for all queues not mentioned above to reach 0
  - Wait for the number of `Busy` jobs to reach 0
  - On staging, the repository verification queue may not empty
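A quick way to watch the queues drain without the admin UI, sketched with the standard Sidekiq API via `gitlab-rails runner`:

```bash
# On the primary: print any queue that still has enqueued jobs, plus the busy count.
sudo gitlab-rails runner '
  Sidekiq::Stats.new.queues.each { |name, size| puts "#{name}: #{size}" if size > 0 }
  puts "busy: #{Sidekiq::Workers.new.size}"
'
```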
## Finish replicating and verifying all data

### Phase 3: Draining

- 🐺 Coordinator: Ensure any data not replicated by Geo is replicated manually. We know about these:
  - Container Registry - hopefully this is a shared object storage bucket, in which case this can be removed
  - GitLab Pages - set off the incremental rsync command?
  - CI traces in Redis - run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
- 🐺 Coordinator: Wait for all repositories and wikis to become synchronized
  - Staging: https://gstg.gitlab.com/admin/geo_nodes
  - Production: https://gprd.gitlab.com/admin/geo_nodes
  - Press "Sync Information"
  - Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
  - If failures appear, see the Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
  - On staging, this may not complete
- 🐺 Coordinator: Wait for all repositories and wikis to become verified
  - Press "Verification Information"
  - Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after a resync
  - On staging, verification may not complete
- 🐺 Coordinator: In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
- 🐺 Coordinator: In "Sync Information", wait for "Data replication lag" to read `1m` or less
- 🐺 Coordinator: Now disable all sidekiq-cron jobs on the secondary
  - In a Rails console on the secondary: `Sidekiq::Cron::Job.all.map(&:disable!)`
  - This may race with a `geo_sidekiq_cron_config` job. Run it until it does not
- 🐺 Coordinator: Wait for all Sidekiq jobs to complete on the secondary
  - Staging: Navigate to https://gstg.gitlab.com/admin/background_jobs
  - Production: Navigate to https://gprd.gitlab.com/admin/background_jobs
  - Press `Queues -> Live Poll`
  - Wait for all queues to reach 0, except `emails_on_push` and `mailers` (which are disabled)
  - Wait for the number of `Enqueued` and `Busy` jobs to reach 0
  - **Staging ONLY**: some jobs (e.g., `file_download_dispatch_worker`) may refuse to exit
    - This will prevent the postgresql failover from completing
    - To fix: `knife ssh roles:staging-base-be-sidekiq "gitlab-ctl stop sidekiq"`
- 🐺 Coordinator: Handle Sidekiq jobs in the "retry" state (see the console sketch after this list)
  - Staging: https://staging.gitlab.com/admin/sidekiq/retries
  - Production: https://gitlab.com/admin/sidekiq/retries
  - Delete jobs in idempotent or transient queues (`reactive_caching` or `repository_update_remote_mirror`, for instance)
  - Delete jobs in other queues that are failing due to application bugs (error contains `NoMethodError`, for instance)
  - Press "Retry All" to attempt to retry all remaining jobs immediately
  - Repeat until 0 retries are present
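Clearing the idempotent/transient retries can also be done from a Rails console rather than through the UI. A sketch, assuming the queue names listed above:

```bash
# Drop retries for idempotent/transient queues, then re-run everything else.
# Re-run the command until the final count reaches 0.
sudo gitlab-rails runner '
  transient = %w[reactive_caching repository_update_remote_mirror]
  Sidekiq::RetrySet.new.each { |job| job.delete if transient.include?(job["queue"]) }
  Sidekiq::RetrySet.new.retry_all
  puts "retries left: #{Sidekiq::RetrySet.new.size}"
'
```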
At this point, all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in Sidekiq on the primary or secondary, and if we fail over, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run background synchronization operations against the primary, reducing the chance of errors while it is being promoted.
## Promote the secondary

### Phase 4: Reconfiguration, Part 1

- ☁️ Cloud-conductor: Incremental snapshot of database disks in case of failback
  - In Azure
  - In GCP
- ☁️ Cloud-conductor: Update the DNS entries to refer to the GCP load balancers (verified in the post-promotion checks after this phase)
  - Panel: https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
  - Staging: `staging.gitlab.com` and `registry.staging.gitlab.com` should point to `gstg.gitlab.com`
  - Production:
    - `gitlab.com` and `registry.gitlab.com` should point to `gprd.gitlab.com`
    - `*.githost.io` should point to the new GCP Pages LB
- 🐘 Database-Wrangler: Identify the desired primary in GCP and update its priority in the repmgr database. Run the following on the current primary: `gitlab-psql -d gitlab_repmgr -c "update repmgr_gitlab_cluster.repl_nodes set priority=101 where name='NEW_PRIMARY'"`
- 🐘 Database-Wrangler: Gracefully turn off the Azure postgresql primary instance
  - Keep everything, just ensure it's turned off
  - `$ knife ssh "fqdn:CURRENT_PRIMARY" "gitlab-ctl stop postgresql"`
- 🐘 Database-Wrangler: After a timeout of 30 seconds, repmgr should fail the primary over to the chosen node in GCP, and the other nodes should automatically follow
  - Confirm `gitlab-ctl repmgr cluster show` reflects the desired state
  - Confirm on the pgbouncer node in GCP (password is in 1Password):

    ```
    $ gitlab-ctl pgb-console
    ...
    pgbouncer# SHOW DATABASES;
    # You want to see lines like
    gitlabhq_production         | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 100 | 5 | | 0 | 0
    gitlabhq_production_sidekiq | PRIMARY_IP_HERE | 5432 | gitlabhq_production | | 150 | 5 | | 0 | 0
    ...
    pgbouncer# SHOW SERVERS;
    # You want to see lines like
    S | gitlab | gitlabhq_production | idle | PRIMARY_IP | 5432 | PGBOUNCER_IP | 54714 | 2018-05-11 20:59:11 | 2018-05-11 20:59:12 | 0x718ff0 | | 19430 |
    ```
🐘 Database-Wrangler : Check the database is now read-write- SQL, looking for
F
as the result:select * from pg_is_in_recovery();
- SQL, looking for
- 🔪 Chef-Runner: Update the chef configuration according to https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
- 🔪 Chef-Runner: Run `chef-client` on every node to ensure the Chef changes are applied and all Geo secondary services are stopped
  - Staging: `knife ssh roles:gstg-base 'sudo chef-client'`
  - Production **UNTESTED**: `knife ssh roles:gprd-base 'sudo chef-client'`
  - Ensure that `gitlab.rb` has the correct `external_url` on all hosts
  - Ensure that unicorn / sidekiq / etc. have been restarted on all hosts
- 🐺 Coordinator: Fix the Geo node hostname for the old secondary
  - Staging: https://gstg.gitlab.com/admin/geo_nodes, change the URL of the secondary to `https://azure.staging.gitlab.com`
  - Production: https://gprd.gitlab.com/admin/geo_nodes, change the URL of the secondary to `https://azure.gitlab.com`
  - In case the website can't be reached, issue `GeoNode.where(url: "...").update!(url: "...")` from the new primary's console
- 🐺 Coordinator: Clear the Redis cache of the promoted secondary: `gitlab-rake cache:clear:redis`
- 🐺 Coordinator: Flush any unwanted Sidekiq jobs on the promoted secondary: `Sidekiq::Queue.all.select { |q| %w[emails_on_push mailers].include?(q.name) }.map(&:clear)`
- 🐺 Coordinator: Re-enable sidekiq-cron jobs for the promoted secondary: `Sidekiq::Cron::Job.all.reject { |j| ::Gitlab::Geo::CronManager::GEO_JOBS.include?(j.name) }.map(&:enable!)`
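A combined sanity check for this phase, sketched from commands and values used above (the DNS checks run from any workstation; the rest on the promoted GCP node):

```bash
# From any workstation: the failed-over names should now resolve via the GCP LBs
# (gstg.gitlab.com for staging, gprd.gitlab.com for production).
dig +short staging.gitlab.com
dig +short registry.staging.gitlab.com

# On the new GCP primary: expect "f" (not in recovery, i.e. read-write).
sudo gitlab-psql -d gitlabhq_production -c 'SELECT pg_is_in_recovery();'

# On the promoted secondary: the flushed queues should be empty and sidekiq-cron
# jobs should be registered again.
sudo gitlab-rails runner '
  %w[emails_on_push mailers].each { |name| puts "#{name}: #{Sidekiq::Queue.new(name).size}" }
  puts "sidekiq-cron jobs: #{Sidekiq::Cron::Job.all.size}"
'
```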
## During-Blackout QA

### Phase 5: Verification, Part 1

The details of the QA tasks are listed in the test plan document.

- 🏆 Quality: All "during the blackout" QA automated tests have succeeded
- 🏆 Quality: All "during the blackout" QA manual tests have succeeded
## Evaluation of QA results - Decision Point

### Phase 6: Commitment

If QA has succeeded, then we can continue to "Complete the Migration". If some QA has failed, the team must decide whether to continue with the failover or to fail back.
The top priority is to maintain data integrity. Failing back after the blackout window has ended is very difficult, and will result in any changes made in the interim being lost.
Don't Panic! Consult the failover priority list
Problems may be categorized into three broad causes - "unknown", "missing data", or "misconfiguration". Testers should focus on determining which bucket a failure falls into, as quickly as possible.
Failures with an unknown cause should be investigated further. If we can't determine the root cause within the blackout window, we should fail back.
We should abort for failures caused by missing data unless all the following apply:
- The scope is limited and well-known
- The data is unlikely to be missed in the very short term
- A named person will own back-filling the missing data on the same day
We should abort for failures caused by misconfiguration unless all the following apply:
- The fix is obvious and simple to apply
- The misconfiguration will not cause data loss or corruption before it is corrected
- A named person will own correcting the misconfiguration on the same day
If the number of failures seems high (double digits?), strongly consider failing back even if they each seem trivial - the causes of each failure may interact in unexpected ways.
## Complete the Migration (T plus 2 hours)

### Phase 7: Restart Mailing

- 🔪 Chef-Runner: Re-enable the mailing queues on sidekiq-asap (revert chef-repo!1922)
  - `emails_on_push` queue
  - `mailers` queue
  - (the `admin_emails` queue doesn't exist any more)
- 🔪 Chef-Runner: **PRODUCTION ONLY** Configure mailroom to use the incoming@gitlab.com e-mail address (instead of incoming-gprd@gitlab.com) and restart mailroom
  - Example MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2026
  - Rotate the password of the incoming@gitlab.com account and update the vault
  - Run chef-client and restart mailroom: `$ bundle exec knife ssh role:gprd-base-be-mailroom 'sudo chef-client; sudo gitlab-ctl restart mailroom'`
- 🔪 Chef-Runner: **PRODUCTION ONLY** Start mailroom on all the nodes: `$ bundle exec knife ssh role:gprd-base-be-mailroom 'sudo gitlab-ctl start mailroom'`
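To confirm mailroom actually came back on every node (a sketch):

```bash
# Expect a "run:" status line for mailroom on each gprd mailroom node.
bundle exec knife ssh role:gprd-base-be-mailroom 'sudo gitlab-ctl status mailroom'
```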
### Phase 8: Reconfiguration, Part 2

- 🐘 Database-Wrangler: **Production only** Convert the WAL-E node to a standby node in repmgr
  - Run `gitlab-ctl repmgr standby setup PRIMARY_FQDN`
  - This will take a long time
- 🐘 Database-Wrangler: **Production only** Ensure the priority is updated in the repmgr configuration
  - Update it in the chef cookbooks by removing the setting entirely
  - Update it in the running database: on the primary server, run `gitlab-psql -d gitlab_repmgr -c 'update repmgr_gitlab_cluster.repl_nodes set priority=100'`
- 🔪 Chef-Runner: Convert the Azure Pages IP into a proxy server to the GCP Pages LB
  - Complete the MR at https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
  - Complete a chef-client run on the `gitlab-base-lb-pages` role
- 🔪 Chef-Runner: Make the GCP environment accessible to the outside world (see the reachability check below)
  - Staging: Update https://gitlab.com/gitlab-com/gitlab-com-infrastructure/blob/master/environments/gstg/variables.tf and set `"fe-lb" = [22, 80, 443, 2222]` under the `"public_ports"` variable
  - Production: Update https://gitlab.com/gitlab-com/gitlab-com-infrastructure/blob/master/environments/gprd/variables.tf and set `"fe-lb" = [22, 80, 443, 2222]` under the `"public_ports"` variable
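Once Terraform has applied the change, reachability can be spot-checked from outside the VPN (a sketch, assuming the `gstg.gitlab.com` front-end hostname used elsewhere in this plan):

```bash
# Each port opened on the GCP front-end LB should now accept connections.
for port in 22 80 443 2222; do
  nc -z -w 5 gstg.gitlab.com "$port" && echo "port $port open" || echo "port $port CLOSED"
done
```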
### Phase 9: Communicate

- 🐺 Coordinator: Remove the broadcast message (see the API sketch below)
- **PRODUCTION ONLY** ☎️ Comms-Handler: Tweet from `@gitlabstatus`:
  > GitLab.com's migration to @GCPcloud is almost complete. Site is back up, although we're continuing to verify that all systems are functioning correctly. We'll be live on YouTube
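If the admin UI is unavailable, the broadcast message can also be removed through the API. A sketch, assuming an admin API token in `GITLAB_ADMIN_TOKEN`:

```bash
# List broadcast messages to find the ID, then delete it (admin token required).
curl -s --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/broadcast_messages"
curl -s --request DELETE --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/broadcast_messages/MESSAGE_ID"
```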
### Phase 10: Verification, Part 2

- Start the After-Blackout QA. This is the second half of the test plan.
- 🏆 Quality: Ensure all "after the blackout" QA automated tests have succeeded
- 🏆 Quality: Ensure all "after the blackout" QA manual tests have succeeded

## **PRODUCTION ONLY** Post migration

- ☁️ Cloud-conductor: Disable unneeded resources in the Azure environment
  - The Pages LB proxy must be retained
  - We should retain all filesystem data for a defined period in case of problems (1 week? 3 months?)
  - All machines can be switched off
- 🏆 Quality-Manager: Manually verify all uploads
  - Compare against the saved `missing_on_primary` IDs (see the sketch below)
  - Get missing upload files from the old primary, if needed
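For that comparison, a minimal sketch that re-runs the same query used to capture the IDs before the failover, assuming the Geo registry table is still present and queryable on the promoted node:

```bash
# Re-run the pre-failover query and diff against the saved list of upload IDs.
sudo gitlab-rails runner \
  'puts Geo::FileRegistry.where(missing_on_primary: true).pluck(:file_id)' \
  > /tmp/missing_on_primary_after_failover.txt
diff /tmp/missing_on_primary_before_failover.txt /tmp/missing_on_primary_after_failover.txt
```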
## Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times, we need these steps anyway.
In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.
The priority is to get the Azure site working again as quickly as possible. As the GCP side will be inaccessible, returning it to operation is of secondary importance.
### Fail back to the Azure site

- ↩️ Fail-back Handler: Make the GCP environment inaccessible again, if necessary
  - Staging: Update https://gitlab.com/gitlab-com/gitlab-com-infrastructure/blob/master/environments/gstg/variables.tf and set `"fe-lb" = []` under the `"public_ports"` variable
  - Production: ???
- ↩️ Fail-back Handler: Update the DNS entries to refer to the Azure load balancer
  - Navigate to https://console.aws.amazon.com/route53/home?region=us-east-1#resource-record-sets:Z31LJ6JZ6X5VSQ
  - Staging: Change the `staging.gitlab.com` and `registry.staging.gitlab.com` DNS entries to point to `fe01.stg.gitlab.com`
  - Production: ???
- **OPTIONAL**: Introduce a split-brain in the postgresql cluster
  - Only do this if you want to continue using the GCP site as a primary post-failback
  - Remove all Azure postgres nodes from the GCP repmgr cluster
  - Remove all GCP postgres nodes from the Azure repmgr cluster
- Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
  - Stop postgresql on the GSTG nodes postgres-0{1,3}-db-gstg: `gitlab-ctl stop postgresql`
    - Skip this if introducing a postgresql split-brain
  - Start postgresql on the Azure staging primary node: `gitlab-ctl start postgresql`
  - Ensure `gitlab-ctl repmgr cluster show` reports an Azure node as the primary in Azure:

    ```
    gitlab-ctl repmgr cluster show
    Role      | Name                                            | Upstream                     | Connection String
    ----------+-------------------------------------------------+------------------------------+---------------------------------------------------------------------------------------------------------
    * master  | postgres02.db.stg.gitlab.com                    |                              | host=postgres02.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
      FAILED  | postgres-01-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-01-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
      FAILED  | postgres-03-db-gstg.c.gitlab-staging-1.internal | postgres02.db.stg.gitlab.com | host=postgres-03-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
      standby | postgres01.db.stg.gitlab.com                    | postgres02.db.stg.gitlab.com | host=postgres01.db.stg.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
    ```
  - Reinitialize the Azure standby node
    - Run this in screen / tmux, it can take over an hour. No need to wait for it to complete before continuing
    - `gitlab-ctl repmgr standby setup AZURE_PRIMARY_FQDN -w`
↩ ️ Fail-back Handler : Verify that the DNS update has propagated back online -
↩ ️ Fail-back Handler : Re-enable cronjobs on the primary- Navigate to https://staging.gitlab.com/admin/background_jobs, press "Cron"
- Find the
geo_sidekiq_cron_config_worker
row and press "Enable" on it - All but the Geo-secondary-only queues will be re-enabled
-
↩ ️ Fail-back Handler : Enable access to the azure environment from the outside world
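A quick propagation check for the fail-back (a sketch; expected values per the staging DNS entries above):

```bash
# Staging fail-back: both names should resolve via fe01.stg.gitlab.com again.
dig +short staging.gitlab.com
dig +short registry.staging.gitlab.com
```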
### Restore the GCP site to being a working secondary

- ↩️ Fail-back Handler: Undo the chef-repo changes from https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
  - If the MR was merged, revert it. If the roles were updated from the MR branch, simply switch to master.
  - Then run `bundle exec knife role from file roles/gstg-base-fe-web.json roles/gstg-base.json`
- **OPTIONAL**: Resolve the postgresql cluster split-brain
  - Do this if you introduced a postgresql split-brain while failing back to Azure
  - Add all GCP nodes to the Azure repmgr cluster at lowest priority
  - Add all Azure nodes to the GCP repmgr cluster at highest priority
- Reinitialize the GSTG postgresql nodes that are not fetching WAL-E logs (currently postgres-01-db-gstg.c.gitlab-staging-1.internal and postgres-03-db-gstg.c.gitlab-staging-1.internal) as standbys in the repmgr cluster
  - Remove the old data with `rm -rf /var/opt/gitlab/postgresql/data`
  - Re-initialize the database by running the following. Note: this step can take over an hour; consider running it in a screen/tmux session:

    ```
    # su gitlab-psql -c "/opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby clone --upstream-conninfo 'host=postgres-02-db-gstg.c.gitlab-staging-1.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr' -h postgres-02-db-gstg.c.gitlab-staging-1.internal -d gitlab_repmgr -U gitlab_repmgr"
    ```
  - Start the database with `gitlab-ctl start postgresql`
  - Register the database with the cluster by running `gitlab-ctl repmgr standby register`
- ↩️ Fail-back Handler: Reconfigure every changed gstg node: `bundle exec knife ssh roles:gstg-base "sudo chef-client"`
- ↩️ Fail-back Handler: Clear the cache on the gstg web nodes to correct the broadcast message cache: `sudo gitlab-rake cache:clear:redis`
- ↩️ Fail-back Handler: Verify database replication is working (see the health-check sketch below)
  - Create an issue on the Azure site and wait to see if it replicates successfully to the GCP site
- ↩️ Fail-back Handler: It is now safe to delete the database server snapshots
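Beyond the issue-replication test above, overall secondary health could be checked with the built-in Geo rake task (a sketch):

```bash
# On a GCP (secondary) application node: all Geo checks should report OK.
sudo gitlab-rake gitlab:geo:check

# On a GCP database node: the cluster should show the Azure primary as upstream.
sudo gitlab-ctl repmgr cluster show
```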