Useful Links | 📖 [**GCP Migration Project Doc**](https://docs.google.com/document/d/1p3Brri44_SKyakViKB-LGWCmCcwILW6z2A8a8eWFyFc/edit) | 📖 [**GCP Migration Weekly Call**](https://docs.google.com/document/d/1G2PaQqvYsht2oXStNDCOMw5wZetzbKaTi-t32m3fTcc/edit) | 📁 [**GCP Project Docs**](https://drive.google.com/open?id=1mkpbzwJXmALNVYFPC666bh21R05qOF-e) | 📘 [**Architecture Docs**](https://drive.google.com/drive/u/0/folders/1v-gy_x98FbUi2bemWSgncdv1K4p3OQh1) | 📘 [**Status Reports**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Status%20Report)

# GitLab GCP Migration Project

## Why are we doing this?

We see a number of advantages in moving from Azure to Google Cloud Platform (GCP):

1. Reliability and performance
    1. GCP offers a low-latency [10Gbps](https://cloud.google.com/interconnect/) interconnect across the board.
    2. GCP offers a global Anycast network as part of their [load balancing](https://cloud.google.com/load-balancing/) service.
    3. GCP also has a track record of exceeding their uptime SLAs for compute VMs.
2. Google Kubernetes Engine

    GitLab 10.1 introduced [built-in support for Google Kubernetes Engine](https://docs.gitlab.com/ce/user/project/clusters/). We expect GKE usage to grow significantly, and it makes sense to bring GitLab.com closer to GCP.
3. Pricing

    Google offers [sustained use discounts](https://cloud.google.com/compute/docs/sustained-use-discounts) and [per second billing](https://cloudplatform.googleblog.com/2017/09/extending-per-second-billing-in-google.html), which has saved us a significant amount with shared runners on GitLab.com.

Related articles:

* https://venturebeat.com/2018/04/06/why-and-how-gitlab-abandoned-microsoft-azure-for-google-cloud/

## Goal

Goals of the GCP Migration Project, in order of descending priority (most important goals at the top):

1. Use the opportunity of an inter-cloud migration to make GitLab.com suitable for mission-critical client workloads
1. Migrate GitLab.com from the Microsoft Azure Cloud platform to the Google Cloud while keeping downtime to a minimum
1. Use the same helm charts for GitLab.com as our EEP customers use
    1. The goal here is for customers to be able to spin up a 10 person GitLab EEP instance in Kubernetes and scale it up to 100k users (or more) with little effort.
1. Use the migration as a marketing opportunity for GitLab Inc through creation of technical content

More details are available in the [**GCP Migration Project Doc**](https://docs.google.com/document/d/1p3Brri44_SKyakViKB-LGWCmCcwILW6z2A8a8eWFyFc/edit).

## Failover

The GCP Migration project relies heavily on the [GitLab Geo](https://about.gitlab.com/features/gitlab-geo/) feature to maintain a secondary GitLab instance in Google Cloud Platform (GCP). The process of promoting the secondary instance in GCP to primary and switching DNS over to point to the new primary in GCP is called Planned Failover.

### Failover Documentation

The failover procedure is documented as issue templates:

| Document | Description | Instances per Failover |
| -------- | ----------- | ---------------------- |
| [`failover.md`](https://gitlab.com/gitlab-com/migration/blob/master/.gitlab/issue_templates/failover.md) | The primary failover tracker. | One |
| [`preflight_checks.md`](https://gitlab.com/gitlab-com/migration/blob/master/.gitlab/issue_templates/preflight_checks.md) | The pre-flight checklist. | One or two |
| [`test_plan.md`](https://gitlab.com/gitlab-com/migration/blob/master/.gitlab/issue_templates/test_plan.md) | The quality assurance test document. | One |
| [`Runbooks`](https://gitlab.com/gitlab-com/migration/blob/master/runbooks/README.md) | The runbooks to resolve issues. | N/A |

### Failover Roles

Staging failovers, or rehearsals, will alternate between the lead and the backups.
The production failover will be run by the lead, unless they are unable to attend for some reason.

| Role | Description | Lead | Backup | Access Required |
| ---- | ----------- | ---- | ------ | --------------- |
| 🐺 Coordinator | The conductor of the event. Additionally responsible for replication and verification of all data. | @nick.thomas | @toon, @digitalmoksha | admin & rails |
| 🔪 Chef-Runner | Snapshots staging machines, changes `gitlab.rb`, executes `gitlab-ctl` commands (through chef/knife) | @ahmadsherif | @eReGeBe | ssh & chef |
| ☎️ Comms-Handler | External comms | @dawsmith | | twitter |
| 🐘 Database-Wrangler | Complete the migration | @ibaum | @jarv | ssh & chef |
| ☁️ Cloud-conductor | Changes settings in GCP and Azure consoles. Handles DNS changes | @ahmadsherif | @eReGeBe | azure & gcp console |
| 🏆 Quality-Manager | Owns the during- and post-failover quality assurance | @meks | @rymai | admin |
| ↩️ Fail-back Handler (_Staging Only_) | Fail-back, discarding changes to GCP | @ahmadsherif | @eReGeBe | azure & gcp |
| 🎩 Head-Honcho (_Production Only_) | Executive-level decision maker | @edjdev | @sytses | |

### Failover Priorities

The [GCP Migration goals](#goal) are stated above. However, the failover is complex and technical issues may arise. In order to make decisions quickly, these are the priorities for the failover, in order of descending priority:

1. **Protect the integrity of data**
1. Ensure that all **critical features are functioning correctly**
    * For a list of what's considered "critical", review the "during blackout" features in the [QA Plan](https://docs.google.com/spreadsheets/d/15AtBb6s2p_HvtUe5G9GUSc2ngt69X8dO-418zMuT4us/edit)
1. **Migrate GitLab.com** from Azure to Google Cloud Platform
1. Ensure that **all features are functioning correctly**
1. Do not exceed the **time limits of the announced blackout** window

## Project Process

### Label Taxonomy

#### Workflow ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/571221))

| Status | Description | Label |
| ------ | ----------- | ----- |
| Planning | Issue not ready for assignment or execution | ~"Planning" |
| Ready | Issue is ready for execution, awaiting assignment | ~"Ready" |
| Blocked | Issue is blocked. When you are blocked, please signal by assigning this label and clearly indicating the blocker. | ~"blocked" |
| In Progress | Issue is being actively worked on | ~"In Progress" |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=244557896&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

#### Sequencing ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/572687))

Most issues can be broadly broken down into pre-migration or post-migration tasks, depending on whether they need to be undertaken before the failover event, or after.

| Sequencing | Label | Board |
| ---------- | ----- | ----- |
| Premigration | ~"Premigration" | [Premigration Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Premigration) |
| Postmigration | ~"Postmigration" | [Postmigration Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Postmigration) |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=300715984&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

#### Workstreams ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/572785))

Issues are categorized into several streams of work.

| Workstream | Label |
| ---------- | ----- |
| Failover Testing | ~"Workstream: Failover Testing" |
| Logging and Monitoring | ~"Workstream: Logging and Monitoring" |
| Post Failover | ~"Workstream: Post Failover" |
| Staging | ~"Workstream: Staging" |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=1053495026&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

#### Teams ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/571296))

Each [team](https://about.gitlab.com/team/chart/) involved in the effort has a label associated with the issues they are responsible for.

| Team | Label | Board |
| ---- | ----- | ----- |
| [Production](https://about.gitlab.com/handbook/infrastructure/production/) | ~"Team:Production" | [Production Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3AProduction) |
| Geo | ~"Team:Geo" | [Geo Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3AGeo) |
| [Security](https://about.gitlab.com/handbook/engineering/security) | ~"Team:Security" | [Security Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3ASecurity) |
| [Quality](https://about.gitlab.com/handbook/quality/) | ~"Team:Quality" | [Quality Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3AQuality) |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=288575789&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

## Issue Triage Queries

1. [**Issues without Labels**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=No+Label) - check for untriaged issues
1. [**In Progress, No Milestone**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=In%20Progress&milestone_title=No+Milestone) - check for issues that are ~"In Progress" but unscheduled
1. [**In Progress, No Assignee**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=In%20Progress&assignee_id=0) - check for issues that are ~"In Progress" without an assignee
1. [**In Progress Issues**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=In+Progress&scope=all&sort=updated_desc&state=opened) - check for issues that have been ~"In Progress" for too long
1. [**Ready Issues without Weight**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Ready&weight=No+Weight) - issues that are ~Ready, but have not been assigned a weight
1. [**Ready Issues with a Started Milestone**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Ready&milestone_title=%23started) - upcoming scheduled work
1. [**Issues Awaiting More Information**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=Awaiting+Update) - issues that appear to have stalled and are awaiting more information from the assignee or another team member
1. [**Deadlocked Issues**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=Deadlocked) - issues that are not making progress towards resolution
1. [**Failover Originated**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=Failover+Originated) - issues that were raised through the failover rehearsal

### Eisenhower Decision Matrix Triage

1. [**Do**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3AHigh&label_name[]=Urgency%3AHigh) - Do it now. Issues that are ~"Importance:High" and ~"Urgency:High"
1. [**Decide**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3AHigh&label_name[]=Urgency%3ALow) - Schedule a time to do it. Issues that are ~"Importance:High" and ~"Urgency:Low"
1. [**Delegate**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3ALow&label_name[]=Urgency%3AHigh) - Who can do it for you? Issues that are ~"Importance:Low" and ~"Urgency:High"
1. [**Delete**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3ALow&label_name[]=Urgency%3ALow) - Eliminate it. Issues that are ~"Importance:Low" and ~"Urgency:Low"

## Related Projects

1. **Cloud Native GitLab Helm Charts**: https://gitlab.com/charts/helm.gitlab.io
1. **Automate the lifecycle of environments for GitLab.com**: https://gitlab.com/gitlab-com/environments
1. **GitLab.com Infrastructure**: https://gitlab.com/gitlab-com/infrastructure
1. **GitLab CE**: https://gitlab.com/gitlab-org/gitlab-ce

## Preparing for a Failover Run

Before a failover, the coordinator needs to log in to the deploy host:

* `deploy-01-sv-gprd.c.gitlab-production.internal` for production
* `deploy-01-sv-gstg.c.gitlab-staging-1.internal` for staging

Then carry out the following steps:

1. **Set up `bin/source_vars`**: `test -f /opt/gitlab-migration/bin/source_vars || sudo cp /opt/gitlab-migration/bin/source_vars_template.sh /opt/gitlab-migration/bin/source_vars`
1. **Configure `vi /opt/gitlab-migration/bin/source_vars`**: The variables are explained in the file. Since it contains secrets, this file should not be checked in (it is `.gitignore`'d).
1. **Verify `/opt/gitlab-migration/bin/verify-failover-config`**: You should receive a message indicating success.
1. **Set up the workflow issues**: Run `/opt/gitlab-migration/bin/start-failover-procedure.sh`. This will set up several issues in the issue tracker for performing the checks, failover, tests, etc.
    * Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.

### Migration scripts

1. Prepare the `env_` file, pointing the environment variables to the correct hosts.
1. Step scripts are mapped in the `steps_` file in order of execution (to define failback steps, add the `_failback` suffix to the role).
1. To run the runbook script menu, use the `migration` script:

    ```shell
    bash bin/migration
    ```
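
The `__VARIABLE__` substitution mentioned under "Preparing for a Failover Run" can be illustrated with a minimal sketch. This is not the code in `start-failover-procedure.sh`; it assumes that `bin/source_vars` is a shell file of `NAME=value` assignments that can be sourced. The function name `render_template`, the variable names, and the template text are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of __VARIABLE__ substitution; not the project's script.

# render_template TEMPLATE_FILE VARS_FILE
# Prints TEMPLATE_FILE with every __NAME__ placeholder replaced by the value
# of NAME as defined in VARS_FILE (assumed to be sourceable shell assignments).
render_template() {
  local template_file="$1" vars_file="$2"
  (
    # Run in a subshell so the sourced values do not leak into the caller.
    . "$vars_file"
    content="$(cat "$template_file")"
    # Collect the distinct placeholders that appear in the template.
    for placeholder in $(grep -o '__[A-Z][A-Z0-9_]*__' "$template_file" | sort -u); do
      name="${placeholder#__}"
      name="${name%__}"
      # Substitute only when the variable is set; unknown placeholders stay
      # visible so they can be spotted and filled in by hand.
      if [ -n "${!name+set}" ]; then
        content=${content//"$placeholder"/"${!name}"}
      fi
    done
    printf '%s\n' "$content"
  )
}

# Usage with made-up variable names and a throwaway template:
tmp="$(mktemp -d)"
printf 'ENVIRONMENT=staging\nCOORDINATOR=example-user\n' > "$tmp/source_vars"
printf 'Failover of __ENVIRONMENT__ run by __COORDINATOR__ (__UNSET_VALUE__)\n' > "$tmp/template.md"
render_template "$tmp/template.md" "$tmp/source_vars"
```

Leaving unmatched placeholders untouched is a deliberate choice in this sketch: a literal `__UNSET_VALUE__` in a generated issue is easier to notice than an empty string.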