Useful Links |
📖 [**GCP Migration Project Doc**](https://docs.google.com/document/d/1p3Brri44_SKyakViKB-LGWCmCcwILW6z2A8a8eWFyFc/edit) |
📖 [**GCP Migration Weekly Call**](https://docs.google.com/document/d/1G2PaQqvYsht2oXStNDCOMw5wZetzbKaTi-t32m3fTcc/edit) |
📁 [**GCP Project Docs**](https://drive.google.com/open?id=1mkpbzwJXmALNVYFPC666bh21R05qOF-e) |
📘 [**Architecture Docs**](https://drive.google.com/drive/u/0/folders/1v-gy_x98FbUi2bemWSgncdv1K4p3OQh1) |
📘 [**Status Reports**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Status%20Report)

# GitLab GCP Migration Project

## Why are we doing this?

We see a number of advantages in moving from Azure to Google Cloud Platform (GCP):

1. Reliability and performance

    1. GCP offers a low-latency [10Gbps](https://cloud.google.com/interconnect/) interconnect across the board.
    2. GCP offers a global Anycast network as part of their [load balancing](https://cloud.google.com/load-balancing/) service.
    3. GCP also has a track record of exceeding their uptime SLAs for compute VMs.

2. Google Kubernetes Engine

    GitLab 10.1 introduced [built-in support for Google Kubernetes Engine](https://docs.gitlab.com/ce/user/project/clusters/).
    We expect GKE usage to grow significantly, and it makes sense to bring GitLab.com closer to GCP.

3. Pricing

    Google offers [sustained use discounts](https://cloud.google.com/compute/docs/sustained-use-discounts) and
    [per second billing](https://cloudplatform.googleblog.com/2017/09/extending-per-second-billing-in-google.html), which
    has saved us a significant amount with shared runners on GitLab.com.

Related articles:

* https://venturebeat.com/2018/04/06/why-and-how-gitlab-abandoned-microsoft-azure-for-google-cloud/

## Goal

The goals of the GCP Migration Project, in descending order of priority (most important first):

1.  Use the opportunity of an inter-cloud migration to make GitLab.com suitable for mission critical client workloads
1.  Migrate GitLab.com from the Microsoft Azure Cloud platform to the Google Cloud while keeping downtime to a minimum
1.  Use the same helm charts for GitLab.com as our EEP customers use
    * The goal here is for customers to be able to spin up a 10 person GitLab EEP instance in Kubernetes and scale it up to 100k users (or more) with little effort.
1.  Use the migration as a marketing opportunity for GitLab Inc through creation of technical content

More details are available in the [**GCP Migration Project Doc**](https://docs.google.com/document/d/1p3Brri44_SKyakViKB-LGWCmCcwILW6z2A8a8eWFyFc/edit).

## Failover

The GCP Migration project relies heavily on the [GitLab Geo](https://about.gitlab.com/features/gitlab-geo/) feature to maintain a secondary GitLab instance in Google Cloud Platform (GCP).

The process of promoting the secondary instance in GCP to primary, and switching DNS over to point to the new primary in GCP, is called a Planned Failover.

### Failover Documentation

The failover procedure is documented as issue templates:

| Document                                                                                                                 | Description                          | Instances per Failover |
| ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------ | ---------------------- |
| [`failover.md`](https://gitlab.com/gitlab-com/migration/blob/master/.gitlab/issue_templates/failover.md)                 | The primary failover tracker.        | One                    |
| [`preflight_checks.md`](https://gitlab.com/gitlab-com/migration/blob/master/.gitlab/issue_templates/preflight_checks.md) | The pre-flight checklist.            | One or two             |
| [`test_plan.md`](https://gitlab.com/gitlab-com/migration/blob/master/.gitlab/issue_templates/test_plan.md)               | The quality assurance test document. | One                    |
| [`Runbooks`](https://gitlab.com/gitlab-com/migration/blob/master/runbooks/README.md)                                     | The runbooks to resolve issues.      | N/A                    |

### Failover Roles

Staging failovers, or rehearsals, will alternate between the lead and the backups. The production failover will be run by the lead, unless they are unable to attend for some reason.

| Role                                  | Description                                                                                        | Lead         | Backup                | Access Required     |
| ------------------------------------- | -------------------------------------------------------------------------------------------------- | ------------ | --------------------- | ------------------- |
| 🐺 Coordinator                        | The conductor of the event. Additionally responsible for replication and verification of all data. | @nick.thomas | @toon, @digitalmoksha | admin & rails       |
| 🔪 Chef-Runner                        | Snapshot staging machines, changes `gitlab.rb`, executes `gitlab-ctl` command (through chef/knife) | @ahmadsherif | @eReGeBe              | ssh & chef          |
| ☎️ Comms-Handler                      | External comms                                                                                     | @dawsmith    |                       | twitter             |
| 🐘 Database-Wrangler                  | Complete the migration                                                                             | @ibaum       | @jarv                 | ssh & chef          |
| ☁️ Cloud-conductor                    | Changes settings in GCP and Azure consoles. Handles DNS changes                                    | @ahmadsherif | @eReGeBe              | azure & gcp console |
| 🏆 Quality-Manager                    | Owns the during- and post- failover quality assurance                                              | @meks        | @rymai                | admin               |
| ↩️ Fail-back Handler (_Staging Only_) | Fail-back, discarding changes to GCP                                                               | @ahmadsherif | @eReGeBe              | azure & gcp         |
| 🎩 Head-Honcho (_Production Only_)    | Executive-level decision maker                                                                     | @edjdev      | @sytses               |                     |

### Failover Priorities

The [GCP Migration goals](#goal) are stated above. However, the failover is complex and technical issues may arise. In order to make decisions quickly, these are the priorities for the failover, in order of descending priority:

1. **Protect the integrity of data**
1. Ensure that all **critical features are functioning correctly**
   * For a list of what's considered "critical", review the "during blackout" features in the [QA Plan](https://docs.google.com/spreadsheets/d/15AtBb6s2p_HvtUe5G9GUSc2ngt69X8dO-418zMuT4us/edit)
1. **Migrate GitLab.com** from Azure to Google Cloud Platform
1. Ensure that **all features are functioning correctly**
1. Do not exceed the **time limits of the announced blackout** window

## Project Process

### Label Taxonomy

#### Workflow ([️🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/571221))

| Status      | Description                                                                                                      | Label          |
| ----------- | ---------------------------------------------------------------------------------------------------------------- | -------------- |
| Planning    | Issue not ready for assignment or execution                                                                      | ~"Planning"    |
| Ready       | Issue is ready for execution, awaiting assignment                                                                | ~"Ready"       |
| Blocked     | Issue is blocked. When you are blocked please signal by assigning this label and clearly indicating the blocker. | ~"blocked"     |
| In Progress | Issue is being actively worked on                                                                                | ~"In Progress" |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=244557896&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

#### Sequencing ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/572687))

Most issues can be broadly broken down into pre-migration or post-migration tasks, depending on whether they need to be undertaken before the failover event, or after.

| Sequencing    | Label            | Board                                                                                                            |
| ------------- | ---------------- | ---------------------------------------------------------------------------------------------------------------- |
| Premigration  | ~"Premigration"  | [Premigration Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Premigration)   |
| Postmigration | ~"Postmigration" | [Postmigration Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Postmigration) |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=300715984&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

#### Workstreams ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/572785))

Issues are categorized into several streams of work.

| Workstream             | Label                                 |
| ---------------------- | ------------------------------------- |
| Failover Testing       | ~"Workstream: Failover Testing"       |
| Logging and Monitoring | ~"Workstream: Logging and Monitoring" |
| Post Failover          | ~"Workstream: Post Failover"          |
| Staging                | ~"Workstream: Staging"                |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=1053495026&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

#### Teams ([🗺️ Board](https://gitlab.com/gitlab-com/migration/boards/571296))

Each [team](https://about.gitlab.com/team/chart/) involved in the effort has a label associated with the issues they are responsible for.

| Team                                                                       | Label              | Board                                                                                                                  |
| -------------------------------------------------------------------------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| [Production](https://about.gitlab.com/handbook/infrastructure/production/) | ~"Team:Production" | [Production Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3AProduction) |
| Geo                                                                        | ~"Team:Geo"        | [Geo Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3AGeo)               |
| [Security](https://about.gitlab.com/handbook/engineering/security)         | ~"Team:Security"   | [Security Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3ASecurity)     |
| [Quality](https://about.gitlab.com/handbook/quality/)                      | ~"Team:Quality"    | [Quality Team Workflow Board](https://gitlab.com/gitlab-com/migration/boards/571221?label_name[]=Team%3AQuality)       |

![](https://docs.google.com/spreadsheets/d/e/2PACX-1vQAirka6fKQKXd65cc76vzIBUUZmOzGIaQ1ZuWNhmQsIJXvSVBpX_-Gc9DOXw-sZ5TMY31KIUWwndPK/pubchart?oid=288575789&format=image)

[Burndown from 15 May 2018](https://docs.google.com/spreadsheets/d/1H9h5fLzGpOkdXnledNWG0-_8iaTss1Cb9K784qr8qTs/edit).

## Issue Triage Queries

1.  [**Issues without Labels**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=No+Label) - check for untriaged issues
1.  [**In Progress, No Milestone**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=In%20Progress&milestone_title=No+Milestone) - check for issues that are ~"In Progress" but unscheduled
1.  [**In Progress, No Assignee**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=In%20Progress&assignee_id=0) - check for issues that are ~"In Progress" without an assignee
1.  [**In Progress Issues**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=In+Progress&scope=all&sort=updated_desc&state=opened) - check for issues that have been ~"In Progress" for too long
1.  [**Ready Issues without Weight**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Ready&weight=No+Weight) - issues that are ~Ready, but have not been assigned a weight
1.  [**Ready Issues with a Started Milestone**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Ready&milestone_title=%23started) - upcoming scheduled work
1.  [**Issues Awaiting More Information**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=Awaiting+Update) - issues that appear to have stalled and are awaiting more information from the assignee or another team member
1.  [**Deadlocked Issues**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=Deadlocked) - issues that are not making progress towards resolution
1.  [**Failover Originated**](https://gitlab.com/gitlab-com/migration/issues?label_name%5B%5D=Failover+Originated) - issues that were raised through the failover rehearsal

### Eisenhower Decision Matrix Triage

1.  [**Do**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3AHigh&label_name[]=Urgency%3AHigh) - Do it now. Issues that are ~"Importance:High" and ~"Urgency:High"
1.  [**Decide**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3AHigh&label_name[]=Urgency%3ALow) - Schedule a time to do it. Issues that are ~"Importance:High" and ~"Urgency:Low"
1.  [**Delegate**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3ALow&label_name[]=Urgency%3AHigh) - Who can do it for you? Issues that are ~"Importance:Low" and ~"Urgency:High"
1.  [**Delete**](https://gitlab.com/gitlab-com/migration/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Importance%3ALow&label_name[]=Urgency%3ALow) - Eliminate it. Issues that are ~"Importance:Low" and ~"Urgency:Low"

## Related Projects

1.  **Cloud Native GitLab Helm Charts**: https://gitlab.com/charts/helm.gitlab.io
1.  **Automate the lifecycle of environments for GitLab.com**: https://gitlab.com/gitlab-com/environments
1.  **GitLab.com Infrastructure**: https://gitlab.com/gitlab-com/infrastructure
1.  **GitLab CE**: https://gitlab.com/gitlab-org/gitlab-ce

## Preparing for a Failover Run

Before a failover, the coordinator needs to log in to the deploy host:
* `deploy-01-sv-gprd.c.gitlab-production.internal` for production
* `deploy-01-sv-gstg.c.gitlab-staging-1.internal` for staging

Then carry out the following steps:

1.  **Set up `bin/source_vars`**: `test -f /opt/gitlab-migration/bin/source_vars || sudo cp /opt/gitlab-migration/bin/source_vars_template.sh /opt/gitlab-migration/bin/source_vars`
1.  **Configure `vi /opt/gitlab-migration/bin/source_vars`**: The variables are explained in the file. Since it contains secrets, this file should not be checked in (it is `.gitignore`'d).
1.  **Verify `/opt/gitlab-migration/bin/verify-failover-config`**: You should receive a message indicating success.
1.  **Set up the workflow issues**: Run `/opt/gitlab-migration/bin/start-failover-procedure.sh`. This will set up several issues in the issue tracker for performing the checks, failover, tests, etc.
    * Any variables in the template in the format `__VARIABLE__` will be substituted with their values from the `bin/source_vars` file, saving manual effort.
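
The `__VARIABLE__` substitution described above can be pictured with a minimal sketch. This is not the implementation used by `start-failover-procedure.sh`, which is not shown here. The function name `render_template` and the variables `FAILOVER_DATE` and `COORDINATOR` are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the code of start-failover-procedure.sh.
set -euo pipefail

# Hypothetical values, standing in for variables defined in bin/source_vars.
FAILOVER_DATE="2018-01-01"
COORDINATOR="@example-user"

# Read a template on stdin and replace each __NAME__ placeholder with the
# value of the shell variable NAME, for every NAME passed as an argument.
render_template() {
  local text name
  text="$(cat)"
  for name in "$@"; do
    text="${text//__${name}__/${!name}}"
  done
  printf '%s\n' "$text"
}

template='Failover on __FAILOVER_DATE__, coordinated by __COORDINATOR__'
rendered="$(printf '%s\n' "$template" | render_template FAILOVER_DATE COORDINATOR)"
echo "$rendered"
```

Listing the variable names explicitly keeps the sketch simple. Placeholders whose names are not passed to the function are left unchanged.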

### Migration scripts

1. Prepare the file `env_<environment>`, setting its environment variables to point to the correct hosts.
1. Step scripts are listed in the file `steps_<role>` in order of execution. To define failback steps, add the `_failback` suffix to the role.
1. To run the runbook script menu, use the `migration` script:

```shell
bash bin/migration <environment> <role>
```