Commit 027ed03a authored by Andrew Newdigate

Alerts off the back of 553

parent 06ce9cb6
groups:
- name: workhorse.rules
  rules:
  - alert: gitlab_workhorse_git_http_sessions_active_out_of_bounds_lower_5m
    expr: |
      gitlab_workhorse_git_http_sessions_active:total
      <
      gitlab_workhorse_git_http_sessions_active:total:avg_over_time_1w - 3 * gitlab_workhorse_git_http_sessions_active:total:stddev_over_time_1w
    for: 5m
    labels:
      rules_domain: general
      metric: gitlab_workhorse_git_http_sessions_active:total
      severity: warn
      period: 5m
      bound: lower
      threshold_sigma: "3"
    annotations:
      description: |
        This may be caused by an upstream load-balancer issue, DNS issue or authentication issues.
      runbook: "troubleshooting/workhorse-git-session-alerts.md"
      title: "The number of active Git HTTP sessions is unusually low"
      grafana_dashboard_id: "jIYYw9-ik/gitlab-workhorse-alerting"
      grafana_panel_id: "2"
      grafana_variables: "environment,type,stage"
      grafana_min_zoom_hours: 12
  - alert: gitlab_workhorse_git_http_sessions_active_out_of_bounds_upper_5m
    expr: |
      gitlab_workhorse_git_http_sessions_active:total
      >
      gitlab_workhorse_git_http_sessions_active:total:avg_over_time_1w + 3 * gitlab_workhorse_git_http_sessions_active:total:stddev_over_time_1w
    for: 5m
    labels:
      rules_domain: general
      metric: gitlab_workhorse_git_http_sessions_active:total
      severity: warn
      period: 5m
      bound: upper
      threshold_sigma: "3"
    annotations:
      description: |
        The number of Git HTTP sessions is unusually high. This could be because of a surge in traffic,
        a bottleneck in an upstream load-balancer, or a slow backend Gitaly server.
      runbook: "troubleshooting/workhorse-git-session-alerts.md"
      title: "The number of active Git HTTP sessions is unusually high"
      grafana_dashboard_id: "jIYYw9-ik/gitlab-workhorse-alerting"
      grafana_panel_id: "2"
      grafana_variables: "environment,type,stage"
      grafana_min_zoom_hours: 12
groups:
- name: GitLab Workhorse Git HTTP Session Count
  interval: 1m
  rules:
  - record: gitlab_workhorse_git_http_sessions_active:total
    labels:
      type: git
      tier: sv
    expr: >
      sum(avg_over_time(gitlab_workhorse_git_http_sessions_active{type="git", tier="sv"}[1m])) by (environment, stage, tier, type)
  - record: gitlab_workhorse_git_http_sessions_active:total:avg_over_time_1w
    labels:
      type: git
      tier: sv
    expr: >
      avg_over_time(gitlab_workhorse_git_http_sessions_active:total[1w])
  - record: gitlab_workhorse_git_http_sessions_active:total:stddev_over_time_1w
    labels:
      type: git
      tier: sv
    expr: >
      stddev_over_time(gitlab_workhorse_git_http_sessions_active:total[1w])
# Workhorse Session Alerts
## Symptoms
![Workhorse HTTP](../img/workhorse-git-http-session-issues.png)
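
Both alerts compare the current session count against a band of three standard deviations around its one-week average. The block below is a minimal Python sketch of that comparison, not the Prometheus implementation: the function name `classify_session_count` and the sample history are hypothetical, and it assumes the population standard deviation is the relevant measure of spread.

```python
from statistics import mean, pstdev


def classify_session_count(current, history, sigma=3.0):
    """Return "lower" or "upper" if `current` leaves the sigma band, else None.

    `history` stands in for the one-week window of session totals.
    """
    avg = mean(history)
    spread = pstdev(history)  # population standard deviation (assumption)
    if current < avg - sigma * spread:
        return "lower"
    if current > avg + sigma * spread:
        return "upper"
    return None


# Hypothetical history of session totals, for illustration only.
history = [100, 110, 90, 105, 95]
print(classify_session_count(40, history))
```

The real alerts additionally require the condition to hold for 5 minutes (`for: 5m`), which this sketch does not model.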
## Possible checks
* HAProxy may not be able to keep up
* Gitaly is overloaded
* Gitaly rate-limiting issues
## Reference Issues
[**2018-11-06: Up to 15 minute delays on clones from GitLab repositories, including www-gitlab-com, gitlab-ee, gitlab-ce**](https://gitlab.com/gitlab-com/gl-infra/production/issues/553)
* An S2-level incident lasting 3 days led to disruption of git clones, in particular for `www-gitlab-com`, `gitlab-ee`, and `gitlab-ce`, although many others were affected too.
* Diagnosis went around in circles:
  * Initially targeted abuse
  * High CI activity rates
  * Workhorse throughput
  * Network issues
  * Gitaly concurrency limits (which had contributed)
* Smoking gun: it was not only git clones that were slow; artifact downloads against S3 had also skyrocketed in latency
* Testing git clones and artifact downloads via https://gitlab.com, then against the front-end load balancers, then against Workhorse, helped us pinpoint the issue to the HAProxy front-end fleet.
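
The layer-by-layer comparison described above can be sketched as a small helper. This is an illustration only: `time_layers`, `slow_layers`, the layer names, and the threshold are hypothetical, and the `fetch` callable stands in for whatever request is being tested (a git clone or an artifact download).

```python
import time


def time_layers(layers, fetch, clock=time.monotonic):
    """Time fetch(target) for each (name, target) pair, outermost layer first."""
    timings = {}
    for name, target in layers:
        start = clock()
        fetch(target)
        timings[name] = clock() - start
    return timings


def slow_layers(timings, threshold_seconds):
    """Return the names of layers whose request exceeded the threshold."""
    return [name for name, seconds in timings.items() if seconds > threshold_seconds]
```

If requests through the public endpoint and the load balancers are slow while the same request made directly against Workhorse is fast, the problem most likely sits in the load-balancer layer.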