Commit ab893e20 authored by Andrew Newdigate

Better alert descriptions

parent 40ad16df
groups:
- name: service_availability.rules
rules:
# Availability above 2 sigma
# Availability below 2 sigma
- alert: service_availability_out_of_bounds_lower_2sigma_5m
expr: |
gitlab_service_availability:ratio
@@ -17,12 +17,18 @@ groups:
threshold_sigma: "2.5"
annotations:
description: |
Server is running outside of normal availability parameters
The ratio of services that are available to serve the `{{ $labels.type }}` service
is unusually low, and an unusually large proportion of the fleet is not responding.
This may be caused by the service crashing and restarting due to segfaults, memory pressure,
or application errors. Check error rate metrics, application logs, and Sentry for the root cause.
runbook: "troubleshooting/service-{{ $labels.type }}.md"
title: "{{ $labels.type }} service operation rate alert"
title: "The `{{ $labels.type }}` service is less available than normal"
grafana_dashboard_id: "WOtyonOiz/general-triage-service"
grafana_panel_id: "2"
grafana_variables: "environment,type"
link1_title: "Definition"
link1_url: "https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/definition-service-availability.md"
# Operation rate above 2 sigma
- alert: service_ops_out_of_bounds_upper_2sigma_5m
@@ -40,12 +46,19 @@ groups:
threshold_sigma: "2.5"
annotations:
description: |
Server is running outside of normal operation rate parameters
The `{{ $labels.type }}` service is receiving more requests than normal.
This is often caused by user-generated traffic, and sometimes by abuse. It can also be caused by
application changes that lead to higher operation rates, or by retries in the event of
errors. Check the abuse-reporting watches in ELK for possible abuse, and check
error rates (possibly on upstream services) for the root cause.
runbook: "troubleshooting/service-{{ $labels.type }}.md"
title: "{{ $labels.type }} service operation rate alert"
title: "The `{{ $labels.type }}` service is receiving more requests than normal"
grafana_dashboard_id: "WOtyonOiz/general-triage-service"
grafana_panel_id: "12"
grafana_variables: "environment,type"
link1_title: "Definition"
link1_url: "https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/definition-service-ops-rate.md"
# Operation rate below 2 sigma
- alert: service_ops_out_of_bounds_lower_2sigma_5m
@@ -63,12 +76,18 @@ groups:
threshold_sigma: "2.5"
annotations:
description: |
Server is running outside of normal operation rate parameters
The `{{ $labels.type }}` service is receiving fewer requests than normal.
This is often caused by a failure in an upstream service: for example, an upstream load balancer rejecting
all incoming traffic. In many cases, this is as serious as, or more serious than, a traffic spike. Check
upstream services for errors that may be leading to traffic flow issues in downstream services.
runbook: "troubleshooting/service-{{ $labels.type }}.md"
title: "{{ $labels.type }} service operation rate alert"
title: "The `{{ $labels.type }}` service is receiving fewer requests than normal"
grafana_dashboard_id: "WOtyonOiz/general-triage-service"
grafana_panel_id: "12"
grafana_variables: "environment,type"
link1_title: "Definition"
link1_url: "https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/definition-service-ops-rate.md"
# Apdex lower than 2 sigma
- alert: service_apdex_out_of_bounds_lower_2sigma_5m
@@ -86,12 +105,19 @@ groups:
threshold_sigma: "2.5"
annotations:
description: |
Server is running outside of normal apdex parameters
The `{{ $labels.type }}` service is operating at a slower rate than normal.
The service is taking longer to respond to requests than usual. This could be caused by
user abuse, by application changes in upstream services that lead to higher request rates or
slower requests, or by a slowdown in downstream services. Check operation rates and error rates
in upstream and downstream services, and check ELK for abuse.
runbook: "troubleshooting/service-{{ $labels.type }}.md"
title: "{{ $labels.type }} service apdex alert"
title: "The `{{ $labels.type }}` service is operating at a slower rate than normal"
grafana_dashboard_id: "WOtyonOiz/general-triage-service"
grafana_panel_id: "16"
grafana_variables: "environment,type"
link1_title: "Definition"
link1_url: "https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/definition-service-apdex.md"
# Error rate exceeds 2 sigma
- alert: service_errors_out_of_bounds_upper_2sigma_5m
@@ -109,9 +135,15 @@ groups:
threshold_sigma: "2.5"
annotations:
description: |
Server is running outside of normal error parameters
The `{{ $labels.type }}` service is generating more errors than normal.
The service is generating more errors than usual. This could be caused by application changes,
downstream service failures, or user-invoked failures.
Review Sentry errors, ELK, and downstream service alerts.
runbook: "troubleshooting/service-{{ $labels.type }}.md"
title: "{{ $labels.type }} service error alert"
grafana_dashboard_id: "WOtyonOiz/general-triage-service"
grafana_panel_id: "24"
grafana_variables: "environment,type"
link1_title: "Definition"
link1_url: "https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/definition-service-errors.md"
# Service Apdex
The apdex score for a service is a measure of its relative performance.
Our apdex scoring is loosely based on NewRelic's apdex scoring: https://docs.newrelic.com/docs/apm/new-relic-apm/apdex/apdex-measure-user-satisfaction
For a given service, we define two latency values:
* **Satisfied**: this is the target latency for incoming requests to this service.
* **Tolerated**: this is an acceptable, tolerated latency.
The apdex score for a service is measured as:
```
(satisfied requests) + (tolerated requests)/2
---------------------------------------------
total number of requests
```
Note, importantly, that requests are only counted as satisfied or tolerated if they succeed and do not fail.
This implies that the apdex score includes an error-rate factor: as more requests fail, the apdex score will tend towards zero,
no matter the latency of the failures.
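The formula above can be sketched directly. A minimal illustration (the request counts are invented for the example, and `satisfied`/`tolerated` only count successful requests, as noted above):

```python
def apdex(satisfied: int, tolerated: int, total: int) -> float:
    """Apdex = (satisfied + tolerated / 2) / total requests.

    Only successful requests count as satisfied or tolerated, so failed
    requests drag the score towards zero regardless of their latency.
    """
    if total == 0:
        return 1.0  # no traffic: nothing to be dissatisfied about
    return (satisfied + tolerated / 2) / total

# 70 satisfied + 20 tolerated + 10 failed requests, out of 100 total:
score = apdex(satisfied=70, tolerated=20, total=100)
# (70 + 20 / 2) / 100 = 0.8
```

Note how the 10 failed requests lower the score even though their latency never enters the calculation.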
## Determining apdex
The apdex score for a service depends on the service exporting Prometheus latency histograms. For this reason, we do not currently have
apdex scores for Postgres or Redis.
Additionally, for some services, such as Gitaly, the apdex is based on a subset of all requests. For example, in Gitaly, the GC
request latency is a function of repository size and time since last GC. In normal operation, this call may take up to 30 minutes.
Including this in the apdex score is unhelpful and does not provide insight into the state of the service, so it is excluded from
the metric.
## Service Apdex Definitions
The definitions of service apdex can be found in https://gitlab.com/gitlab-com/runbooks/blob/master/recordings/service_apdex.yml
# Service Availability
At GitLab, we define the availability of a service as the ratio: `the number of instances of a service reporting as healthy` / `the expected number of instances of that service`.
It is a measure of the health-check status of a service.
For example, if we expect there to be 20 `unicorn` processes in the `web` fleet and 15 are available and reporting as healthy, then the availability of the `unicorn` _component_ of the `web` _service_ is 0.75, or 75%.
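The `unicorn` example above works out as follows. A minimal sketch of the ratio itself, not of how the recording rules compute it in Prometheus:

```python
def availability(healthy: int, expected: int) -> float:
    """Ratio of instances reporting healthy to the expected instance count."""
    if expected <= 0:
        raise ValueError("expected instance count must be positive")
    return healthy / expected

# 15 of an expected 20 unicorn processes are reporting as healthy:
ratio = availability(healthy=15, expected=20)
# 15 / 20 = 0.75, i.e. 75%
```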
## Determining availability
This is usually done in one of three ways:
1. For services which host their own Prometheus metrics endpoint internally, we usually rely on the status of the `up` metric for the service. This endpoint is (by design) a useful health-check onto a process. In order to have `up{..}=1` the service needs to be running, listening on the port and correctly responding to incoming requests.
1. For services that use sidecar Prometheus exporter processes, we rely on metrics they export. For example, in the case of Redis, we rely on the `redis_up` metric exported by that process.
1. For services that provide neither a Prometheus exporter sidecar nor an internal scrape endpoint, we may rely on an external service health check: for example, HAProxy metrics can provide insight into the status of a service. This is the least desirable approach.
## Service Availability Definitions
The definitions of service availability can be found in https://gitlab.com/gitlab-com/runbooks/blob/master/recordings/service_availability.yml
# Service Error Rate
The error rate for a service is a measure of how many errors that service is generating per second.
Note that the error rate of a service is the sum of the error rates of each component within that service, so the
metric should be considered relative to its historical value, rather than as an absolute number.
This is probably best explained with an example: the `web` service comprises `unicorn`, `workhorse` and `nginx` components.
A single error in the `unicorn` component may bubble up and be reported as three `500` errors: one in `unicorn`, one in `workhorse` and one in `nginx`. The
error rate of the service is the sum of these values, so a single error bubbling up through the layers is reported as 3.
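The summation above can be sketched as follows. The component names follow the `web` service example; the per-component rates are illustrative:

```python
# Per-component error rates (errors/second) for the `web` service.
# One error in unicorn bubbles up through workhorse and nginx,
# so each layer reports it once.
component_error_rates = {
    "unicorn": 1.0,
    "workhorse": 1.0,
    "nginx": 1.0,
}

# The service-level error rate is the sum of its components: a single
# underlying error is therefore counted as 3.
service_error_rate = sum(component_error_rates.values())
```

This is why the metric is only meaningful relative to its own history, not as an absolute error count.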
## Service Error Rate Definitions
The definitions of the service error rate can be found in https://gitlab.com/gitlab-com/runbooks/blob/master/recordings/service_error_rate.yml
# Service Operation Rate
The operation rate of a service is a measure of how many requests the service is having to handle per second.
Note that the operation rate of a service is the sum of the operation rates of each component within that service, so the
metric should be considered relative to its historical value, rather than as an absolute number.
This is probably best explained with an example: the `web` service comprises `unicorn`, `workhorse` and `nginx` components.
A single user request may create one request to `nginx`, one request to `workhorse` and one request to `unicorn`. The operation rate will
therefore reflect three requests, rather than one.
Since each component reports its metrics separately, it is easier to handle things in this manner than to attempt to correlate multiple
sources to a single request.
## Service Operation Rate Definitions
The definitions of the service operation rate can be found in https://gitlab.com/gitlab-com/runbooks/blob/master/recordings/service_ops_rate.yml