Commit 40ad16df authored by John Jarvis

Add alert rules and runbook for haproxy connection errors.

parent 364d1ee0
@@ -3,7 +3,6 @@ groups:
rules:
- record: environment_type:rails_request_errors:ratio
expr: sum by (type,environment)(rate(rails_requests_completed{status=~"5.."}[1m])) / sum by (type,environment)(rate(rails_requests_completed[1m]))
- alert: HighRailsErrorRate
expr: environment_type:rails_request_errors:ratio * 100 > .5
for: 30s
@@ -14,44 +13,3 @@ groups:
description: Rails is returning 5xx errors at a high rate for {{ $labels.type }}. Traffic is impacted and users are likely seeing 500 errors.
runbook: troubleshooting/high-error-rate.md
title: High Rails Error Rate on Front End
- alert: HighWebErrorRate
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{backend="web",code="5xx",tier="lb"})
- sum(backend_code:haproxy_server_http_responses_total:irate1m{backend="web",code!="5xx",tier="lb"})
> 0
for: 15s
labels:
pager: pagerduty
severity: critical
annotations:
description: We are serving more 5xx responses than all other responses combined. Web
traffic is being impacted and the service is probably down. Have you thought
about turning it off and on again?
runbook: troubleshooting/gitlab-com-is-down.md
title: High Error Rate on Front End Web
- alert: High4xxApiRateLimit
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{backend="api_rate_limit",code="4xx",tier="lb"})
/ sum(backend_code:haproxy_server_http_responses_total:irate1m{tier="lb"})
> 0.1
for: 5m
labels:
pager: pagerduty
severity: critical
annotations:
description: We are seeing an increase in 4xx errors on the api_rate_limit load-balancer
backend: more than 10% of HTTP requests for at least 5 minutes.
runbook: troubleshooting
title: High 4xx Error Rate on Front End Web on backend api_rate_limit
- alert: High4xxRateLimit
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{backend!="registry",code="4xx",tier="lb"})
/ sum(backend_code:haproxy_server_http_responses_total:irate1m{backend!="registry",tier="lb"})
> 0.25
for: 5m
labels:
pager: pagerduty
severity: critical
annotations:
description: We are seeing an increase in 4xx errors on the load balancers across
all backends: more than 25% of HTTP requests for at least 5 minutes.
runbook: troubleshooting
title: High 4xx Error Rate on Front End Web
groups:
- name: increased-error-rates.rules
rules:
- alert: IncreasedErrorRateHTTPSGit
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{code="5xx",tier="lb",backend="https_git"}) > 20
for: 15s
labels:
severity: critical
annotations:
description: We are seeing a high rate of 5xx errors on the https_git backend. Customers are likely impacted.
runbook: troubleshooting/high-error-rate.md
title: Increased Error Rate Across Fleet
- alert: IncreasedErrorRateOtherBackends
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{code="5xx",tier="lb",backend!="https_git"}) by (backend) > 20
for: 15s
labels:
severity: critical
annotations:
description: We are seeing a high rate of 5xx errors across other backends (web(sockets)?/api/registry/etc., anything except https_git). Customers are likely impacted.
runbook: troubleshooting/high-error-rate.md
title: Increased Error Rate Across Fleet
groups:
- name: haproxy.rules
rules:
- alert: HighWebErrorRate
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{backend="web",code="5xx",tier="lb"})
- sum(backend_code:haproxy_server_http_responses_total:irate1m{backend="web",code!="5xx",tier="lb"})
> 0
for: 15s
labels:
pager: pagerduty
severity: critical
annotations:
description: We are serving more 5xx responses than all other responses combined. Web
traffic is being impacted and the service is probably down. Have you thought
about turning it off and on again?
runbook: troubleshooting/haproxy.md
title: High Error Rate on Front End Web
- alert: High4xxApiRateLimit
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{backend="api_rate_limit",code="4xx",tier="lb"})
/ sum(backend_code:haproxy_server_http_responses_total:irate1m{tier="lb"})
> 0.1
for: 5m
labels:
pager: pagerduty
severity: critical
annotations:
description: We are seeing an increase in 4xx errors on the api_rate_limit load-balancer
backend: more than 10% of HTTP requests for at least 5 minutes.
runbook: troubleshooting/haproxy.md
title: High 4xx Error Rate on Front End Web on backend api_rate_limit
- alert: High4xxRateLimit
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{backend!="registry",code="4xx",tier="lb"})
/ sum(backend_code:haproxy_server_http_responses_total:irate1m{backend!="registry",tier="lb"})
> 0.25
for: 5m
labels:
pager: pagerduty
severity: critical
annotations:
description: We are seeing an increase in 4xx errors on the load balancers across
all backends: more than 25% of HTTP requests for at least 5 minutes.
runbook: troubleshooting/haproxy.md
title: High 4xx Error Rate on Front End Web
- alert: IncreasedErrorRateHTTPSGit
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{code="5xx",tier="lb",backend="https_git"}) > 20
for: 15s
labels:
severity: critical
annotations:
description: We are seeing a high rate of 5xx errors on the https_git backend. Customers are likely impacted.
runbook: troubleshooting/high-error-rate.md
title: Increased Error Rate Across Fleet
- alert: IncreasedErrorRateOtherBackends
expr: sum(backend_code:haproxy_server_http_responses_total:irate1m{code="5xx",tier="lb",backend!="https_git"}) by (backend) > 20
for: 15s
labels:
severity: critical
annotations:
description: We are seeing a high rate of 5xx errors across other backends (web(sockets)?/api/registry/etc., anything except https_git). Customers are likely impacted.
runbook: troubleshooting/high-error-rate.md
title: Increased Error Rate Across Fleet
- alert: IncreasedBackendConnectionErrors
expr: rate(haproxy_backend_connection_errors_total[1m]) > .1
for: 10s
labels:
pager: pagerduty
severity: critical
annotations:
description: We are seeing an increase in backend connection errors on {{$labels.fqdn}} for backend {{$labels.backend}}.
This likely indicates that requests are being sent to servers in a backend that are unable to accept them,
resulting in connection errors.
runbook: troubleshooting/haproxy.md
title: Increased HAProxy Backend Connection Errors
- alert: IncreasedServerResponseErrors
expr: rate(haproxy_server_response_errors_total[1m]) > .5
for: 10s
labels:
severity: critical
annotations:
description: We are seeing an increase in server response errors on {{$labels.fqdn}} for backend/server {{$labels.backend}}/{{$labels.server}}.
This likely indicates that the servers are responding with errors that are being reported to users.
runbook: troubleshooting/haproxy.md
title: Increased Server Response Errors
- alert: IncreasedServerConnectionErrors
expr: rate(haproxy_server_connection_errors_total[1m]) > .1
for: 10s
labels:
pager: pagerduty
severity: critical
annotations:
description: We are seeing an increase in server connection errors on {{$labels.fqdn}} for backend/server {{$labels.backend}}/{{$labels.server}}.
This likely indicates that connections to the servers are failing and errors are being reported to users.
runbook: troubleshooting/haproxy.md
title: Increased Server Connection Errors
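Before deploying, the rule file above can be syntax-checked locally with `promtool`; a minimal sketch, assuming the new rules are saved as `haproxy.rules.yml` (the filename is an assumption, substitute the path used in this repo):

```sh
# Validate the rule group syntax with promtool, which ships with Prometheus 2.x.
# haproxy.rules.yml is a hypothetical filename for the rules added above.
promtool check rules haproxy.rules.yml
```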
# HAProxy Alert Troubleshooting
## First and foremost
*Don't Panic*
## Reason
* Errors are being reported by HAProxy. This could be a spike in 5xx errors,
server connection errors, or backends reporting unhealthy.
## Prechecks
* Examine the health of all backends and the HAProxy dashboards (a query sketch follows this list):
* HAProxy - https://dashboards.gitlab.net/d/ZOOh_aNik/haproxy
* HAProxy Backend Status - https://dashboards.gitlab.net/d/7Zq1euZmz/haproxy-status?orgId=1
* Is the alert specific to canary servers or the canary backend? Check canaries
to ensure they are reporting OK. If this is the cause, you should immediately change the weight of canary traffic.
* Canary dashboard - https://dashboards.gitlab.net/d/llfd4b2ik/canary
* Canary howto - https://gitlab.com/gitlab-com/runbooks/blob/master/howto/canary.md
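As a quick check outside the dashboards, the same metrics the alerts use can be queried directly against the Prometheus HTTP API; a minimal sketch, assuming the classic haproxy_exporter metrics and a reachable Prometheus instance (the hostname below is a placeholder):

```sh
# prometheus.example.gitlab.net is a placeholder; point this at the environment's Prometheus.
PROM=https://prometheus.example.gitlab.net

# Backends on the lb tier with no servers reporting up (assumes the haproxy_server_up metric).
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=sum by (backend) (haproxy_server_up{tier="lb"}) == 0'

# Current 5xx rate per backend, using the recording rule referenced by the alerts above.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=sum by (backend) (backend_code:haproxy_server_http_responses_total:irate1m{code="5xx",tier="lb"})'
```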
## Resolution
* If a single backend server is alerting, check whether the node is healthy on
the host status dashboard. In some cases, most notably the git servers,
a server can reject connections even though it is reporting healthy.
* On the server, check the health of the service with `gitlab-ctl status`.
* For git servers, check the status of SSH with `service sshd_git status`.
* HAProxy logs are not currently being sent to ELK because of capacity issues.
These logs can be viewed in Stackdriver. Production logs can be viewed using this [direct link](https://console.cloud.google.com/logs/viewer?project=gitlab-production&authuser=1&minLogLevel=0&expandAll=false&timestamp=2018-10-08T07:43:05.667000000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2018-10-08T06:43:05.918Z&dateRangeEnd=2018-10-08T07:43:05.918Z&interval=PT1H&resource=gce_instance&scrollTimestamp=2018-10-08T07:42:43.008000000Z&logName=projects%2Fgitlab-production%2Flogs%2Fhaproxy); a `gcloud` query sketch follows this list.
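A minimal sketch of pulling recent HAProxy logs from Stackdriver with the `gcloud` CLI rather than the console, assuming access to the gitlab-production project (the filter mirrors the direct link above and may need adjusting):

```sh
# Read the last hour of HAProxy logs from Google Cloud Logging (Stackdriver).
# The log name comes from the direct link above; tighten the filter as needed.
gcloud logging read \
  'logName="projects/gitlab-production/logs/haproxy" AND resource.type="gce_instance"' \
  --project=gitlab-production --freshness=1h --limit=100
```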