Commit 29a33872 authored by Andrew Newdigate

Moving towards simplifying the alerts.

In future we want to page via PagerDuty on critical alerts, not on alerts
that are explicitly labelled `pager=pagerduty`. This is a first step in that
direction: PagerDuty alerts are critical alerts, and critical alerts are
PagerDuty alerts.

As a next step, we can start routing alerts to PagerDuty whenever the
severity is critical and drop the `pager` label altogether.
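
Below is a minimal sketch of what that follow-up routing could look like in
the Alertmanager configuration; the receiver names and the surrounding route
tree are assumptions for illustration, not the current production config:

route:
  receiver: slack                  # assumed default (non-paging) receiver
  routes:
  # Once every paging alert carries severity=critical, this single route
  # can replace matching on the pager=pagerduty label.
  - match:
      severity: critical
    receiver: pagerduty            # assumed PagerDuty receiver name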
parent 19c24edd
@@ -7,6 +7,7 @@ groups:
     labels:
       channel: production
       severity: critical
+      pager: pagerduty
     annotations:
       description: ProcessCommitWorker sidekiq jobs are piling up for the last 10
         minutes, this may be under control, but I'm just letting you know that this
@@ -7,7 +7,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
     annotations:
       description: |
         {{ $labels.type }} has lost redundancy. Only {{ $value }}% of servers are online.
@@ -37,6 +37,7 @@ groups:
       > 1500
     for: 5m
     labels:
+      pager: pagerduty
       severity: critical
     annotations:
       description: This might be causing a slowdown on the site and/or affecting users.
@@ -58,6 +59,7 @@ groups:
     for: 5m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: This might be causing a slowdown on the site and/or affecting users.
         Please check the Triage Dashboard in Grafana.
@@ -6,6 +6,7 @@ groups:
     for: 1m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: This usually means that we lost an NFS mount somewhere in the fleet,
         check https://prometheus.gitlab.com/graph?g0.range_input=1d&g0.expr=rate(rails_git_no_repository_for_such_path%5B1m%5D)%20%3E%200.001&g0.tab=0
@@ -20,6 +20,7 @@ groups:
     expr: gitlab_com:last_wale_basebackup_age_in_hours >= 48
     for: 5m
     labels:
+      pager: pagerduty
       severity: critical
     annotations:
       description: WALE basebackup syncs to S3 might be not working. Please follow the runbook
@@ -6,6 +6,7 @@ groups:
     for: 30s
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: Prometheus instance for monitor.gitlab.net is down. Please check
         the Prometheus service.
@@ -17,6 +18,7 @@ groups:
     for: 30s
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: monitor.gitlab.net is down. nginx or/and grafana services are down.
       runbook: troubleshooting/monitor-gitlab-net-not-accessible.md
@@ -67,7 +67,7 @@ groups:
     expr: rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3
     for: 5m
     labels:
-      severity: warn
+      severity: critical
       pager: pagerduty
       channel: database
     annotations:
@@ -101,7 +101,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -113,7 +113,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -125,7 +125,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -139,7 +139,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -151,7 +151,7 @@ groups:
     for: 10m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: XLOG is being generated at a rate of {{ $value | humanize1024 }}B/s
@@ -229,7 +229,7 @@ groups:
     for: 30m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: "The dead tuple ratio of {{$labels.relname}} is greater than 5%"
@@ -350,6 +350,7 @@ groups:
   - alert: PostgreSQL_PGBouncer_maxclient_conn
     expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0
     labels:
+      pager: pagerduty
       severity: critical
       channel: database
     annotations:
@@ -362,7 +363,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: >
@@ -7,6 +7,7 @@ groups:
     for: 5m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: GitLab sentry has been down for 1 minute!
       runbook: troubleshooting/sentry-is-down.md
@@ -6,6 +6,7 @@ groups:
     for: 20m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: Sidekiq jobs are piling up for the last minute, this may be under
         control, but I'm just letting you know that this is going on, check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats.
@@ -17,6 +18,7 @@ groups:
     for: 20m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: The new_note sidekiq queue is piling up. This likely means that users are not receiving email notifications
         for comments on issues.
@@ -27,6 +29,7 @@ groups:
     for: 1h
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: There have been over queued 2000 Sidekiq jobs for the last hour.
         Check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats. Note that
@@ -7,6 +7,7 @@ groups:
     for: 30m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: Check SSL for specified nodes and consider reissuing certificate.
       runbook: troubleshooting/ssl_cert.md
groups:
- name: testing.rules
  rules:
  - alert: HawaiianHugs24
    expr: node_load1{job="node",type="git"} > 1000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "This one we should really worry about, really... it's git"
      node_load: '{{ $value }}'
      something_else: "this annotation will be common to all, so it should be shown anyway"
      title: This is the git nodes doing things
  - alert: HawaiianHugs25
    expr: node_load1{job="node",type="api"} > 1000
    for: 1m
    labels:
      severity: warn
    annotations:
      description: 'Nothing to worry about, at least not too much'
      node_load: '{{ $value }}'
      title: Alerting is hard
      something_else: This is the git nodes doing things
@@ -21,6 +21,14 @@ def validate_rule(alert_file_path, rule)
   raise StandardError, " #{alert}: rules must contain a `severity` label" unless labels["severity"]
   raise StandardError, " #{alert}: rules contains an invalid `severity` label: #{labels["severity"]}" unless ["info", "warn", "error", "critical"].include?(labels["severity"])
+  if labels["pager"]
+    raise StandardError, " #{alert}: rules contains an invalid `pager` label: #{labels["pager"]}" unless labels["pager"] == "pagerduty"
+    raise StandardError, " #{alert}: only severity critical errors should page" unless labels["severity"] == "critical"
+  else
+    raise StandardError, " #{alert}: critical alerts should be configured to send to pagerduty" if labels["severity"] == "critical"
+  end
 end

 def validate_group(alert_file_path, group)
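
For illustration, these are the label combinations the updated validation now
accepts and rejects (hypothetical snippets, not taken from the rules files):

# accepted: a paging alert must be critical, and a critical alert must page
labels:
  pager: pagerduty
  severity: critical

# rejected: "only severity critical errors should page"
labels:
  pager: pagerduty
  severity: warn

# rejected: "critical alerts should be configured to send to pagerduty"
labels:
  severity: critical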