Commit 29a33872 authored by Andrew Newdigate

Moving towards simplifying the alerts.

In future we want to page via PagerDuty on critical alerts, not on alerts
that happen to have `pager=pagerduty` set. This is a first step in that
direction: PagerDuty alerts are critical alerts, and critical alerts are
PagerDuty alerts.

As a next step, we can simply start routing alerts to PagerDuty whenever
severity is critical and drop the `pager` label altogether.
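
For illustration, the eventual Alertmanager route could then match on severity
alone instead of the `pager` label. A minimal sketch, assuming a receiver named
`pagerduty` is already defined in the Alertmanager config (not part of this
commit):

    route:
      routes:
      - match:
          severity: critical
        receiver: pagerduty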
parent 19c24edd
@@ -7,6 +7,7 @@ groups:
     labels:
       channel: production
       severity: critical
+      pager: pagerduty
     annotations:
       description: ProcessCommitWorker sidekiq jobs are piling up for the last 10
         minutes, this may be under control, but I'm just letting you know that this
......
@@ -7,7 +7,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
     annotations:
       description: |
         {{ $labels.type }} has lost redundancy. Only {{ $value }}% of servers are online.
......
@@ -37,6 +37,7 @@ groups:
       > 1500
     for: 5m
     labels:
+      pager: pagerduty
       severity: critical
     annotations:
       description: This might be causing a slowdown on the site and/or affecting users.
@@ -58,6 +59,7 @@ groups:
     for: 5m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: This might be causing a slowdown on the site and/or affecting users.
         Please check the Triage Dashboard in Grafana.
......
@@ -6,6 +6,7 @@ groups:
     for: 1m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: This usually means that we lost an NFS mount somewhere in the fleet,
         check https://prometheus.gitlab.com/graph?g0.range_input=1d&g0.expr=rate(rails_git_no_repository_for_such_path%5B1m%5D)%20%3E%200.001&g0.tab=0
......
@@ -20,6 +20,7 @@ groups:
     expr: gitlab_com:last_wale_basebackup_age_in_hours >= 48
     for: 5m
     labels:
+      pager: pagerduty
       severity: critical
     annotations:
       description: WALE basebackup syncs to S3 might be not working. Please follow the runbook
......
@@ -6,6 +6,7 @@ groups:
     for: 30s
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: Prometheus instance for monitor.gitlab.net is down. Please check
         the Prometheus service.
@@ -17,6 +18,7 @@ groups:
     for: 30s
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: monitor.gitlab.net is down. nginx or/and grafana services are down.
       runbook: troubleshooting/monitor-gitlab-net-not-accessible.md
......
@@ -67,7 +67,7 @@ groups:
     expr: rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3
     for: 5m
     labels:
-      severity: warn
+      severity: critical
       pager: pagerduty
       channel: database
     annotations:
@@ -101,7 +101,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -113,7 +113,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -125,7 +125,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -139,7 +139,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: Replication lag on server {{$labels.instance}} is currently {{
@@ -151,7 +151,7 @@ groups:
     for: 10m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: XLOG is being generated at a rate of {{ $value | humanize1024 }}B/s
@@ -229,7 +229,7 @@ groups:
     for: 30m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: "The dead tuple ratio of {{$labels.relname}} is greater than 5%"
@@ -350,6 +350,7 @@ groups:
   - alert: PostgreSQL_PGBouncer_maxclient_conn
     expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0
     labels:
+      pager: pagerduty
       severity: critical
       channel: database
     annotations:
@@ -362,7 +363,7 @@ groups:
     for: 5m
     labels:
       pager: pagerduty
-      severity: warn
+      severity: critical
       channel: database
     annotations:
       description: >
......
@@ -7,6 +7,7 @@ groups:
     for: 5m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: GitLab sentry has been down for 1 minute!
       runbook: troubleshooting/sentry-is-down.md
......
@@ -6,6 +6,7 @@ groups:
     for: 20m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: Sidekiq jobs are piling up for the last minute, this may be under
         control, but I'm just letting you know that this is going on, check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats.
@@ -17,6 +18,7 @@ groups:
     for: 20m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: The new_note sidekiq queue is piling up. This likely means that users are not receiving email notifications
         for comments on issues.
@@ -27,6 +29,7 @@ groups:
     for: 1h
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: There have been over 2000 queued Sidekiq jobs for the last hour.
         Check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats. Note that
......
@@ -7,6 +7,7 @@ groups:
     for: 30m
     labels:
       severity: critical
+      pager: pagerduty
     annotations:
       description: Check SSL for specified nodes and consider reissuing certificate.
       runbook: troubleshooting/ssl_cert.md
......
groups:
- name: testing.rules
  rules:
  - alert: HawaiianHugs24
    expr: node_load1{job="node",type="git"} > 1000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "This one we should really worry about, really... it's git"
      node_load: '{{ $value }}'
      something_else: "this annotation will be common to all, so it should be shown anyway"
      title: This is the git nodes doing things
  - alert: HawaiianHugs25
    expr: node_load1{job="node",type="api"} > 1000
    for: 1m
    labels:
      severity: warn
    annotations:
      description: 'Nothing to worry about, at least not too much'
      node_load: '{{ $value }}'
      title: Alerting is hard
      something_else: This is the git nodes doing things
@@ -21,6 +21,14 @@ def validate_rule(alert_file_path, rule)
   raise StandardError, " #{alert}: rules must contain a `severity` label" unless labels["severity"]
   raise StandardError, " #{alert}: rules contains an invalid `severity` label: #{labels["severity"]}" unless ["info", "warn", "error", "critical"].include?(labels["severity"])
+  if labels["pager"]
+    raise StandardError, " #{alert}: rules contains an invalid `pager` label: #{labels["pager"]}" unless labels["pager"] == "pagerduty"
+    raise StandardError, " #{alert}: only severity critical errors should page" unless labels["severity"] == "critical"
+  else
+    raise StandardError, " #{alert}: critical alerts should be configured to send to pagerduty" if labels["severity"] == "critical"
+  end
 end

 def validate_group(alert_file_path, group)
......
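
For reference, the validation above enforces the invariant in both directions:
a `pager` label must be `pagerduty` and may only appear on critical alerts, and
critical alerts must page. A few illustrative label combinations (not taken from
any alert file in this commit):

    # accepted: critical alerts page
    labels:
      severity: critical
      pager: pagerduty

    # rejected: critical but no pager label
    labels:
      severity: critical

    # rejected: pages but is not critical
    labels:
      severity: warn
      pager: pagerduty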