- 08 Nov, 2018 1 commit
-
-
Andrew Newdigate authored
-
- 06 Nov, 2018 1 commit
-
-
John T Skarbek authored
* This is a bit tricky: each Consul node reports that it knows about the primary node
* So this alert assumes that there will be quorum
* We check whether 0 nodes are reporting `passing`
* I linked to the main PostgreSQL document because troubleshooting this is quite tricky
* Closes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5358
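
The zero-`passing` check described in this commit message can be sketched roughly as follows. This is only an illustration, not the actual alert rule: the function name and the idea of reducing each Consul node's view to a plain status string are assumptions made for the example.

```python
def primary_alert_should_fire(reported_statuses):
    """Return True when no Consul node reports the primary as `passing`.

    `reported_statuses` is a hypothetical list with one status string per
    Consul node (for example "passing" or "critical"). An empty list also
    counts as zero nodes reporting `passing`.
    """
    passing = sum(1 for status in reported_statuses if status == "passing")
    return passing == 0
```

With quorum, at least one node is expected to report `passing` while the primary is healthy, so the sketch only fires when that count drops to zero.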
-
- 05 Nov, 2018 4 commits
-
-
Andrew Newdigate authored
useful. This MR also removes the concept of `info` criticality alerts which don't make much sense. The `postgres:up` recording relied on `info` alerts but was the only thing to do so.
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Hendrik Meyer authored
-
- 02 Nov, 2018 3 commits
-
-
John T Skarbek authored
-
John Skarbek authored
-
Andrew Newdigate authored
-
- 31 Oct, 2018 1 commit
-
-
Tomasz Maczukin authored
-
- 29 Oct, 2018 6 commits
-
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Ben Kochie authored
* Drop obsolete `node_cpu` metric recordings.
* Drop `CPUOutlierDetectionOnPrd`, which doesn't work due to missing recordings.
* Add new 1m rate recordings, with a 1m interval.
* Move CPU alerts to new metrics.
* Drop environment filter from CPU alerts.
* Drop 80% CPU threshold for "High CPU" to avoid alert noise.
* Move old 5m alerting to separate rule group.
-
- 25 Oct, 2018 3 commits
-
-
Ben Kochie authored
-
Ben Kochie authored
-
Ben Kochie authored
* Move haproxy recording rules into haproxy rule group.
* Move remaining 'recording.yml' into rule group for generic process metrics.
-
- 24 Oct, 2018 1 commit
-
-
Andrew Newdigate authored
-
- 23 Oct, 2018 4 commits
-
-
Dave Smith (.org) authored
-
Dave Smith (.org) authored
-
Andrew Newdigate authored
In future we want to page via PagerDuty on critical alerts, not on alerts that are set to `pager=pagerduty`. This is a first step in that direction. PagerDuty alerts are critical alerts. Critical alerts are PagerDuty alerts. As a next step, we can simply start routing alerts to PagerDuty when severity is critical and drop the `pager` attribute altogether.
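
The intended end state described in this message, routing on severity alone, might look like the sketch below. It is an assumption-laden illustration rather than the real Alertmanager configuration: the function name and the `default` receiver name are hypothetical.

```python
def receiver_for(labels):
    """Pick a receiver from an alert's labels, using severity only.

    Assumed end state from the commit message: the `pager` label is
    ignored, and any alert with `severity=critical` goes to PagerDuty.
    The "default" receiver name is a placeholder for everything else.
    """
    if labels.get("severity") == "critical":
        return "pagerduty"
    return "default"
```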
-
Andrew Newdigate authored
-
- 22 Oct, 2018 3 commits
-
-
Antony Saba authored
-
Antony Saba authored
-
Andrew Newdigate authored
-
- 19 Oct, 2018 1 commit
-
-
Ben Kochie authored
Split the Gitaly rule groups to reduce execution time per group.
-
- 18 Oct, 2018 2 commits
-
-
Ben Kochie authored
* Remove obsolete Prometheus 1.x local storage alerts.
* Simplify queries.
* Add an alert for rule group evaluation taking longer than 70% of the interval.
* Update the slow rule documentation.
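
The 70% threshold mentioned above amounts to a simple comparison between a rule group's evaluation time and its interval. A minimal sketch, assuming both values are available in seconds (the function and parameter names are hypothetical, and this is not the actual alert expression):

```python
def group_is_slow(last_duration_seconds, interval_seconds, threshold=0.7):
    """Return True when a rule group's last evaluation used more than
    `threshold` (70% by default) of its evaluation interval."""
    return last_duration_seconds > threshold * interval_seconds
```

For example, a group with a 60-second interval would be flagged once an evaluation takes longer than 42 seconds.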
-
John T Skarbek authored
* This node is no longer serving as an archive replica
* It now participates as a node in the cluster
* Removes an alert that was crafted for it as an archive replica
* Removes it from the exception list of the rest of our alerts
-
- 17 Oct, 2018 6 commits
-
-
Stan Hu authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
Andrew Newdigate authored
-
- 15 Oct, 2018 2 commits
-
-
Alex Hanselka authored
-
Andrew Newdigate authored
-
- 12 Oct, 2018 2 commits
-
-
Tomasz Maczukin authored
-
John Jarvis authored
-