Commit c05eb61f authored by Pablo Carranza

Add section of things to do while on-call

parent 87ab1c65
@@ -68,6 +68,10 @@ The aim of this project is to have a quick guide of what to do when an emergency
## How do I
### On Call
* [Common tasks to perform while on-call](howto/
### Deploy
* [Get the diff between dev versions](howto/
# So you got yourself on call
To start off on the right foot, let's define a set of tasks that are worth doing before you go any further in your week.
By performing these tasks we keep the [broken window effect]( under control, preventing future pain and mess.
## Things to keep an eye on
### On-call log
First, check this log to familiarize yourself with what has been happening lately. If anything is on fire, it should be written down there in the **Pending actions** section.
### Alerts
Start by checking how many alerts are in flight right now. To do this:
- go to the [fleet overview dashboard]( and check the number of Active Alerts. It should be 0.
- if it is not 0, go to the alerts dashboard and check what is [being triggered]( each alert here should point you to the right runbook to fix it.
- if an alert does not point to a runbook, you have more work to do.
- be sure to create an issue, particularly to declare toil, so we can work on it and suppress it.
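
The triage described above can be sketched as a small helper. This is only an illustration: the `Alert` shape and the `triage` function are hypothetical, and the real alerts dashboard may expose different fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Alert:
    # Hypothetical shape; not taken from any real dashboard API.
    name: str
    runbook: Optional[str] = None  # link to the runbook, if the alert has one


def triage(alerts: List[Alert]) -> Tuple[List[Alert], List[Alert]]:
    """Split active alerts into (has a runbook, needs an issue filed)."""
    with_runbook = [a for a in alerts if a.runbook]
    needs_issue = [a for a in alerts if not a.runbook]
    return with_runbook, needs_issue
```

Alerts in the second list are the ones that mean "more work to do": each should get an issue so the missing runbook, or the toil behind it, is tracked.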
### Nodes status
Go to your chef repo and run `knife status`. Hosts shown in red have not run chef for a long time. Check the on-call log to see whether they are disabled for a particular reason. If they are not, and the on-call log does not mention any ongoing issue, consider jumping in to check why chef has not been running there.
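
If you would rather check this from a script than by eye, the same idea can be sketched as follows. This assumes you have already collected the time of each node's last chef run as epoch seconds (for example from the node's `ohai_time` attribute). The `stale_nodes` helper and the 24-hour threshold are illustrative assumptions, not the exact rule `knife status` applies.

```python
import time
from typing import Dict, List, Optional


def stale_nodes(
    last_checkin: Dict[str, float],
    max_age_hours: float = 24.0,  # assumed threshold, adjust to taste
    now: Optional[float] = None,
) -> List[str]:
    """Return the names of nodes whose last chef run is older than max_age_hours.

    last_checkin maps node name -> epoch seconds of the last chef run
    (hypothetical input, gathered however suits your setup).
    """
    current = time.time() if now is None else now
    cutoff = current - max_age_hours * 3600
    return sorted(name for name, ts in last_checkin.items() if ts < cutoff)
```

Any node this returns is a candidate for the check above: look it up in the on-call log first, and investigate only if nothing there explains why chef is not running.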