A status board describes services the way they run: fully operational, with some caveats, or perhaps with a defect or outright failure. Whatever the situation at that moment, a status board reflects the current state.
I believe it should not reflect the current state alone. It may as well take into account the changes associated with the services, past, present and future, and therefore the level of confidence in future continuation, if you will. After all, a service that is at 100% now is not likely to remain at 100% if there are a thousand changes pending application.
Most “status boards” either give you none of the actual current information, or require some level of manual intervention. Either it’s some JSON that someone POSTs somewhere, or it’s a derivative of a subset of information about the environment being pushed back to a central status board node.
So, here’s where status is supposed to come from, in my view:
First, you take it from wherever the people are who are on the receiving end of any issue that occurs. Second, you factor in how the people who are supposed to resolve the issue manage their systems:
- Nagios was already configured to check your systems’ status and alert you where necessary.
If not, you have a problem beyond representing your current status on some board.
- Munin was already configured to tell you something about long-term trends.
If not, you will have a future problem beyond representing your status at that time on some board.
- Puppet was already configured to provide you with a means to dictate the desired state, no matter the discrepancies between the current state and the desired (future) state.
This includes things to run (Puppet), monitor (Nagios), alert about (Nagios), and trend (Munin). Please create, in your mind, a full mesh between the purposes, and de-duplicate for the sake of efficiency. After all, Munin can provide Nagios with information Nagios does not currently have a proper plugin for.
I sincerely hope you have these three pieces of infrastructure sorted out (or their equivalents, if you have had to deal with some inferior alternative), because they provide you with the information you need for an actual status board that requires no further manual input.
The status board system is not supposed to reach out to any of the Nagios, Munin or Puppet (DB) systems. The principle is that a less secure system (one with a public attack surface definitely qualifies) never connects inward to a system that is supposed to be more secure (because, for example, the more secure system contains privileged information about how, where and with which credentials it can obtain certain information).
In other words, a more secure system is supposed to connect to a less secure system, and not the other way around. To illustrate, fractional replication of an LDAP tree from a master in the internal network to a slave in the perimeter network, in order to validate recipient email addresses, is supposed to be a push operation by the master, and never a pull operation by the slave.
OpenLDAP, take note please (pretty please <puppy face/>)!
Red Hat Directory Server, excellent job! Thank you!
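As a rough sketch of the push direction described above, the snippet below has a collector on the internal network open the connection to the status board, so the board never needs credentials or a route inward. The endpoint URL, the payload shape and the function names are all assumptions made for illustration; nothing here is an API of Nagios, Munin or Puppet.

```python
import json
import urllib.request

# Hypothetical endpoint on the status board; an assumption for illustration only.
STATUS_BOARD_URL = "https://status.example.org/api/push"


def build_push_request(url, snapshot):
    """Build a POST request carrying the snapshot as JSON."""
    body = json.dumps(snapshot).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push_snapshot(snapshot, url=STATUS_BOARD_URL, timeout=10):
    """Send the snapshot outward; the connection is opened from the internal side."""
    request = build_push_request(url, snapshot)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Run from cron or a similar scheduler on the internal side, something like this would keep the board current without anyone POSTing by hand.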
Now that you have your meta information in three places, collecting the information into a single status board becomes the pretty part:
Nagios can tell you the current status, Puppet can tell you the status for the near future, and Munin (or, actually, its RRDtool database files) can tell you the foreseeable future.
I seek to combine these three metrics in such a way that the 5-nines myth is either busted completely or proven to hold true.
That’s not completely honest, though: I intend to include a fourth factor, namely certainty and confidence in new software releases. Four factors it is.
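To put the 5-nines claim in concrete terms, here is a small piece of arithmetic showing how much downtime a given availability figure leaves room for. The function name and the 365-day period are choices made for this sketch.

```python
def downtime_budget_minutes(availability, period_days=365):
    """Minutes of allowed downtime over the period for an availability fraction."""
    return (1 - availability) * period_days * 24 * 60


# Five nines (99.999%) leaves roughly 5.26 minutes per 365-day year.
five_nines = downtime_budget_minutes(0.99999)

# Three nines (99.9%) leaves roughly 525.6 minutes, or about 8.76 hours.
three_nines = downtime_budget_minutes(0.999)
```

Any combined metric that claims five nines therefore has to account for every outage and every risky change within those few minutes a year.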
This sprint’s goal though, as the first milestone for the endeavour, is representing the health reported by Nagios, categorized by the Puppet classes included for each node in the environment(s). I have this up for review at Friday’s retrospective.
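A minimal sketch of what that milestone could look like, assuming the Nagios states and the Puppet class lists have already been delivered to the board as plain dictionaries. The data shapes, the function name and the "worst state wins" ordering are assumptions for illustration; only the numeric state codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin return codes.

```python
from collections import defaultdict

# Standard Nagios plugin return codes.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed ordering for "worst state wins": OK < WARNING < UNKNOWN < CRITICAL.
SEVERITY_RANK = {0: 0, 1: 1, 3: 2, 2: 3}


def health_by_class(node_classes, node_states):
    """Return {puppet_class: worst state name} over all nodes with that class."""
    worst = defaultdict(int)  # every class starts out as OK (0)
    for node, classes in node_classes.items():
        # Assumption: a node without any reported checks counts as UNKNOWN.
        states = node_states.get(node) or [3]
        node_worst = max(states, key=SEVERITY_RANK.__getitem__)
        for cls in classes:
            if SEVERITY_RANK[node_worst] > SEVERITY_RANK[worst[cls]]:
                worst[cls] = node_worst
    return {cls: STATE_NAMES[state] for cls, state in worst.items()}


# Hypothetical sample data: classes per node, and check states per node.
node_classes = {
    "web1": ["apache", "base"],
    "db1": ["mysql", "base"],
    "mail1": ["postfix", "base"],
}
node_states = {"web1": [0, 1], "db1": [0, 0]}  # mail1 has reported nothing

summary = health_by_class(node_classes, node_states)
```

Grouping by class rather than by host means a single failing node shows up under every service it participates in, which is closer to how the people on the receiving end experience an issue.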