Monday 31 January 2011

Dashboards for Production Monitoring

We've been using CA Wily Introscope to monitor one of our production systems for several years. Some dashboards had been developed when the tool was first installed and these have been in use ever since. These dashboards provided useful information, but had been built from a technical perspective and didn't really reflect the way that we usually think about our system. For example, the top level dashboard showed items like 'JMS Subsystem' and 'Java Heap'. When one of these indicated an issue, it wasn't really very clear which service was being impacted… and in some cases whether there was really a problem at all.

Following the experience with Wily for performance test monitoring (see last month's posting) I decided to spend some time creating new dashboards that were a better match for the way we think about our system - i.e. as a collection of services, each of which might be working well or having some problems. I therefore wanted a top-level dashboard with one traffic light indicator for each of our key services, with drill-down into a more detailed dashboard for each individual service to allow problems to be tracked down. I will return to the top-level dashboard later.

This posting describes what I wanted to do, what I did and what I learned in the process. It's mostly about Wily but other tools are available so I hope (dear reader) that the principles described would apply reasonably well to a different tool.

Individual Service Dashboards

To figure out whether an individual service is working properly we need to monitor several things…
  • What is the response time of the service and is it within acceptable limits?
  • What is the throughput for the service?
  • For a JMS messaging service, how many messages are sitting in the queue?
  • For a JMS service, does the queue have one or more listeners connected?
  • Are there any messages in the error queue?
  • Is the application reporting an abnormal number of errors?
All of the above are available in what Wily terms the 'front end' or in the JMX data for the relevant JMS destination.

Our application is clustered, so in most cases I needed to create a Metric Group to bring together the numbers for a particular metric from each member of the cluster. For most of these I then created an Alert to contain the threshold values that would trigger an amber or red warning.
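
To make the idea concrete, here is a minimal sketch in Python (not Wily configuration) of what a metric group plus an alert amounts to. It assumes the alert takes the worst value across the cluster members and compares it with an amber and a red threshold; the server names, values and thresholds are invented for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    AMBER = 1
    RED = 2


def evaluate_alert(values_by_server, amber, red):
    """Return the traffic light for one metric across all cluster members.

    Assumption: the alert is driven by the worst (highest) value reported
    by any member, and higher values are worse.
    """
    worst = max(values_by_server.values())
    if worst >= red:
        return Status.RED
    if worst >= amber:
        return Status.AMBER
    return Status.GREEN


# Hypothetical response times (ms) for one service on three managed servers.
response_ms = {"server1": 120, "server2": 450, "server3": 95}
status = evaluate_alert(response_ms, amber=400, red=1000)
print(status.name)  # AMBER
```

Using the worst member rather than the average means that a single struggling server still turns the light amber, which is usually what you want to see on a dashboard.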

Sometimes I was able to find a Service Level Agreement (SLA) to set the alert thresholds. In cases where there was no SLA I used the Wily Investigator to get a picture of the average and maximum values for the metric over the past month and set thresholds based on this.
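
As a rough illustration of the no-SLA case, the sketch below derives candidate thresholds from a month of historical samples. The multipliers are purely my assumption for the example (amber at twice the average, red at one and a half times the observed maximum); the right factors depend on how noisy the metric is.

```python
from statistics import mean


def suggest_thresholds(samples, amber_factor=2.0, red_factor=1.5):
    """Suggest (amber, red) thresholds from historical metric values.

    The factors are illustrative assumptions, not values taken from Wily.
    """
    average = mean(samples)
    peak = max(samples)
    amber = average * amber_factor
    # Red should never sit below amber, even for a very flat history.
    red = max(peak * red_factor, amber)
    return amber, red


# Hypothetical response times (ms) sampled over the past month.
history = [100, 120, 110, 300, 90]
print(suggest_thresholds(history))  # (288.0, 450.0)
```

In practice I would treat the output as a starting point and adjust it after watching the alert for a week or two.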

For some metrics (e.g. throughput) it was not appropriate to create an Alert. For the error alerts on our user interface I needed to ask Wily to count how many occurrences happened in a given period of time (e.g. 5 minutes) because in many cases a small error rate is quite normal and acceptable.
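
Counting occurrences over a period is essentially a sliding window. The sketch below shows the logic in Python, assuming errors arrive as timestamps in seconds and in time order; the class name, the 5 minute window and the threshold are all illustrative and not part of Wily.

```python
from collections import deque


class ErrorRateWindow:
    """Count errors seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds=300, threshold=10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp):
        # Assumes timestamps are recorded in non-decreasing order.
        self._timestamps.append(timestamp)

    def count(self, now):
        # Discard errors that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)

    def is_breached(self, now):
        # A few errors are normal; only alert above the threshold.
        return self.count(now) > self.threshold


window = ErrorRateWindow(window_seconds=300, threshold=2)
for t in (0, 100, 200):
    window.record(t)
print(window.is_breached(250))  # True: three errors in the last 5 minutes
print(window.is_breached(350))  # False: the error at t=0 has aged out
```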

The service dashboards typically contain a graph to track the values of each metric across all managed servers plus a traffic light showing the status of the alert. In most cases I found it best to bind the graph to the alert rather than the metric group so that I got an amber and a red line on the graph to show the trigger points for each alert like the example below.


Laying out the dashboards was challenging in some cases and for some more complex services I needed to resort to multiple layers of drill-down to show all of the graphs that were needed. In some cases I did this by having a higher level dashboard with 'throughput', 'response time' and 'queue backlog' panels and providing drill-down to more detailed dashboards from here. For other services it was more appropriate to break the service down into functional elements (sub-services) and provide a second level drill-down dashboard for each of these. Here's an example.


Top Level Dashboards

Wily allows us to combine individual alerts into higher level summary alerts - in multiple tiers if necessary. This is exactly what I did to create the top level dashboard and also for the summary dashboard where I broke things down into sub-services. The top level dashboard with one traffic light per service is shown below. I also added a CPU usage graph below the traffic lights for information (and because there was some free space).
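
The roll-up can be pictured as a tree in which each summary takes the worst status of its children. The Python sketch below assumes exactly that rule; the service and alert names are made up for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    AMBER = 1
    RED = 2


def summarise(node):
    """Roll up a tree of alerts into a single status.

    A node is either a Status (an individual alert) or a dict mapping
    names to child nodes (a summary alert). Assumption: a summary shows
    the worst status of anything beneath it.
    """
    if isinstance(node, Status):
        return node
    return max((summarise(child) for child in node.values()),
               default=Status.GREEN)


# Hypothetical two-tier structure: services, then per-service alerts.
services = {
    "Orders": {"response time": Status.AMBER, "throughput": Status.GREEN},
    "Billing": {"queue backlog": Status.GREEN, "error queue": Status.GREEN},
}

top_level = {name: summarise(alerts) for name, alerts in services.items()}
print(top_level["Orders"].name)   # AMBER
print(top_level["Billing"].name)  # GREEN
print(summarise(services).name)   # AMBER
```

One consequence of the worst-of rule is that every alert feeding a summary needs to be visible somewhere on the drill-down dashboard, otherwise the summary can change colour with no obvious cause.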


Tricky Things

So far so good, but not everything went totally as I wanted…
  • I couldn't find a way to combine metrics using any kind of logic - for example if I wanted to raise an alert if a queue had zero listeners AND had a non-zero message count.
  • I have not so far found any way to turn cumulative metrics (e.g. events since startup) into a number of events per interval. This would be very useful for some JMX metrics that are only available as a cumulative number.
  • For WebLogic JMS queues it is really necessary to know both the 'current' message count and the 'pending' (i.e. being processed or buffered for delivery) count and to sum these together. It is possible to do this using Wily calculators but this gets complicated in a cluster because we really need a separate calculator per metric per cluster member… which is pretty time-consuming to set up. In practice I have ignored the pending count in most cases - except for those parts of the application where it is really critical.
  • Debugging summary alerts can be a bit tricky - I found one that kept going amber or red on the top level dashboard but after drilling down into the detailed dashboard everything was green. In the end I traced it to a duplicate alert that was misconfigured and not shown on the dashboard but it took a lot of manual checking to track this down.
  • Wily can't 'see' everything. This isn't really a tool issue, but it does mean that 'green' on the Wily dashboard doesn't necessarily mean that everything is fine. For example, some parts of our application keep internal 'transaction queues' in application database tables and other parts use background jobs in the database which may go wrong in interesting ways without Wily having any visibility. The Wily Oracle Agent might help here, but at the moment we don't have it installed.
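
For reference, the logic I was looking for in the first three points above is simple to express outside the tool. The Python sketch below is not something Wily provides; it assumes the raw numbers have already been collected (for example from JMX) and all function and parameter names are my own.

```python
def per_interval(cumulative_samples):
    """Turn cumulative counter readings into events per interval.

    A reading lower than the previous one is treated as a counter reset
    (e.g. a server restart), so the new reading is used as the delta.
    """
    deltas = []
    for previous, current in zip(cumulative_samples, cumulative_samples[1:]):
        deltas.append(current - previous if current >= previous else current)
    return deltas


def queue_depth(current_count, pending_count):
    """Total backlog: messages waiting plus messages pending."""
    return current_count + pending_count


def stalled_queue(listener_count, current_count, pending_count=0):
    """True if nothing is listening AND there are messages to process."""
    return listener_count == 0 and queue_depth(current_count, pending_count) > 0


# Hypothetical readings of an 'events since startup' counter.
print(per_interval([10, 15, 15, 40, 3, 8]))  # [5, 0, 25, 3, 5]
print(stalled_queue(listener_count=0, current_count=3))  # True
print(stalled_queue(listener_count=2, current_count=3))  # False
```

In a cluster the same functions would be applied per member and the results summed, which is exactly the per-metric, per-member bookkeeping that makes the calculator approach time-consuming.
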

Future Possibilities

As I said at the beginning, my dashboards are focused on monitoring the services supported by our application. As a result, they concentrate on the 'front end' metrics. Wily also provides a lot of visibility of the 'back end' (e.g. JDBC calls into Oracle) and infrastructure (e.g. Java garbage collector behaviour) but none of these are used by my dashboards. I figured that if there is a back end issue then it must have a visible effect on the front end service, otherwise it isn't really an issue at all. I would therefore expect the dashboards to detect a problem and flag it as an amber or red condition. It would, of course, be nice to be able to drill down and isolate this problem to the database (or whatever), but I've left that as a possible improvement for the future.