9 April 2017

Announcing Deterministic Root Cause

OpsDataStore is pleased to announce a major addition to our existing 1.2 product: Deterministic Root Cause. Leveraging our patent-pending Dynamic Topology Mapping engine allows OpsDataStore to perform anomaly analysis on factual relationships over time. Since we know, for example, upon which virtual server, host, and datastore a transaction is running, and since we know this for a fact (i.e., not via correlation), we can perform accurate pair-wise anomaly analysis on transactions and their supporting virtual and physical infrastructure.

A Brief History of Statistical Root Cause

The idea of having monitoring software watch business-critical online systems and not only tell us when something is wrong (response time is higher than desired), but also tell us what caused the degradation (the root cause), has been something of a holy grail in the software industry since we started moving workloads off of mainframes onto distributed systems in the early 1990s.

Many approaches to automated root cause have been tried. Those include:

  • Neural Nets – In the 1999-2000 time frame, several vendors brought automated root cause solutions to market based upon neural net technologies. These all failed because the neural nets had to be trained on what was normal, and when normal changed they had to be retrained.
  • Simple Bayesian Statistics (Correlation) – This approach involved taking a sample of the data, building a statistical model, and then using the model to understand anomalies. It failed for two reasons: 1) having to rebuild the model every time the data changed was not sustainable, and 2) correlation turned out not to be the same thing as causation.
  • Complex Self-Learning Statistics – This approach had real promise because it removed the requirement to train or build a model. But in order for it to be accurate, it needed an underlying service model (what was related to what) as an input. This was most often pulled from the CMDB, but when systems started changing every hour and every minute (due to virtualization, DevOps, and Agile), the CMDBs were out of date and this approach became unmaintainable.
  • Event Correlation – Many enterprises have large and very expensive legacy event correlation systems (for example, Tivoli Netcool). While these systems may have worked to a certain extent when the world was reasonably static, they have failed to keep up with the pace of innovation. They also fall prey to the fact that in order for a metric to become an event, a human has to set a threshold, and each human sets those thresholds differently.
  • Manager of Managers – The idea here was a system that tied into the consoles of all of the point monitoring tools and integrated their results. This failed because each point tool had its own way of determining what to display, so it was impossible to generalize across such a disparate set of inputs.

In summary, all of the above approaches failed to deliver the three things that customers expected of root cause products:

  1. Alarms that mattered, without missing things that should have been alarms, and without raising alarms that should not have been alarms (false alarms).
  2. The root cause of each alarm.
  3. The information required to actually address the problem.

The Root Cause Challenge Today

The core reason that the root cause problem has not been solved so far is that the nature of the problem keeps changing. It keeps changing because the pace of innovation in our industry keeps accelerating. The current state of what needs to be managed is depicted in the image below. The key points are:

  • Software is now changing in production constantly due to DevOps and Agile
  • The software is now more complicated due to a proliferation of languages and runtime environments
  • The entire infrastructure is virtualized and is being reconfigured on the fly, often driven by automation

The problem depicted below defies solution by any single monitoring vendor, for the simple reason that pieces of the problem are whole-company problems, and the problem keeps changing as the pace of innovation requires new approaches to instrumentation and monitoring. For more background on this problem, read this interview on Read IT Quick.

The Increasingly Difficult and Important Problem to Solve

The Need for Real-Time Topology Mapping

The key insight here is that a CMDB populated by an after-the-fact discovery process like Application Discovery and Dependency Mapping (ADDM) is a legacy approach that stands no chance of keeping up with the pace of change depicted in the image above.

What is needed instead is a system that automatically discovers, in real time, how everything is related to everything else, and that stores these maps over time (a time series of relationships).

The OpsDataStore Dynamic Topology Mapping engine continuously and automatically calculates, in real time, the relationships between transactions and their supporting infrastructure (and between the infrastructure and its applications and transactions), and then stores these maps over time. The map below is created continuously and automatically for every transaction, application, and piece of virtual and physical infrastructure.

The Map for Transactions and Their Physical and Virtual Infrastructure

It is this map, which knows exactly what is related to what right now and for every 5-minute interval back in time, that makes an advance in root cause possible. The advance is possible because, for the very first time, you do not have to rely upon a static service model or an out-of-date CMDB to get the dependency map for the transactions you care about.
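
To make the idea of a time series of relationships concrete, here is a minimal Python sketch. The TopologyStore class, the 5-minute bucketing helper, and the object names are illustrative assumptions for this post, not OpsDataStore's actual schema or API.

    # Minimal sketch of a time series of relationships, bucketed into the
    # 5-minute intervals described above. All names are illustrative.
    from collections import defaultdict
    from datetime import datetime

    INTERVAL_SECONDS = 300  # one topology map per 5-minute interval

    def bucket(ts: datetime) -> int:
        """Map a timestamp to the start of its 5-minute interval (epoch seconds)."""
        return int(ts.timestamp()) // INTERVAL_SECONDS * INTERVAL_SECONDS

    class TopologyStore:
        def __init__(self):
            # interval start -> {object: set of objects it depended on then}
            self._maps = defaultdict(lambda: defaultdict(set))

        def record(self, ts: datetime, obj: str, depends_on: str) -> None:
            self._maps[bucket(ts)][obj].add(depends_on)

        def dependencies(self, ts: datetime, obj: str) -> set:
            """What was `obj` related to during the interval containing `ts`?"""
            return self._maps[bucket(ts)][obj]

    # Example: a transaction ran on a VM, which ran on a host, at 10:02.
    store = TopologyStore()
    now = datetime(2017, 4, 9, 10, 2)
    store.record(now, "txn:StaticAsset", "vm:web-01")
    store.record(now, "vm:web-01", "host:esx-07")
    print(store.dependencies(now, "txn:StaticAsset"))  # {'vm:web-01'}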

OpsDataStore Root Cause

Since OpsDataStore knows for a fact what is related to what right now, and over time, it is possible for OpsDataStore to do anomaly analysis on these relationships. This makes the anomaly analysis process much simpler and much more accurate.

Below is an image of the Root Cause tab in the OpsDataStore Console. Here is how it works:

  • Select a Time Interval (Last 20 days)
  • Pick a Primary Object Category (Transaction)
  • Pick a Primary Metric (Average Response Time)
  • Pick a Baseline (Mean) and how much above the Mean should constitute a Warning and an Alarm (a rough sketch of this baseline idea follows below)
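
Here is that sketch: a hedged illustration in which the mean of the metric serves as the baseline and the warning and alarm levels sit a fixed percentage above it. The 20%/40% offsets and the sample response times are made up for illustration and are not product defaults.

    # Hedged sketch: mean-as-baseline with warning/alarm offsets. The 20%/40%
    # offsets and the sample response times are illustrative, not defaults.
    from statistics import mean

    def thresholds(samples, warning_pct=20.0, alarm_pct=40.0):
        """Return (baseline, warning, alarm) levels for a metric's samples."""
        baseline = mean(samples)
        return (baseline,
                baseline * (1 + warning_pct / 100.0),
                baseline * (1 + alarm_pct / 100.0))

    response_times = [0.48, 0.52, 0.55, 0.50, 0.61, 0.53]  # seconds
    baseline, warn, alarm = thresholds(response_times)
    print(f"baseline={baseline:.2f}s  warning>{warn:.2f}s  alarm>{alarm:.2f}s")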

OpsDataStore then automatically does the following work for you (a code sketch of this walk appears after the list):

  • Finds each transaction in the selected time window whose response time exceeds the selected thresholds.
  • Walks the topology map to find the virtual and physical infrastructure that supports that transaction.
  • Finds infrastructure metrics that are also above the thresholds for those metrics.
  • Shows you the warnings (yellow) and alarms (red) over time for each transaction and its supporting infrastructure.
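
Continuing the hypothetical sketches above, the walk described in this list might look roughly like the following. The metric names, the root_cause_candidates helper, and the two-level walk (transaction to VM to host) are assumptions for illustration, not the product's implementation.

    # Illustrative pair-wise walk: a slow transaction is paired with anomalous
    # metrics on the infrastructure it actually ran on (per the topology map).
    def root_cause_candidates(store, metrics, txn, ts, txn_alarm, infra_alarms):
        """metrics: {(object, metric_name): value} for the interval at `ts`.
        infra_alarms: {metric_name: alarm threshold}."""
        if metrics[(txn, "avg_response_time")] <= txn_alarm:
            return []  # transaction is healthy in this interval

        # Walk the map: VM(s) under the transaction, then host(s) under the VMs.
        vms = store.dependencies(ts, txn)
        hosts = {h for vm in vms for h in store.dependencies(ts, vm)}

        candidates = []
        for obj in vms | hosts:
            for metric, limit in infra_alarms.items():
                value = metrics.get((obj, metric))
                if value is not None and value > limit:
                    candidates.append((obj, metric, value, limit))
        return candidates

    # Using the illustrative numbers from the example below:
    metrics = {("txn:StaticAsset", "avg_response_time"): 1.0,
               ("vm:web-01", "cpu_ready_pct"): 1.1,
               ("host:esx-07", "cpu_ready_pct"): 4.37}
    print(root_cause_candidates(store, metrics, "txn:StaticAsset", now,
                                txn_alarm=0.74,
                                infra_alarms={"cpu_ready_pct": 3.34}))
    # -> [('host:esx-07', 'cpu_ready_pct', 4.37, 3.34)]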

In the example below we are looking at one time point where:

  • The response time of the Static Asset transaction (monitored by AppDynamics in this case) was 1 second, which was above the alarm threshold of 0.74 seconds
  • The CPU Ready (a measure of CPU contention) of the host upon which the VM hosting the transaction is running (a two-step relationship) was 4.37%, which was above its alarm threshold of 3.34%

Furthermore, it is now very easy to find patterns of misbehavior in the infrastructure that affect your business-critical transactions. You can see that there is a pattern of response time issues caused by CPU contention on this host.

Deterministic Root Cause Across Time and Different Monitoring Tools

Pinpointing the Root Cause

From the Root Cause tab we know that CPU Ready issues on the host are causing the transaction to be slow. But from the Root Cause tab alone, we do not know what is causing the CPU Ready on the host to be high. For that we have to look at the Related Metrics tab and configure it to show the CPU Ready for all of the VMs running on that host, to see which one is causing the problem.

The top graph in the image below is the CPU Ready of the host over time. The bottom graph is the CPU Ready of the VMs running on that host over time. It is very clear that the VM in light purple is what is causing the high CPU Ready on the host. That VM happens to be one of the VMs that does builds of the OpsDataStore software, and running builds on the host that is running the production transactions caused the problem. The obvious solution is to vMotion the build VM off of this host.

Root Cause Drill Down
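
As a rough sketch of that drill-down, one could compare the per-VM CPU Ready series against the host's series over the intervals in which the host was in alarm. The worst_vm helper, the series values, and the VM names below are invented for illustration.

    # Find the VM that accounts for the host's CPU Ready while the host alarms.
    def worst_vm(host_series, vm_series, host_alarm):
        """host_series: host CPU Ready % per interval; vm_series: {vm: [%...]}."""
        alarm_points = [i for i, v in enumerate(host_series) if v > host_alarm]
        if not alarm_points:
            return None
        # Average each VM's CPU Ready over only the intervals where the host alarmed.
        return max(vm_series,
                   key=lambda vm: sum(vm_series[vm][i] for i in alarm_points)
                                  / len(alarm_points))

    host = [1.2, 4.4, 4.1, 1.0, 3.9]            # host CPU Ready %, illustrative
    vms = {"vm:web-01":   [0.3, 0.4, 0.5, 0.3, 0.4],
           "vm:build-02": [0.2, 3.8, 3.5, 0.1, 3.2]}
    print(worst_vm(host, vms, host_alarm=3.34))  # -> vm:build-02 (the build VM)

In the product this comparison is made visually from the two graphs; the sketch simply expresses the same reasoning programmatically.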

Summary

The OpsDataStore Dynamic Topology Mapping engine allows for a simple and accurate approach to automated root cause. The new Root Cause tab is available in the OpsDataStore console now.