28 March 2017

Set Your Monitoring Data Free

We recently had a great conversation with a very large global enterprise. They told us that their two strategic initiatives for monitoring in 2017 were:

  1. To set their monitoring data free from the tools that produce this data
  2. To make the resulting “freed up” data consumable by people in whatever tools they choose

We then talked about how OpsDataStore is uniquely positioned to help this enterprise achieve those goals. This blog post covers that conversation, how we are going to help them, and, if you share the same goals, how we can help you.

What Monitoring Data?

Every enterprise has many (between 10 and 300) different tools that produce monitoring data. So where should you focus? The answer is on the combinations of monitoring data that satisfy your target use cases. Let’s have a look at the Architecture for Data Driven IT Operations below.

As you can see there are many types of monitoring data, many sources of monitoring data, many use cases for the combinations of this data, and many required ways to visualize this data. Common use cases include:

  • Making decisions about which resources to replace. For example, you can replace the hosts that are doing the least work while consuming the most power (and therefore costing you the most to own); see the sketch after this list.
  • Getting true end-to-end visibility and root cause. No one tool can see from the users’ actions and their transactions all of the way through the physical and virtual infrastructure. A combined set of related data makes this possible for the first time.
  • Applying big data analytics to capacity gives you a completely different perspective, because it lets you compare things that otherwise sit in management silos. If you are a VMware customer, this approach lets you compare resources across all of your vCenters. If you have an APM tool, it lets you compare across the instances of that tool and their databases.
  • Understanding how transactions are related to their supporting virtual and physical infrastructure allows you for the first time to do deterministic root cause across your entire stack.
  • The cost of running your environment can be analyzed in light of the utilization of the resources and the behavior of your key applications and transactions.
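
To make that first use case concrete, here is a minimal sketch in Python (using pandas) of ranking hosts by how little work they do per watt of power they draw. The input data and column names are hypothetical; in practice the calculation would run against the combined store rather than a hand-built table.

    import pandas as pd

    # Hypothetical per-host utilization and power data pulled from the
    # combined store; column names are illustrative, not a real schema.
    hosts = pd.DataFrame({
        "host":        ["esx01", "esx02", "esx03", "esx04"],
        "avg_cpu_pct": [12.0, 55.0, 8.0, 70.0],      # average CPU utilization
        "avg_watts":   [450.0, 380.0, 520.0, 400.0], # average power draw (e.g. from Intel DCM)
    })

    # Hosts doing the least work for the most power float to the top
    # of the replacement list.
    hosts["cpu_per_watt"] = hosts["avg_cpu_pct"] / hosts["avg_watts"]
    replacement_candidates = hosts.sort_values("cpu_per_watt").head(2)
    print(replacement_candidates)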

The crucial point of “setting your monitoring data free” is that people can use it to meet THEIR needs, without IT (you) having to do back-flips to help them.

An Instrumentation Architecture

One of the things that monitoring has always lacked is an architecture. An architecture for monitoring would specify what metrics are important to collect at each layer of the stack and how each set of metrics relates to everything else. An example instrumentation architecture is below.
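
The post does not prescribe a concrete format for such an architecture, but as an illustration, a per-layer specification might look like the sketch below. The layers, metric names, and relationship keys are assumptions chosen for the example, not a definitive list.

    # Illustrative only: one way to express which metrics each layer of the
    # stack should emit, and which keys relate a layer to its neighbors.
    INSTRUMENTATION_SPEC = {
        "transaction": {
            "metrics": ["response_time_ms", "calls_per_min", "errors_per_min"],
            "relates_to": {"vm": "host_name"},   # the APM agent reports the host it runs on
        },
        "vm": {
            "metrics": ["cpu_ready_ms", "cpu_usage_pct", "mem_swap_in_kbps"],
            "relates_to": {"host": "parent_host_id", "datastore": "datastore_id"},
        },
        "host": {
            "metrics": ["cpu_usage_pct", "power_watts"],
            "relates_to": {"datastore": "datastore_id"},
        },
        "datastore": {
            "metrics": ["read_latency_ms", "write_latency_ms"],
            "relates_to": {},
        },
    }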

The Required Product Architecture

This is where things get challenging. Once you decide to combine monitoring data from N tools every minute, you have created a big data problem. The combined volume of data from your APM tools and your infrastructure is more than any single monitoring tool can handle. Furthermore, if you want this data to be ingested (consumed) in real time, processed (enriched and baselined) in real time, and made consumable in real time, you need a world-class, low-latency, real-time architecture to pull this off.
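
To put rough numbers on that claim (the figures below are assumptions for illustration, not measurements from any customer), even a mid-sized estate generates metric volumes that quickly outgrow a single-node store:

    # Back-of-the-envelope sizing with assumed (not measured) figures.
    vms             = 5_000       # virtual machines in the estate
    metrics_per_vm  = 20          # infrastructure metrics collected per VM
    apm_metrics     = 50_000      # additional transaction/APM metrics per interval
    samples_per_day = 24 * 60     # one sample per metric per minute

    points_per_day = (vms * metrics_per_vm + apm_metrics) * samples_per_day
    print(f"{points_per_day:,} data points per day")   # 216,000,000 per day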

Below is the architecture of the OpsDataStore platform for data driven IT Operations.

Here is how our product architecture is uniquely suited to allow you to “set your monitoring data free”:

  • The OpsDataStore Data Collector SDK allows any source of data to be consumed via a plug-in to the SDK. We have existing plug-ins for VMware vSphere, AppDynamics, Dynatrace, ExtraHop, and Intel DCM data. We have vendors, customers, and VARs writing additional plug-ins.
  • The Object Model (more on this below) establishes the relationships between the disparate metrics streams. So OpsDataStore is not a “stupid metric store” in the sense that log management products are “stupid log stores”.
  • The Ingest Pipeline puts the data and the relationships in Cassandra, a high-speed, scale-out store. Additional load can be handled simply by adding more Cassandra nodes.
  • The raw data is then processed (the middle box) in a set of Spark jobs that calculate means, aggregations, standard deviations, covariances, and baselines for each metric. Spark, like Cassandra, is a clustered, scale-out system.
  • Finally, the enriched data is written to a high-speed decision-support database, the “RAM DB” in the right block of the diagram above. This allows the enriched and combined data to be consumed in tools like Tableau, Qlik, Birst, Excel, or anything else that can talk to an ODBC or JDBC driver (a minimal query sketch follows this list). The RAM DB also clusters and scales out, like Cassandra and Spark.
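
Because the enriched data sits behind standard ODBC/JDBC drivers, any SQL-capable client can read it. Below is a minimal sketch in Python using pyodbc; the DSN, table names, and column names are assumptions for illustration, not the actual OpsDataStore schema.

    from datetime import datetime, timedelta
    import pyodbc

    # Hypothetical DSN and schema; substitute the connection details and
    # table/column names from your own deployment.
    conn = pyodbc.connect("DSN=opsdatastore")
    cursor = conn.cursor()

    one_hour_ago = datetime.utcnow() - timedelta(hours=1)
    cursor.execute("""
        SELECT t.transaction_name,
               t.avg_response_time_ms,
               v.vm_name,
               v.cpu_ready_ms
        FROM   transaction_metrics t
        JOIN   vm_metrics v ON v.vm_id = t.vm_id
        WHERE  t.sample_time >= ?
        ORDER  BY t.avg_response_time_ms DESC
    """, one_hour_ago)

    for row in cursor.fetchall():
        print(row.transaction_name, row.avg_response_time_ms,
              row.vm_name, row.cpu_ready_ms)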

The bottom line is that this is the only data architecture in the management software industry that can consume and process all of your monitoring and capacity metrics at scale AND let you consume them in the tool of your choice.

The Relationships are the Key

If you are collecting logs from N different sources, it is not possible to know anything about how those sources of data relate to each other. That is why log stores are “stupid log stores”: the underlying data does not lend itself to deterministic relationships.

However, in the case of IT Operations and Application Performance metrics, it is possible (if you have our patent-pending Dynamic Object Model) to establish the relationships between objects and metrics from disparate tools at ingest time. This uniquely allows OpsDataStore to build the topology map across your entire IT environment every few minutes. The map below is created on the fly and stored over time.

The key is that every five minutes OpsDataStore discovers the relationships between your transactions, monitored by market-leading APM tools like AppDynamics and Dynatrace, and the infrastructure they run on, and updates the map above accordingly. No one else does this.
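
As a simplified illustration of what relating metric streams at ingest time looks like (this is a sketch of the general idea, not the patent-pending Dynamic Object Model itself): the APM tool reports which host each application node runs on, vSphere reports which ESX host and datastore each VM uses, and the two streams can be joined on the shared VM/host name.

    from collections import defaultdict

    # Hypothetical records from two collectors; field names are illustrative.
    apm_nodes = [
        {"app": "checkout", "tier": "web", "host_name": "vm-web-01"},
        {"app": "checkout", "tier": "db",  "host_name": "vm-db-01"},
    ]
    vsphere_vms = [
        {"vm_name": "vm-web-01", "esx_host": "esx01", "datastore": "ds-gold-01"},
        {"vm_name": "vm-db-01",  "esx_host": "esx02", "datastore": "ds-gold-02"},
    ]

    # Build topology edges by joining on the shared VM/host name.
    vms_by_name = {vm["vm_name"]: vm for vm in vsphere_vms}
    topology = defaultdict(list)
    for node in apm_nodes:
        vm = vms_by_name.get(node["host_name"])
        if vm:
            topology[node["app"]].append(
                (node["tier"], vm["vm_name"], vm["esx_host"], vm["datastore"])
            )

    print(dict(topology))
    # {'checkout': [('web', 'vm-web-01', 'esx01', 'ds-gold-01'),
    #               ('db', 'vm-db-01', 'esx02', 'ds-gold-02')]}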

What Does “Freed Up” Data Look Like?

The short answer is that when you “set your monitoring data free”, it should be consumable in any tool that any user wants to use to solve any problem that is important to them. So there is no one answer to this question. Here are some examples.

The relationship, shown in Excel, between CPU Ready (a measure of contention for CPU resources) on Virtual Machines and the Transaction Response Time of the transactions running on those Virtual Machines.

The slowest Datastores (the Datastores with the highest latency), shown in Tableau. This demonstrates the power of being able to consume data across N vCenters and rank things across all of them.
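
The underlying query is simple once the data from every vCenter lands in one store. Continuing with the hypothetical pyodbc connection and schema from the earlier sketch (again, the table and column names are illustrative):

    # Rank datastores by average latency across every vCenter at once.
    cursor.execute("""
        SELECT datastore_name,
               vcenter_name,
               AVG(read_latency_ms + write_latency_ms) AS avg_latency_ms
        FROM   datastore_metrics
        GROUP  BY datastore_name, vcenter_name
        ORDER  BY avg_latency_ms DESC
    """)
    for name, vcenter, latency in cursor.fetchmany(10):
        print(f"{name} ({vcenter}): {latency:.1f} ms")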

The resource utilization or contention for any group of virtual or physical servers that you care about. Below we are showing you the CPU Ready (how much time the virtual server is waiting to get CPU allocation) for the VMs running an instance of OpsDataStore in a VMware environment.

Metrics for one object in the context of related objects. For example, the CPU Ready of a Host, compared to the CPU Ready of the VMs running on that host. This lets you clearly see which VM is causing the CPU contention on the host and which VMs are being impacted by the contention.

Metrics can be analyzed in light of the topology for something that you care about, like a transaction. Below is an example of a topology map that OpsDataStore creates every 5 minutes from a combination of the data from AppDynamics and the data from VMware vSphere. It shows you exactly where your transactions of interest are running in your physical and virtual infrastructure.

From the topology map above we can go directly to the metrics for that map for any range of time. In the image below we have filtered the metrics to show just the performance metric for the application (response time) and the performance metric for the datastore that the transaction runs on (disk latency).

Finally, we can do deterministic root cause on topologies. In the example below, we have configured the Root Cause tab in the OpsDataStore console to take the following steps (a rough sketch of this logic follows the list):

  • Use a three-month, time-of-day, day-of-week maximum baseline for all metrics. This baseline is the average of the maxima over the last 12 weeks for each hour of the day and day of the week.
  • Compare the transaction response times for all transactions to this baseline.
  • When the transaction response time is 50% more than the baseline, walk the topology map for the transaction.
  • Find metrics for the infrastructure that supports that transaction that are also 50% above their baseline.
  • When both the transaction response time and an infrastructure metric (like the memory swap-in and swap-out rate in the example below) are more than 50% above their baselines, turn the corresponding cell in the grid red.
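
A rough sketch of that logic is below. It is deliberately simplified and uses hypothetical data structures; in the product the baselining runs in the Spark layer described earlier.

    from statistics import mean

    def baseline(history, hour, weekday, weeks=12):
        """Average of the per-hour maxima for this hour of day and day of week
        over the last `weeks` weeks. `history` maps (week, weekday, hour) -> max."""
        samples = [history[(w, weekday, hour)]
                   for w in range(weeks) if (w, weekday, hour) in history]
        return mean(samples) if samples else None

    def breaches(value, base, threshold=1.5):
        """True when the current value is more than 50% above its baseline."""
        return base is not None and value > threshold * base

    def root_cause(txn, infra_metrics, hour, weekday):
        """If the transaction breaches its baseline, walk its topology and return
        the supporting infrastructure metrics that also breach (cells to turn red)."""
        if not breaches(txn["response_time_ms"], baseline(txn["history"], hour, weekday)):
            return []
        return [m["name"] for m in infra_metrics
                if breaches(m["value"], baseline(m["history"], hour, weekday))]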

This allows you, for the first time, to see patterns of misbehavior in the infrastructure that are affecting your transactions.

Summary

By organizing the data from multiple tools like AppDynamics, Dynatrace, ExtraHop, Intel, and VMware in an easy-to-access, high-performance, RAM-based column store, OpsDataStore makes it easy to visualize the data in multiple ways in the tool of your choice. This leads to Data-Driven IT Operations, and to Data-Driven Decisions in IT Operations.