Consolidate Performance Monitoring Metrics to Realize AIOps Benefits
The Ever-Increasing Importance of Monitoring
Nearly every business today relies upon hardware and software technologies to deliver its products and services to its key business constituents. This reliance will only increase as enterprises aggressively digitize their business processes to deliver products and services more rapidly, stay competitive, increase productivity and improve their bottom line.
Once a process is online or ‘digitized’, the battle has only begun. Enterprises with online applications that interface with prospects, customers, suppliers, and partners find that they must then continuously and rapidly evolve those applications to keep them competitive in the marketplace.
IT Operations must support these digitization efforts with platforms that are flexible and dynamic, and that rapidly scale to meet emerging or periodic demands.
Along with all the above, the pace of innovation continues to increase. And with each innovation new applications and services (for example Docker containers and micro-services) are created that need to be monitored.
The combination of dynamic behavior across the stack and the pace of innovation places both a great deal of importance and a great deal of stress upon IT Operations to deliver the required performance, reliability and capacity in the IT infrastructure.
Google’s Four Golden Signals
In the book “Site Reliability Engineering” the authors (all members of the Site Reliability Engineering Team at Google) recommend that monitoring focus upon the following four “golden signals”:
- Latency – also known as response time, this measures how long it takes for a unit of work to occur. Note that Google’s definition of latency encompasses the entire stack from the transactions through the compute, network, and storage hardware. In fact, at Google the owner of each component of an application system has a budget for how much latency that component contributes to the total.
- Traffic – also known as throughput, this measures how much work gets done per unit of time. Transactions per second, network packets per second, and I/O Operations per Second are all measures of traffic and throughput.
- Errors – this measures all errors across the entire stack, from errors seen by users in browsers to errors at the network layer (dropped packets) to the inability of services or infrastructure to respond to requests in the required time frames.
- Saturation – also known as contention, saturation measures the degree to which a service or a piece of infrastructure is “overloaded” so that requesting applications and services have to wait in line. Examples of saturation are queues for database connections, web connections, CPU resources (CPU Ready in virtualized environments), memory (leading to memory swapping to disk), and queues in network and storage devices.
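As a rough illustration of the four signals, the sketch below summarizes one monitoring window of request records. The `Request` fields, the `golden_signals` function and the sample values are assumptions made for this example; they are not taken from the book or from any monitoring product, and real tools measure each signal in many more places in the stack.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One unit of work observed during a monitoring window (hypothetical shape)."""
    latency_ms: float     # how long the request took end to end
    ok: bool              # False if the request ended in an error
    queue_wait_ms: float  # time spent waiting in line for a resource


def golden_signals(requests, window_seconds):
    """Summarize one window of requests into the four golden signals."""
    count = len(requests)
    if count == 0:
        return {"latency_ms": 0.0, "traffic_per_s": 0.0,
                "error_rate": 0.0, "saturation_ms": 0.0}
    return {
        # Latency: average response time per unit of work
        "latency_ms": mean(r.latency_ms for r in requests),
        # Traffic: units of work completed per second
        "traffic_per_s": count / window_seconds,
        # Errors: fraction of requests that failed
        "error_rate": sum(1 for r in requests if not r.ok) / count,
        # Saturation: average time spent queued, used here as a simple proxy
        "saturation_ms": mean(r.queue_wait_ms for r in requests),
    }


# Illustrative data for a single one-minute window
sample = [
    Request(latency_ms=120.0, ok=True, queue_wait_ms=5.0),
    Request(latency_ms=180.0, ok=True, queue_wait_ms=15.0),
    Request(latency_ms=300.0, ok=False, queue_wait_ms=40.0),
]
signals = golden_signals(sample, window_seconds=60)
```

Average queue wait is only one possible proxy for saturation; queue depth or CPU Ready time would serve equally well depending on the layer being measured.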
The Key Vectors of Innovation
The IT industry continues to innovate rapidly in pursuit of ever improved price/performance of infrastructure, flexibility of infrastructure, and the ability to implement business processes in software more quickly at a lower cost. These layers of innovation are shown in the diagram below.
The Modern Enterprise Monitoring Challenge
When critical business services are digitized, they must be monitored to ensure they are working and delivering acceptable performance to the users of each application or business service. Dynamic infrastructure combined with rapid application innovation means that many different points in the stack must be monitored frequently (at least once a minute).
The metrics that comprise the Four Golden Signals need to be collected from each transaction in each micro-service, all of the way through the entire software stack to the compute, networking and storage hardware. If this needs to be done once a minute, the enormity of the challenge requires enterprises to own multiple monitoring tools. It is simply not possible for any one vendor of monitoring tools to cover the entire stack and keep up with the pace of innovation across it.
The Current State of Fragmented Performance Monitoring
The pace of change and the diversity at each layer of the stack mean that dedicated monitoring solutions are necessary for that layer. As a result, enterprises often have twenty or more different monitoring tools to cover their stack.
Fragmented Monitoring in the Enterprise
The Costs of Fragmented Monitoring Data
While each team with dedicated tools may solve their needs in isolation, the needs of the enterprise as a whole cannot be met by fragmented monitoring. Fragmented monitoring creates the following business issues:
- The data is locked up in the proprietary databases of the tool vendors, making any kind of analysis of the data, especially analysis across tools, difficult and error prone.
- If all of the tools separately feed their alarms (events) into an event management system, then all of IT is in a constantly reactive posture to events and alarms as they occur.
- It is impossible to run IT as a business, since the performance of key applications and transactions is not tied to the behavior of the supporting infrastructure.
- Capacity is managed with simple utilization-based rules that cause significant over-provisioning to avoid the risk of application performance and reliability issues.
For the above reasons, it is imperative that monitoring metrics be consolidated.
Options for Consolidating Monitoring Metrics
For guidance on how to consolidate monitoring metrics, we can start by looking at another class of IT data that is already being consolidated in many enterprises: logs. Many enterprises consolidate their application, system and security logs in log stores like Splunk, Elastic and others. From this, enterprises have learned:
- If your log store vendor charges by the amount of data ingested per day, ingesting all of your logs gets really expensive really quickly. And, that expense is extremely unpredictable. Most enterprises end up filtering the logs that go into the log store to try to manage this expense.
- A log store is great for after-the-fact forensics. If you already know that something has gone wrong with an application, a service or a piece of infrastructure, you can query the log store over the relevant time period and make useful progress quickly.
- Log stores are not designed to store objects together with their metrics and relationships over time. But relationships are key to understanding the dependencies across application systems.
The second option is to consolidate your metrics into a time series database. This has the advantage of being optimized for the analysis of metrics over time. But time series databases alone simply capture pairs of objects and metrics and, like log stores, have no way of knowing what is related to what, which is a critical requirement for proactive analytics.
The third option is to use a platform for AIOps like OpsDataStore that is specifically designed to capture the objects, metrics and relationships in a modern virtualized and cloud-based environment.
Understanding Objects, Metrics and Relationships
To implement AIOps across your stack, the metrics need to be captured from the tools that monitor each layer of the stack, and then, crucially, relationships need to be created across those tools and platforms.
OpsDataStore Relationships Across Tools and Platforms
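To make the distinction between metrics and relationships concrete, here is a minimal sketch of the idea. It is not OpsDataStore's data model: the `record`, `relate` and `supporting_objects` helpers and all object names (such as `txn:checkout` and `vm:web-01`) are hypothetical. The first dictionary holds what a time series database would hold, and the second holds the relationships it lacks.

```python
from collections import defaultdict

# Metric samples keyed by (object, metric name): the part a time series store covers.
metrics = defaultdict(list)
# Directed "depends on" links between objects: the part that has to be added.
depends_on = defaultdict(set)


def record(object_id, metric_name, timestamp, value):
    """Store one metric sample for an object."""
    metrics[(object_id, metric_name)].append((timestamp, value))


def relate(dependent, dependency):
    """Declare that one object depends on another."""
    depends_on[dependent].add(dependency)


def supporting_objects(object_id):
    """Return every object reachable from object_id through depends_on links."""
    seen, stack = set(), [object_id]
    while stack:
        for dependency in depends_on.get(stack.pop(), ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


# Hypothetical objects reported by three different monitoring tools
relate("txn:checkout", "vm:web-01")       # from a transaction monitoring tool
relate("vm:web-01", "host:esx-07")        # from the virtualization platform
relate("vm:web-01", "datastore:ds-3")     # from the storage layer
record("txn:checkout", "response_ms", 0, 210.0)
record("host:esx-07", "cpu_ready_pct", 0, 4.5)

infrastructure = supporting_objects("txn:checkout")
```

With the links in place, a question such as "which hosts and datastores sit underneath the checkout transaction" becomes a graph walk rather than a manual correlation across tools.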
The OpsDataStore Architecture for AIOps
OpsDataStore believes that the metrics, objects and relationships that are needed for an effective AIOps strategy come from many different platforms (VMware vSphere, AWS, EMC and NetApp enterprise infrastructure) and many different monitoring tools, which span transactions (AppDynamics and Dynatrace) and the behavior of applications and services over the network (ExtraHop).
These various streams of metrics then need to be consolidated into a real-time big data store. The relationships between the objects that provided the metrics need to be established at the time that the metrics are ingested (not through after-the-fact discovery). Then the metrics need to be baselined and aggregated, and statistics on the metrics need to be calculated (mean, min, max, standard deviation, covariance). Finally, the metrics need to be transformed into a datastore that is optimized for queries by commonly used BI tools like Tableau. This enables the valuable analytics which are listed below and which are part of OpsDataStore.
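The statistics named above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the OpsDataStore implementation: the `summarize` and `covariance` functions and the two sample series are invented for the example, and it uses sample (n - 1) statistics, which is only one of several reasonable choices.

```python
from statistics import mean, stdev


def summarize(values):
    """Baseline statistics for one metric series over an aggregation window."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "stdev": stdev(values),  # sample standard deviation
    }


def covariance(xs, ys):
    """Sample covariance of two equally long metric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical samples for a transaction and the host it runs on
response_ms = [100.0, 110.0, 130.0, 160.0]
cpu_ready_pct = [1.0, 2.0, 4.0, 7.0]

baseline = summarize(response_ms)
relationship = covariance(response_ms, cpu_ready_pct)
```

A positive covariance between a transaction's response time and a resource metric on its related infrastructure is a hint, not proof, that the two move together; it is only meaningful because the relationship between the two objects is already known.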
OpsDataStore uniquely collects the metrics across the application, compute, network and storage stack and automatically and continuously creates the relationships between them. This enables the following unique AIOps analytics:
- Enterprise topology maps are created for each transaction showing how that transaction is mapped into its supporting virtual and physical infrastructure over time.
- Proactive analytics are produced that identify deterioration from normal performance in transactions and infrastructure. An example is ranking the transactions whose peak 5-minute response time is increasing at the fastest rate over the last day, week, or even month.
- Cross-Stack Anomaly Analytics tie infrastructure performance to the performance of critical business transactions to automatically determine the abnormally performing infrastructure being used by an application transaction at the time it is also performing abnormally.
- Infrastructure resource capacity is assessed and ranked in terms of the performance of the business transactions running on the related infrastructure.
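The ranking described in the proactive analytics item above can be approximated with a simple trend fit. The sketch below is one possible approach, not the method OpsDataStore uses: it fits a least-squares line to each transaction's peak 5-minute response times and sorts by slope. The function names, transaction names and data points are hypothetical.

```python
def slope(points):
    """Least-squares slope of (hours, peak_ms) points, in ms per hour."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        return 0.0  # a single timestamp carries no trend
    return sum((t - mean_t) * (v - mean_v) for t, v in points) / denominator


def rank_by_deterioration(series_by_transaction):
    """Order transaction names from fastest-growing peak response time to slowest."""
    return sorted(
        series_by_transaction,
        key=lambda name: slope(series_by_transaction[name]),
        reverse=True,
    )


# Hypothetical peak 5-minute response times (ms), sampled every six hours over a day
peaks = {
    "checkout": [(0, 200.0), (6, 230.0), (12, 260.0), (18, 290.0)],
    "search": [(0, 150.0), (6, 152.0), (12, 149.0), (18, 151.0)],
    "login": [(0, 90.0), (6, 96.0), (12, 102.0), (18, 108.0)],
}
ranking = rank_by_deterioration(peaks)
```

A linear fit is deliberately simple; it ignores daily seasonality, so a production analysis would normally compare against a baseline for the same time of day.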
As companies continue to transition towards full digitization, monitoring becomes increasingly important to ensure a seamless user experience and improve their bottom lines. To do this effectively, the monitoring of metrics across your key tools and platforms must be consolidated into a big data back end that establishes the relationships across transactions and their supporting virtual and physical infrastructure over time. Once this is done, sophisticated and easy-to-consume analytics can be produced that allow IT to optimize performance, throughput, capacity and reliability.
Learn more about how OpsDataStore enhances AIOps analytics.