Avoiding Gartner’s 10 Worst Monitoring Mistakes
Gartner has published a new note, “Avoid These ‘Bottom 10’ Monitoring Worst Practices”. These practices increase costs, lengthen outages, delay problem corrections, and lead to unhappy customers (the business).
Let’s go through these one at a time and see what can be done to avoid them.
Gartner’s “Bottom 10 Monitoring Worst Practices”
#1 – Blindly Repeating One’s Predecessor’s Steps
The core problem here is that what has been done before, and most likely what is being done currently, does not incorporate the realities of the modern IT estate: digitization (many business processes implemented in software and constantly being enhanced), dynamic and scale-out micro-services, new layers of abstraction such as containers, network virtualization, and storage virtualization, multiple clouds, and sheer scale, from the transactions down through all of the infrastructure.
The tendency in many organizations is not to plan in advance for how to monitor the modern IT stack, but to react to problems as they occur and then scurry around buying one or more tools to fill the perceived gaps.
The correct approach is to put a monitoring architecture in place anchored by a multi-vendor big data platform that is capable of providing IT Operations a “single pane of glass” across all platforms and tools.
#2 – The Cross-Domain Blame Game
Also known as “tool wars”, this occurs when the war room is convened and everyone uses their own tool to exonerate their part of the stack and to point the finger at someone else’s. Again, the root cause is that the monitoring tools are integrated neither with each other nor with the cloud platforms where the business-critical applications run.
OpsDataStore uniquely ties together the monitored items and their metrics from APM tools like AppDynamics, Dynatrace, and ExtraHop with the metrics and monitored items from the cloud platforms like VMware vSphere and AWS. This is shown in the diagram below.
OpsDataStore Relationships across Tools and Platforms
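As a concrete illustration of relating monitored items across tools, here is a minimal Python sketch. The tool names, field names, and records are invented for illustration and are not OpsDataStore’s actual data model; the underlying idea is simply that metrics from an APM tool and a virtualization platform become far more useful once grouped by the entity they describe.

```python
from collections import defaultdict

# Hypothetical metric records from two different tools. The sources,
# hosts, metric names, and values are all illustrative.
apm_metrics = [
    {"source": "AppDynamics", "host": "web-01", "metric": "response_time_ms", "value": 240},
    {"source": "AppDynamics", "host": "web-02", "metric": "response_time_ms", "value": 310},
]
platform_metrics = [
    {"source": "vSphere", "host": "web-01", "metric": "cpu_ready_ms", "value": 125},
    {"source": "vSphere", "host": "web-02", "metric": "cpu_ready_ms", "value": 890},
]

def relate(*metric_feeds):
    """Group metrics from every feed by the entity they describe."""
    by_entity = defaultdict(list)
    for feed in metric_feeds:
        for record in feed:
            by_entity[record["host"]].append(record)
    return dict(by_entity)

related = relate(apm_metrics, platform_metrics)
# Each host now carries metrics from both the APM tool and the platform,
# so a slow response time can be viewed next to hypervisor contention.
for host, records in sorted(related.items()):
    print(host, [(r["source"], r["metric"], r["value"]) for r in records])
```

Once application and infrastructure metrics share a common key, the “blame game” becomes a join, not an argument.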
#3 – No Business-Focused Monitoring Strategy
Gartner recommends a “business focused IT Operations Management monitoring strategy”. In a nutshell, this means making sure that IT Operations collects the metrics about the behavior of the applications and the infrastructure that measure what the business cares about. A great summary of the most important metrics is in Google’s book, “Site Reliability Engineering”, which advises focusing upon the Four Golden Signals. The basic point is that the business cares about latency (how long is it taking), traffic (how much work is getting done per unit of time), errors (are things failing), and saturation (is a lack of capacity causing contention).
The key is that all four signals need to be collected across the stack, which by definition will require many tools, which leads back to the need for a big data back end and a single pane of glass.
Google’s Four Golden Signals
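The Four Golden Signals can be made concrete with a small sketch. Assuming a hypothetical window of request samples (the field names and values here are invented for illustration), each signal reduces to a simple calculation:

```python
import statistics

# A toy one-minute window of request samples; fields are illustrative.
requests = [
    {"latency_ms": 120, "ok": True},
    {"latency_ms": 340, "ok": True},
    {"latency_ms": 95,  "ok": False},
    {"latency_ms": 210, "ok": True},
]
window_seconds = 60
cpu_utilization = 0.87  # hypothetical saturation reading for the same window

signals = {
    # Latency: how long the work is taking.
    "latency_ms_median": statistics.median(r["latency_ms"] for r in requests),
    # Traffic: how much work is getting done per unit of time.
    "requests_per_second": len(requests) / window_seconds,
    # Errors: the fraction of requests that are failing.
    "error_rate": sum(1 for r in requests if not r["ok"]) / len(requests),
    # Saturation: how close a constrained resource is to its limit.
    "saturation": cpu_utilization,
}
print(signals)
```

In practice each signal would be computed per service and per tier, which is exactly why all four must be collected across the stack rather than inside any single tool.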
#4 – Constant Fire Fighting
The core cause of always being in fire-fighting mode is that IT problem-resolution tools and processes are usually entirely reactive (wait for the fire, then try to put it out). This reactive posture is created by focusing upon event management (just waiting for the event is reactive), and then using log analytics to attempt to troubleshoot the event and find out what happened.
A far better approach is to assemble the metrics in the Four Golden Signals, relate them across tools and platforms, and run analytics on top of that platform of related metrics, so that problems can be anticipated and prevented from occurring at all.
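As a rough sketch of what anticipating problems from metrics can look like (this is a generic z-score baseline, not any particular vendor’s analytics), a new sample can be compared against the mean and standard deviation of its own recent history instead of a hand-set threshold:

```python
import statistics

def anomalous(history, latest, n_sigma=3.0):
    """Flag a sample that strays more than n_sigma standard deviations
    from the mean of its own recent history (a learned baseline,
    not a manually maintained threshold)."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > n_sigma * stdev

# Hypothetical CPU-ready readings creeping up ahead of a user-visible slowdown.
baseline = [10, 12, 11, 9, 13, 10, 12, 11]
assert not anomalous(baseline, 14)   # within normal variation
assert anomalous(baseline, 60)       # flagged before users feel it
```

Because the baseline is learned from the data, it adapts per metric and per entity, which is what makes this approach tractable across thousands of related metrics.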
The continuum from event driven operations to metric driven operations to relationship driven analytics is shown in the diagram below.
Maturing from Events and Metrics to Relationship Driven Analytics
#5 – Tool Sprawl
Gartner categorizes tools as IT Infrastructure Management, Network Performance Management and Diagnostics, Application Performance Management, Artificially Intelligent IT Operations, and Digital Experience Management.
Unfortunately, most enterprises own multiple tools in each of the above categories due to changing requirements over time, the diversity of the environment, distributed purchasing authority, and the tendency to attach tool decisions to projects that roll out major new systems and applications.
OpsDataStore can really help with a tool consolidation project, as the value of the metrics from each tool becomes readily apparent as soon as those metrics are combined with the metrics from the other tools and platforms in OpsDataStore.
The OpsDataStore Tool and Platform Ecosystem
#6 – A Tool-Centric Approach to Monitoring
Most IT organizations approach monitoring from the perspective of making sure that the toolset covers the estate of applications and infrastructure. This is by its nature a tool-centric approach.
A far better approach is to work with the business to determine the results or outcomes that IT Operations (and Application Operations) is expected to deliver to the business and to then structure a tool portfolio that delivers these results to the business.
It is also the case that each monitoring tool is designed to meet the needs of the team that purchased it and uses it on a daily basis. Tools are not designed to meet the needs of constituents outside of IT Operations, which is why the ability to deliver the Four Golden Signals in custom dashboards built in BI tools like Tableau is essential.
#7 – Death by a Hundred (Zombie) Metrics
Gartner reports that the average enterprise collects and tracks over 1,500 metrics across the IT environment. This is in addition to the terabytes of log data collected by log analysis solutions.
It is therefore clear that for many IT organizations, tool sprawl leads to metric sprawl. And metric sprawl leads to manually maintained thresholds, which in turn feed floods of false-alarm events into an event management system that is supposed to weed the false alarms out.
A far better approach is to focus upon the Four Golden Signals in #3 above and to make this set of important metrics across the stack useful and relevant in a big data back end like OpsDataStore, which not only consolidates the metrics but also tracks the relationships among them and performs self-learning anomaly analytics across the set of related metrics.
#8 – Auto-Renewing Maintenance/Subscriptions
This issue can be generalized into an observation that many vendors, especially legacy vendors, do not exactly have customer-friendly pricing and licensing policies. One unique advantage of OpsDataStore is that our pricing is very customer friendly, and we enable our customers to pick the best-of-breed tools, with the required functionality and pricing/licensing policies, to plug into OpsDataStore.
#9 – Checklist Mentality
“Does it tick all of the boxes?” Even if it does, does it enable IT Operations to deliver the level and quality of service that the business demands?
Since no tool will in fact tick ALL of the boxes, this approach is destined to guarantee tool sprawl which in turn guarantees hundreds of zombie metrics.
A far better approach is to map the requirements of the business into the Four Golden Signals. In most cases, IT will find that the performance and throughput of the key transactions and applications are what the business cares most about, and that all other tool functionality is a secondary consideration.
#10 – Only Shortlisting Magic Quadrant Leaders
It is rather rich and ironic for Gartner to highlight this issue. Gartner retired the Magic Quadrants for IT Operations Management and for Event Correlations and Analytics, and has not yet produced new Magic Quadrants for IT Infrastructure Management, AIOPS, and Digital Experience Monitoring. Customers therefore face two problems: entire categories of tools with no Magic Quadrant at all, and deciding when to pick an innovative new vendor.
In defense of Gartner, it is clear that ITIM, AIOPS, and DEM are all undergoing very rapid change, and that AIOPS and DEM in particular are very new and not yet fully established product categories.
It is clear that the pace of innovation across the stack is creating the need for new monitoring vendors, and forcing rapid evolution (and heavy investment) upon existing ones. Consider the diagram below: only nimble, and sometimes new, monitoring vendors need apply. By the time a Magic Quadrant exists, the leaders in that quadrant are likely already being disrupted by new vendors solving new problems that the leaders are overlooking.