24 May 2018

3 Challenges of Using Unrelated Metrics in Digital War Rooms

Troubleshooters Hunt for Answers Using Events, Forensic Logs, and Unrelated Point Metrics

Most IT operations teams rely on reactive event management systems and after-the-fact, log-based forensics to diagnose performance and availability problems with critical online services and applications. The approach, while almost universally used, has 3 key challenges that are barriers to war room success:

  1. Too many issues to triage at once
  2. Solving problems requires too many separate domain experts
  3. Preventing service impacts is difficult when teams are always reacting

The Root Cause of the Problem: Unrelated Metrics

Digitization (or digital transformation) means more critical business services are implemented in software and are continuously enhanced at an increasing rate. The resulting proliferation of online applications and services drives the acquisition of even more tools to monitor them. The challenge is amplified by the simultaneous and accelerating rate of change in the underlying computing infrastructure, which now spans multiple, dynamically changing environments, including virtualized compute, network and storage.

Recent innovations such as cloud, containers and micro-services complete the trifecta, creating a perfect storm: an ever-accelerating rate of change in both the number and type of monitoring tools required to manage the resulting complexity. To further complicate matters, every separate monitoring tool collects its own data and stores it in its own database, so the resulting metrics are completely unrelated. This creates three significant challenges to fast and efficient problem resolution when something goes wrong:

  1. Triage problems: Too many different tools generate too many separate and distinct events, alerts, logs and performance metrics, making it difficult to quickly pinpoint issues
  2. Manual analysis: Disparate tools and data demand human domain expertise, resulting in large war room teams and time-consuming issue resolution
  3. Reactive vs Proactive: So much time is spent in triage and manual problem solving that it prevents strategic investment in truly forward-looking automated performance analytics

As a result, today’s war room process resembles a ‘Cat and Mouse Hunt’: Each application owner reactively looks for anomalies in their tool’s metrics and then tries to infer whether those application anomalies are in fact causing the problem. Each domain expert in the underlying technology stack reactively looks for anomalies in their tool’s metrics to try to infer whether their area of responsibility is at fault. And nobody knows with certainty what is related to what at the time of the problem, let alone with any foresight.

Attempts to use time-series correlation and after-the-fact statistical analytics are fundamentally limited by the fact that correlation is NOT causation: two metrics can rise and fall together without either one depending on the other. Relation of data beats correlation of events.
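
To make the correlation point concrete, here is a minimal Python sketch. The two metric series are invented for illustration (they are not taken from any real tool) and are assumed to come from systems that share no components, yet both happen to climb over the same six minutes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical one-minute samples from two tools with no shared components.
app_latency_ms = [120, 135, 150, 170, 185, 200]   # application monitoring tool
backup_disk_used_gb = [40, 44, 49, 55, 59, 64]    # unrelated backup system

r = pearson(app_latency_ms, backup_disk_used_gb)
print(f"correlation = {r:.3f}")
```

The coefficient comes out very close to 1, but nothing in the data says the backup system has anything to do with the slow application. A statistical tool looking only at the time series has no way to tell the difference.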

Putting an End to the Cat and Mouse Hunt with Related Data

OpsDataStore is a unique relationship-driven platform that connects the metrics across your tools, platforms and infrastructure. It includes a set of actionable analytics to revolutionize your war room, problem management, and preventative analytics processes.

OpsDataStore’s patent-pending topology mapping and self-learning analytics provide the only real-time solution that continuously captures and maps your applications to their infrastructure with a complete understanding of both relationships and performance over time.

By relating time-based performance metrics to an automatically maintained, continuously accurate topology map, OpsDataStore enables truly proactive IT Operations. The resulting operational insights give technical and business owners a single pane of glass that can proactively report on Service Level Agreements and root cause analysis (RCA), all in one open, extensible platform.
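
The general idea of relating metrics to a topology map can be sketched in a few lines of Python. This is an illustration only, not OpsDataStore's implementation or API: the entity names, the `TOPOLOGY` and `METRICS` structures, the thresholds and both functions are hypothetical, and a single snapshot of readings stands in for metrics over time.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it runs on.
TOPOLOGY = {
    "app:checkout": ["vm:web-01", "vm:db-01"],
    "vm:web-01": ["host:esx-a"],
    "vm:db-01": ["host:esx-b"],
    "host:esx-a": ["datastore:ds-1"],
    "host:esx-b": ["datastore:ds-2"],
    "app:reporting": ["vm:rpt-01"],
    "vm:rpt-01": ["host:esx-c"],
}

# Hypothetical latest readings: entity -> (metric name, value, alert threshold).
METRICS = {
    "app:checkout": ("response_ms", 950, 500),
    "vm:web-01": ("cpu_ready_pct", 2, 10),
    "vm:db-01": ("cpu_ready_pct", 3, 10),
    "host:esx-a": ("cpu_pct", 55, 90),
    "host:esx-b": ("cpu_pct", 60, 90),
    "datastore:ds-2": ("latency_ms", 45, 20),
    "host:esx-c": ("cpu_pct", 97, 90),
}

def related_entities(start, topology):
    """Breadth-first walk: the start entity plus everything it depends on."""
    order, visited, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in topology.get(node, []):
            if dep not in visited:
                visited.add(dep)
                queue.append(dep)
    return order

def related_anomalies(start, topology, metrics):
    """Entities in the start entity's dependency chain that exceed their threshold."""
    hits = []
    for entity in related_entities(start, topology):
        if entity in metrics:
            _name, value, threshold = metrics[entity]
            if value > threshold:
                hits.append(entity)
    return hits

suspects = related_anomalies("app:checkout", TOPOLOGY, METRICS)
print(suspects)
```

In this made-up data, `host:esx-c` is over its threshold but does not appear among the suspects for the checkout application, because the topology shows no relationship between them. The known relationships narrow the search in a way that correlation alone cannot.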

Learn more about the OpsDataStore platform and 5 AIOps Analytics that will streamline your war room and get to the root of issues quickly.
