Automated Alert Grouping: Resolve incidents faster with richer context
With Ashutosh Dwivedi, Ashutosh Pandey, Sachin Adlakha, Suryakant Singh,
Vishnu Parimi, & Anusha Jha
According to a study, an average organization logs about 1,200 IT incidents per month, of which 5 are critical. As digital infrastructure grows more complex, a single incident is likely to impact multiple resources and services, generating a flood of alerts. The mean time to repair for critical incidents is 5.81 hours, and it drops when there are fewer incidents to manage in the first place. On average, a further 7.23 hours are spent on root cause analysis, which is successful 65% of the time.
Automated Alert Grouping, powered by Freddy, intelligently groups all related alerts and attaches them to the root incident. As a result, DevOps teams work on fewer incidents, cut through alert noise quickly, identify the key issues behind an incident faster, and act with confidence to prevent service degradation or outages.
What are you spending time on – sorting or solving?
Modern digital infrastructure is complex, with numerous interdependent components across infrastructure, network, application code, and databases. Because of this interconnection, a routine issue with one component often snowballs into a major incident.
To illustrate this, let’s take an example. Bob wants to purchase a pair of shoes at Footwear.com but is unable to complete the transaction due to a delay in loading the checkout page. Footwear.com’s IT team receives multiple alerts related to the checkout page delay: a latency alert from the web server hosting the checkout page, a failed database requests alert indicating that the checkout page application is unable to fetch Bob’s shipping information, and a high memory usage alert from the database caused by an unscheduled background process. All these alerts trace back to the same root problem: high memory utilization on the database, which leaves it unable to respond to queries from applications.
In the absence of Automated Grouping, the DevOps team must triage each of these alerts individually and work out that they belong to the same incident, spending valuable time that could have gone into root cause analysis and resolving the incident.
Alert2Vec: Freddy’s secret sauce for automated alert grouping
Freddy, Freshservice’s AI engine, uses Alert2Vec to group alerts related to an incident. Alert2Vec combines two techniques: aggregation and time-based correlation. Aggregation uses alert attributes such as resource name and metric to group related alerts. Time-based correlation uses a machine learning model that looks for consistent alert patterns across similar incidents that have occurred in the past.
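To make the aggregation idea concrete, here is a minimal sketch of attribute-based grouping. It is not Freshservice's implementation: the `Alert` fields, the `aggregate_alerts` helper, and the 300-second window are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert record; the field names are illustrative only.
    alert_id: str
    resource: str
    metric: str
    timestamp: float  # seconds since some reference point


def aggregate_alerts(alerts, window_seconds=300):
    """Group alerts that share (resource, metric) and arrive close together in time."""
    by_key = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_key[(alert.resource, alert.metric)].append(alert)

    groups = []
    for key_alerts in by_key.values():
        current = [key_alerts[0]]
        for alert in key_alerts[1:]:
            # Start a new group when the gap to the previous alert exceeds the window.
            if alert.timestamp - current[-1].timestamp > window_seconds:
                groups.append(current)
                current = [alert]
            else:
                current.append(alert)
        groups.append(current)
    return groups
```

In this sketch, two memory alerts from the same database that arrive 100 seconds apart land in one group, while a third that arrives 15 minutes later starts a new one.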
The basic premise of Alert2Vec is that alerts that frequently occur together are correlated and are very likely caused by the same underlying issue. Grouping them into a single incident lets the agent uncover the issue quickly, resulting in less noise and a lower MTTR.
The Alert2Vec ML algorithm trains on historical alert data to create vector representations (or embeddings) of those alerts in a multi-dimensional space. Historical alerts that are similar (consistently occurring within the same time window, triggered from dependent resources, or triggered from the same source) sit close together in this space, as measured by the distance between their vectors, and form clusters. When new alerts are triggered, the Alert2Vec algorithm converts them into their vector representations. Alerts related to the same incident share similar context (e.g., occurring within the same time window) and hence are converted to vectors that fall within the same cluster and are grouped together.
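The "occur together" premise can be pictured as a simple co-occurrence count. The sketch below is an illustration only, not the Alert2Vec training procedure: the `cooccurrence_counts` helper, the fixed 300-second buckets, and the alert type names are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(events, window_seconds=300):
    """Count how often two alert types fire within the same fixed time window.

    `events` is a list of (timestamp_seconds, alert_type) tuples.
    """
    # Bucket alert types into fixed, non-overlapping windows.
    buckets = {}
    for timestamp, alert_type in events:
        buckets.setdefault(int(timestamp // window_seconds), set()).add(alert_type)

    counts = Counter()
    for types in buckets.values():
        # Every unordered pair of types seen in the same window co-occurred once.
        for pair in combinations(sorted(types), 2):
            counts[pair] += 1
    return counts
```

Pairs with high counts across many historical windows are the candidates for grouping. Fixed buckets are the simplest choice here; a sliding window would avoid splitting alerts that straddle a bucket boundary.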
Referring to the example above, a spike in memory utilization on the database due to an unscheduled background process led to several related alerts (failed database requests, high latency, and checkout page delays) on the upstream services. Alert2Vec takes the historical pattern of these alerts and analyzes their content (same resources, integration names, etc.) as well as their co-occurrence patterns to group them together. The DevOps engineer responsible for this service can then look at all the alerts grouped within a single incident to quickly understand and fix the high memory utilization on the downstream resource (the database) that is impacting the upstream services.
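The step of placing a new alert's vector into an existing cluster could look roughly like the sketch below. The source does not say which distance measure or clustering method Alert2Vec uses, so cosine similarity, the nearest-centroid rule, the 0.8 threshold, and the toy three-dimensional vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_cluster(vector, centroids, threshold=0.8):
    """Return the name of the most similar centroid, or None if none reaches the threshold."""
    best_name, best_score = None, threshold
    for name, centroid in centroids.items():
        score = cosine_similarity(vector, centroid)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical cluster centroids learned from historical alerts.
centroids = {
    "db_memory_incident": [0.9, 0.1, 0.0],
    "network_incident": [0.0, 0.2, 0.9],
}
new_alert_vector = [0.8, 0.2, 0.1]  # e.g., a checkout latency alert
cluster = assign_to_cluster(new_alert_vector, centroids)
```

With these toy values the new alert lands in the hypothetical `db_memory_incident` cluster, while a vector that resembles neither centroid returns `None` and would open its own incident.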
Automated Grouping: An evolving model that fits your organization
Each organization has a unique digital infrastructure with equally unique alert behavior. So, while the Freddy ML engine comes pre-loaded with innate intelligence, you can train it to recognize the underlying patterns driving this behavior by manually attaching an alert to an open incident, which establishes a correlation. Similarly, you can point out an incorrect correlation by manually detaching a Freddy-attached alert from an incident. This continuous learning builds Freddy’s repository of alert and incident patterns unique to an organization.
The algorithm then uses these patterns to suppress unimportant alerts, group together notifications that are indicative of an issue, and attach incoming alerts to open incidents. When suitably trained, Freddy improves the signal-to-noise ratio and reduces noise by up to 50%. Minimal noise coupled with rich incidents enables NOC and DevOps teams to make fast and effective decisions.
Note: Automated Grouping using the Freddy ML algorithm is available only to accounts with a large number of notifications and resources. Freshservice will notify you if you qualify to use this feature. You will then need to ‘enable’ Automated Grouping from the Admin pane in the tool.
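The manual attach and detach feedback described above can be pictured as nudging a learned correlation score up or down. This is a toy model, not how Freddy stores or updates its patterns: the `CorrelationFeedback` class, the starting score of 0.5, the step size, and the grouping threshold are all hypothetical.

```python
class CorrelationFeedback:
    """Toy store of correlation scores between an alert signature and an incident signature."""

    DEFAULT_SCORE = 0.5  # assumed neutral starting point for unseen pairs

    def __init__(self, step=0.2):
        self.step = step
        self.scores = {}

    def _update(self, alert_sig, incident_sig, delta):
        key = (alert_sig, incident_sig)
        new_score = self.scores.get(key, self.DEFAULT_SCORE) + delta
        # Keep the score within [0, 1].
        self.scores[key] = min(1.0, max(0.0, new_score))

    def attach(self, alert_sig, incident_sig):
        """A user manually attached the alert: strengthen the correlation."""
        self._update(alert_sig, incident_sig, self.step)

    def detach(self, alert_sig, incident_sig):
        """A user detached an auto-attached alert: weaken the correlation."""
        self._update(alert_sig, incident_sig, -self.step)

    def should_group(self, alert_sig, incident_sig, threshold=0.6):
        score = self.scores.get((alert_sig, incident_sig), self.DEFAULT_SCORE)
        return score >= threshold
```

In this sketch a single manual attach is enough to push a pair over the grouping threshold, and repeated detaches push it back below. A real system would likely weigh feedback against far more historical evidence.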