Businesses spend huge amounts on firefighting activities, and it is crucial to resolve these issues as fast as possible because they directly impact productivity. ITIL is a framework that includes a set of best practices for service support and delivery. Problem Management is one such ITIL process, aimed at preventing incidents from occurring. Businesses often confuse it with the Incident Management process because of their similarities, and many organizations do not have a Problem Management process at all. Incident Management deals with resolving issues as soon as possible and restoring services to normal, whereas the primary goal of Problem Management is to provide a permanent resolution and prevent these incidents from recurring with the help of the Change Management process.
It is fundamental to understand the differences between these two before implementing either process. Problem Management helps businesses reduce costs by identifying and preventing critical incidents. This means fewer service interruptions and therefore less productivity loss. Businesses striving for service excellence must deliver seamless support and offer extraordinary service to their users. Problem Management is part of the ITIL service operation stage of the lifecycle. It is closely aligned with other ITIL modules such as Change Management and Release Management in order to plan and deploy a permanent fix for recurring incidents. Most organizations do not understand the importance of Problem Management when they implement ITIL, but it is important to understand the business value and benefits of this process.
This Problem Management guide takes a detailed look at the objective, scope, process flow, techniques, benefits, feature checklist and KPIs associated with the Problem Management process, along with suitable examples.
Problem Management is an IT Service Management (ITSM) process that prevents problems and incidents from occurring and resolves known problems with a permanent solution. Recurring incidents give rise to a problem. The objective of Problem Management is to diagnose the root cause of repeated incidents, so Root Cause Analysis (RCA) is an important step in the process. Incident Management aims at restoring services as fast as possible. If the same incident occurs frequently and has a high impact, it is moved to the Problem Management team, which analyses the root cause and finds a solution. Problem Management provides either a workaround for the problem or a permanent solution.
Problem Management uses a common database to track problems. It starts with problem diagnosis and tries to provide a workaround or a permanent solution. A known error database (KEDB) is maintained for open problems. The KEDB is used to track known issues, which often involve changes to Configuration Items (CIs). Problem Management and Configuration Management exchange CI-related details. Whenever a problem is reported, it is vital to check the CI involved and update it if needed in order to resolve the issue permanently. Consistent information across these modules is important for faster resolution of incidents and problems, and for timely deployment. To remain competitive, businesses must have speed to market and agility.
Uninterrupted service is a dream come true for any service desk. In reality, issues do arise, and it is the responsibility of the service desk to mitigate the impact and respond as fast as possible. At the same time, end users' expectations have increased, and they demand easily accessible service desk touchpoints. The primary objective of Problem Management is to identify and troubleshoot repeating incidents by finding the root cause. It aims to proactively prevent problems from occurring and to find a workaround or a permanent solution. By being proactive, Problem Management reduces the number of incidents. It also reduces the long-term cost associated with firefighting activities and service downtime. Over time, end user satisfaction improves and the organization realizes real business and customer value.
Identify the root cause of repeated incidents
Provide a short-term workaround for known problems
Provide a permanent resolution to frequently occurring incidents
Deliver proactive service support
Problem - One or more repeated incidents with an unknown cause. A problem is the root cause of one or more incidents
Incident - An unplanned interruption or service disruption that affects normal operation and quality of service. The incident is the effect and the problem is its cause
Known error - A problem with a known root cause but no permanent solution. A workaround is provided for a known error
Root cause - The underlying cause of a problem. Root cause analysis (RCA) is a method to identify the actual root cause so that it can be eliminated permanently
Workaround - A short-term, temporary solution to a known error
KEDB - The Known Error Database is a common repository that holds all known errors. The KEDB is checked whenever incidents occur frequently.
Problem Management belongs to ITIL service operation. It interacts with a number of other processes in the ITIL service lifecycle. Within service operation, it interacts closely with Incident Management to address repeated incidents and prevent major incidents from occurring. In service design, problem history is crucial input for Availability Management. Knowledge Management, which belongs to service transition, is used to record known errors and their workarounds as knowledge base articles. While performing RCA, Problem Management also consults Knowledge Management to look for a potential solution that is already available. Finally, proactive Problem Management contributes to Continual Service Improvement by improving service quality.
Problem Management is crucial at every stage of the ITIL service lifecycle. Therefore, it is a costly mistake to ignore this process while setting up ITIL processes at your organization. While choosing a service desk solution, ensure that it supports all the features needed to perform Problem Management.
ITIL Problem Management follows a sequence of steps to identify, diagnose and resolve problems. There is a predefined framework for executing Problem Management. This process flow helps organizations do Problem Management the right way without confusing it with Incident Management. The process flow covers the following steps:
Problem detection
Problem logging
Investigation and diagnosis
KEDB
Resolution
The first step is to detect the problem, and this can be done in a variety of ways. The Tier I team escalates incidents that it is unable to resolve. A problem can also be identified by reviewing incident reports. When one or more incidents occur with an unknown cause, a problem record is created. In certain cases, a reported incident is clearly associated with a known problem. If a problem record does not exist, create a new one and link the related incidents to it. Problem detection saves a lot of resources by identifying the problem at the right time so that diagnosis gets easier. The symptoms of a problem include
Escalation from the Level I team when it is unable to resolve incidents
Frequently repeating incidents with similar conditions
Incidents reported by multiple people across the organization
Proactive identification of problems based on patterns and alerts from monitoring tools
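The "frequently repeating incidents with similar conditions" symptom can be sketched in a few lines of Python. This is only an illustration: the incident fields (`category`, `ci`), the sample records and the threshold of three are assumptions made for the example, not part of any ITIL specification or service desk product.

```python
from collections import defaultdict

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"id": "INC-1", "category": "email", "ci": "mail-server-01"},
    {"id": "INC-2", "category": "email", "ci": "mail-server-01"},
    {"id": "INC-3", "category": "network", "ci": "router-07"},
    {"id": "INC-4", "category": "email", "ci": "mail-server-01"},
]


def find_problem_candidates(incidents, threshold=3):
    """Group incidents by (category, CI) and return the groups that
    repeat at least `threshold` times. The threshold is an assumed value."""
    groups = defaultdict(list)
    for incident in incidents:
        key = (incident["category"], incident["ci"])
        groups[key].append(incident["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= threshold}


for (category, ci), ids in find_problem_candidates(incidents).items():
    print(f"Problem candidate: {category} on {ci}, linked incidents: {ids}")
```

A real service desk would group on richer conditions (time window, symptoms, location), but the idea is the same: repeated incidents that share conditions are linked to one problem record.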
Every detected problem has to be logged in a problem record for tracking purposes. It is vital to capture problem details such as problem type, description, associated incidents, affected CIs from the CMDB, category, user information, status, resolution and closure. This information is needed to tag known errors and manage them in a database. Every problem record also carries two prioritization attributes: impact and urgency. Impact refers to the number of users and CIs affected by the problem. Urgency refers to how quickly a resolution is needed. Based on these two factors, a Service Level Agreement (SLA) is set, which determines the due date for problem resolution. This information is crucial for the Problem Management team to perform root cause analysis. A service desk ticketing system enables problem logging by capturing all relevant details using a form template. Generating problem reports from this data becomes easier when the database is complete.
Prioritization and categorization of problem records help in picking the next problem record for investigation. During investigation, stakeholders discuss possible root causes. Problem diagnosis is done once RCA is completed. RCA is carried out using one of the various Problem Management techniques that are available. Investigation involves cross-team collaboration, and diagnosis is performed by the Problem Research team. While investigating a problem record, it is recommended to search the KEDB first to find out whether it is a known problem.
After diagnosis, either the problem record is added to the Known Error Database (KEDB) or a permanent solution is delivered to close the record. Investigation and diagnosis may result in a workaround that solves the issue temporarily, and services are restored with its help until a permanent resolution is found. As soon as a workaround is found, it is added to the KEDB. It is important to keep the KEDB up to date. Whenever an incident or problem arises in the future, the service desk agent refers to this database first to check for a possible workaround.
Problem resolution involves other ITIL modules such as Change Management and Release Management. In order to fix the problem permanently, a new change has to be raised. Change Management handles the evaluation, planning and execution of changes. The Problem Management team raises the request and submits a Request for Change (RFC). The change team evaluates the impact and carries out the planning. A suitable change type is used, such as standard, normal or emergency. Release Management is responsible for the actual deployment of approved changes. This involves packaging the change and testing it in a sandbox environment before it is rolled out to the production environment. It is necessary to document the resolution provided to the user and to associate the problem record with the respective change and release records. Closure can be handled through automation.
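As a rough illustration of how impact and urgency can drive the SLA due date, the sketch below uses a simple lookup matrix. The priority labels, the matrix values and the resolution targets are all assumptions chosen for the example; each organization defines its own.

```python
from datetime import datetime, timedelta

# Assumed mapping of (impact, urgency) to priority; not prescribed by ITIL.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P4",
}

# Assumed resolution targets per priority.
SLA_TARGETS = {
    "P1": timedelta(days=2),
    "P2": timedelta(days=5),
    "P3": timedelta(days=10),
    "P4": timedelta(days=20),
}


def due_date(impact, urgency, logged_at):
    """Return the priority and the SLA due date for a problem record."""
    priority = PRIORITY_MATRIX[(impact, urgency)]
    return priority, logged_at + SLA_TARGETS[priority]


priority, due = due_date("high", "medium", datetime(2024, 1, 1, 9, 0))
print(priority, due.isoformat())
```

Keeping the matrix and the targets as plain data makes them easy to review and adjust without touching the logic.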
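The "check the KEDB first" step can be pictured with a minimal in-memory sketch. The `KnownError` and `KEDB` classes, their fields and the keyword matching are hypothetical and deliberately simplistic; a real KEDB lives in the service desk tool and offers proper search.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    # All fields are illustrative assumptions, not a standard schema.
    error_id: str
    symptom_keywords: frozenset
    root_cause: str
    workaround: str


class KEDB:
    """Toy known error database with naive keyword lookup."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def search(self, description):
        # Match when any symptom keyword appears as a word in the description.
        words = set(description.lower().split())
        return [e for e in self._entries if e.symptom_keywords & words]


kedb = KEDB()
kedb.add(KnownError(
    error_id="KE-001",
    symptom_keywords=frozenset({"email", "outlook"}),
    root_cause="Corrupt mail profile after client update",
    workaround="Recreate the mail profile",
))

for match in kedb.search("Outlook cannot send email after update"):
    print(match.error_id, "->", match.workaround)
```

If the search returns nothing, the agent proceeds with a new investigation; if it returns a match, the documented workaround is applied and the incident is linked to the existing known error.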
There are different Problem Management techniques available. Let us discuss some of the popular techniques that can be implemented easily.
Discussing the problem statement and possible causes with key stakeholders. This involves group discussion and encourages full-house participation.
Round robin discussion that involves all members
Generates high volume of ideas in a shorter time span
Faster method and produces diverse set of ideas
A logical approach to problem-solving that starts with problem definition and elaboration. Possible causes are vetted, then tested, and finally the true cause is identified. This is a systematic four-phase Root Cause Analysis (RCA) method for complex problem analysis. Kepner Tregoe (KT) is applicable to both proactive and reactive problem management. It involves problem analysis as well as potential problem analysis.
Situation Appraisal
Problem analysis
Decision analysis
Potential problem analysis
Possible Causes | Evidence | Result |
---|---|---|
Memory issue | Memory leakage | Cause |
Server speed issue | Log files | Cause |
Data retrieval issue | Configuration issue | Not a cause |
Cause and effect analysis describes the relationships between a problem and its possible causes. This method is also known as the Ishikawa or fishbone diagram, and it analyses the primary and secondary causes of a problem. Causes fall into categories such as people, product, process and partners. For example, a network outage might have causes such as router malfunction, configuration error or natural disaster. This method is used for reactive problem management, so it is important to define the problem statement precisely.
Lists all possible causes for an effect / situation
Suitable for complex problem analysis
Includes many possible causes and contributing factors
Discuss action items to improve the process
The 5 whys strategy is a simple technique for finding the root cause by asking successive “why” questions. It is one of the Six Sigma techniques used to identify the actual root cause of a problem and to take appropriate countermeasures to prevent it from occurring in the future. It helps in understanding the relationships between various root causes. However, it is important to frame the questions properly in order to arrive at the actual root cause. Asking "why" five times is just a rule of thumb, and the number varies depending on the complexity of the problem.
Reactive Problem Management reacts to recurring incidents by analysing the root cause and providing a long-term fix. It is crucial to identify these repeating incidents as problems. Incident Management aims at restoring services as fast as possible and therefore often misses the underlying cause of incidents. The Incident Management team transfers such incidents to the Problem Management team for detailed research and analysis. This handover is crucial, and its timing matters in order to maintain service integrity.
The Incident Management team should pass on information such as incident category, affected CIs, criticality and impact. Reactive Problem Management consumes this information, performs a detailed RCA, submits an RFC and updates the problem record in the KEDB. Reactive Problem Management starts with checking incident patterns, which includes reviewing past incidents in the service desk.
Problem control – Happens during the investigation phase discussed above. It deals with root cause analysis and identifying the actual cause of the problem. It converts problems into known errors.
Error control – Happens during the resolution phase. It involves reducing the known errors in the KEDB by finding permanent solutions for them.
Proactive Problem Management acts as a gatekeeper by continuously identifying potential issues and avoiding them. It does not wait for incidents to occur and aims to prevent incidents and problems from occurring in the future. This process is a preventive technique that involves big data and trend analysis. Patterns are identified from historical incident and problem data so that potential issues can be avoided. This requires analysis of past incident data and major events, asset health checks and situational appraisal. Kepner Tregoe analysis is an example of a proactive Problem Management technique that deals with data analysis. Examples of proactive activities include maintenance activities and periodic audits.
Reduces firefighting activities
Prevents major IT failures and thus acts as a gatekeeper
Improves efficiency and maintains productivity
Problem Management usually starts once Incident Management is completed. A problem record can be created either from one or more incidents or on its own. Problem Management analyses recurring incidents and finds their root cause. Incident Management shares information such as the incident description, impacted users, impacted assets and criticality. Problem Management uses this information to identify whether it is a known error or not. Therefore, Incident Management acts as a prerequisite to Problem Management in most cases.
If Problem Management cannot deliver a permanent solution on its own, Change Management follows to execute the necessary changes. The RCA from Problem Management is crucial for Change Management to understand the associated risk and urgency. The Change Management process delivers a permanent fix by rolling out new changes. Problem Management simplifies the change evaluation phase by providing a detailed RCA. Change Management decides the change schedule depending on problem impact and criticality. The Change Advisory Board (CAB) involves relevant stakeholders from the Problem Research team to assess the planned change. Known errors or known problems result in a Request for Change (RFC). Relevant problems are associated with the change record for better execution.
Recurring incidents call for an asset health check in order to find the cause. While Problem Management owns root cause analysis, it is essential to work closely with the Configuration Management team to understand asset details, the asset owner, interdependencies with other assets, impact and vendor-related information. With the help of these details, the Problem Research team suggests the next steps, i.e. to execute a new change on the Configuration Item (CI) or to provide a suitable workaround. These two modules are closely connected, and the problem analysis phase revolves around CIs in order to minimize the impact.
Problem Management leverages Knowledge Management by accessing the central repository and solution database. Knowledge base articles are fundamental to trend analysis. For both proactive and reactive Problem Management, knowledge base articles help in speedy resolution. Relevant knowledge articles are associated with the problem record. Known errors, along with their workarounds, are stored in the knowledge base as well, so the KEDB is a subset of the broader Knowledge Management system. After a permanent solution is found, it is stored in Knowledge Management for future reference.
Learn from past incidents. Analyze patterns and eliminate major incidents with data analysis. This saves a lot of time and resources.
Integrate Problem Management with other ITIL modules for information sync and consistency. Associations across Incident, Problem and Change records help in easier reference.
Assign a dedicated Problem Manager with clearly defined roles and responsibilities to execute the Problem Management process as per ITIL standards. The Problem Manager acts as a liaison between the Incident Manager and the Change Manager.
Plan an effective communication strategy across Change Management, Incident Management and Configuration Management. As soon as a workaround is found, it is essential to communicate it to the related incident owners, who in turn communicate it to affected end users. To be effective, leverage the automation capabilities available in the service desk tool.
Understand both the proactive and the reactive approach to Problem Management. Both are necessary, and each is useful in certain scenarios, so it is important to understand the differences between the two approaches and their process flows.
Understand that Problem Management has its own SLA and that it is important to resolve problems before the due date. The SLA is decided based on priority. Incident priority is transferred to problem priority.
Learn various Problem Management techniques to find out the actual root cause.
Don’t treat Problem Management as the same thing as Incident Management. Both ITIL processes work hand in hand, but they are entirely different. That said, the Problem Management team does learn from incident data and past records.
Don’t reinvent the wheel. The first step is to check the Known Error Database (KEDB), which is a repository of known problems along with their workarounds. This is integral to Problem Management.
Don’t forget to document the resolution in detail. The problem resolution is used by multiple teams, such as Incident Management, to communicate with end users. Therefore, be thorough in documentation.
Don’t ignore any step in Problem Management process flow. Follow every step as described above.
No. of problem records reported
Average resolution time
Percentage of problems resolved within SLA
Total no. of known errors
Problem backlog - No. of problems unresolved
Total no. of Incidents associated to problems
Percentage of problems with identified root cause
Percentage of problems with a workaround
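Several of these KPIs can be computed directly from problem records. The following is a minimal sketch under stated assumptions: the record fields and sample data are invented for illustration, and the "resolved within SLA" percentage is calculated over resolved problems only, which is one possible definition among several.

```python
from datetime import datetime

# Hypothetical problem records; field names and values are illustrative.
problems = [
    {"opened": datetime(2024, 1, 1), "resolved": datetime(2024, 1, 4),
     "due": datetime(2024, 1, 5), "root_cause_found": True, "has_workaround": True},
    {"opened": datetime(2024, 1, 2), "resolved": datetime(2024, 1, 10),
     "due": datetime(2024, 1, 8), "root_cause_found": True, "has_workaround": False},
    {"opened": datetime(2024, 1, 3), "resolved": None,
     "due": datetime(2024, 1, 12), "root_cause_found": False, "has_workaround": True},
    {"opened": datetime(2024, 1, 5), "resolved": datetime(2024, 1, 6),
     "due": datetime(2024, 1, 9), "root_cause_found": True, "has_workaround": False},
]


def problem_kpis(problems):
    """Compute a few Problem Management KPIs from a list of records."""
    if not problems:
        return {}
    resolved = [p for p in problems if p["resolved"] is not None]
    total_days = sum((p["resolved"] - p["opened"]).days for p in resolved)
    within_sla = sum(1 for p in resolved if p["resolved"] <= p["due"])
    return {
        "reported": len(problems),
        "backlog": len(problems) - len(resolved),
        "avg_resolution_days": total_days / len(resolved) if resolved else None,
        "pct_resolved_within_sla": 100 * within_sla / len(resolved) if resolved else None,
        "pct_with_root_cause": 100 * sum(p["root_cause_found"] for p in problems) / len(problems),
        "pct_with_workaround": 100 * sum(p["has_workaround"] for p in problems) / len(problems),
    }


for name, value in problem_kpis(problems).items():
    print(f"{name}: {value}")
```

In practice these figures would come from service desk reports; the sketch only makes the definitions behind each KPI explicit.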
The Problem Manager role does not exist in many organizations, but it is important for companies to recognize the value of this ITIL process. The Problem Manager acts as a middleman between Incident Management and Change Management.
Responsible for Problem Management process
Lifecycle management of problems
Maintains the quality and integrity of problem records
Acts as a liaison across different teams such as Incident Management, Change Management
Defines and maintains the process flow
Continuous review and improvement of Problem Management process
Coordinates between various stakeholders to identify the root cause of a problem and find a workaround or solution
Prevents incidents from occurring
Responsible for production and maintenance of KEDB
Ensures that the right resources are available to investigate a problem and identify its root cause
Trend analysis of historical incident data
Ensures problems are resolved within SLA
Problem Manager logs RFC when necessary
Periodical reports on performance of Problem Management team and cost benefit analysis of RCA
Having discussed the various aspects of Problem Management, it is necessary to highlight the business benefits of Problem Management.
Improved service availability - Proactive Problem Management ensures uninterrupted service and avoids major incidents
Consistent service quality - A high quality service is essential for service excellence
Reduced costs - Major incidents are avoided and subsequent costs are saved
Improved customer satisfaction - Problem Management provides a permanent solution to recurring incidents that improves end user satisfaction
Improved overall productivity - Finding the root cause and fixing an issue permanently ensures seamless business operation