Excerpt from our Command Center Handbook: Proactive IT Monitoring
by Abdul A. Jaludi
Chapter 10 - Case Study #1
Poorly performing infrastructure command center
Management responsible for the data centers and corresponding command centers across three continents, serving areas that included Europe, the Middle East, and Africa, requested help after the business division heads complained of poor service and unhappy customers. The command center was producing metrics that showed it was meeting its obligation of creating a trouble ticket and notifying support within 15 minutes of each alert, but service disruptions were occurring more frequently. Senior management within the business divisions was receiving an increasing number of complaints from customers who either could not access the firm’s applications or could sign on but found the response time so slow that they quickly gave up. The senior business heads were not happy with IT for several reasons:
1. They were hearing of the problems directly from the customers rather than from Operations.
2. Servers would fail, making the applications running on them unavailable. Management was even more upset after learning that warning alerts had been generated and sent to the command center, but no notifications were made and no action was taken to correct the cause of the alert, leading to a system failure.
3. Application-generated critical alerts had been sent to the command center indicating something wasn’t working properly, but developers wouldn’t hear about those issues until much later, after customers began complaining.
A service agreement had been implemented as a quick fix after a prior complaint by the business divisions about poor service. At that time, it was found that notifications were not being made for a large number of alerts. Rather than performing a detailed examination of the issues, the two sides agreed that a trouble ticket would be created and a notification made within 15 minutes, following contact instructions provided by the support teams.
Initially, there was a noticeable improvement in service, but over time the number of alerts increased. The increase gave the command center staff no time to verify that notifications were being responded to or that alerts were being addressed.
Senior IT management wanted to know why there was a disconnect between the command center staff, whose metrics showed they were meeting their service-level agreements, and the businesses, which were taking a hit on revenue as a result of the high number of outages and poor customer service.
With the business divisions threatening to outsource the data centers and command center operations, data center management needed a detailed analysis of the problem and a permanent solution in place as soon as possible.
Because the command center claimed to meet its goal, the first step was to look at the goal itself to see whether it was aligned with the businesses' objectives.
The command center’s goal was to ensure that, within 15 minutes of receiving each alert, a trouble ticket was created and a notification was sent to support.
The first problem was the goal:
1. It was not a valid goal, but rather a service-level agreement. The command center had no stated goals and only one implied goal: to ensure the service-level agreement regarding trouble tickets and notifications was met.
2. The agreement was not aligned with the businesses' objectives. In fact, the command center had no goals that involved satisfying the clients it served.
3. The command center’s only obligation was to create the trouble ticket and make the notification, period. Whether the notification was never received, was ignored, or went to the wrong person, e-mail, or phone number made no difference to the command center. Resolution of the alert was irrelevant as well. The only obligation was to create a trouble ticket and make the notification within 15 minutes of the alert appearing on the monitoring screens, regardless of whether anyone actually received the notification within that timeframe.
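The gap between the agreement and the outcome can be made concrete with a small sketch. It is only an illustration: the record fields, the function names, and the four sample records are invented for this example and are not figures from the case study. It computes the 15-minute ticketing measure the command center reported alongside a measure of whether support ever acknowledged the alert.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRecord:
    # Hypothetical fields, invented for this illustration.
    minutes_to_ticket: float         # alert raised -> ticket created and notification sent
    minutes_to_ack: Optional[float]  # None means support never acknowledged the alert


def sla_compliance(records: List[AlertRecord], limit: float = 15.0) -> float:
    """Share of alerts ticketed and notified within the agreed limit."""
    return sum(r.minutes_to_ticket <= limit for r in records) / len(records)


def acknowledged_rate(records: List[AlertRecord]) -> float:
    """Share of alerts that anyone in support actually acknowledged."""
    return sum(r.minutes_to_ack is not None for r in records) / len(records)


# Made-up sample: every alert is ticketed in time, only one is ever acknowledged.
sample = [
    AlertRecord(minutes_to_ticket=5, minutes_to_ack=None),
    AlertRecord(minutes_to_ticket=9, minutes_to_ack=None),
    AlertRecord(minutes_to_ticket=12, minutes_to_ack=240),
    AlertRecord(minutes_to_ticket=14, minutes_to_ack=None),
]

print(sla_compliance(sample))     # 1.0
print(acknowledged_rate(sample))  # 0.25
```

With these invented records the agreement is met for every alert while three of the four alerts are never acknowledged, which is the kind of disconnect the business divisions were describing.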
The next part of the analysis was to ask each of the parties involved what they perceived as the most pressing issues. The parties were alert (event) management, incident management, the support teams, and the command center monitoring staff.
Alert (event) management staff concerns:
· Excessive workload—The alert management team was being flooded with requests for new monitoring, threshold modifications, monitoring removal, and a host of other alert-related requests from the support teams and the command center staff.
· Insufficient staffing—Staffing had not kept pace with increases in the monitored environment.
· Outdated tools—Staff processed work requests using Excel spreadsheets and a very cumbersome request process that most other teams had moved away from in favor of an automated self-service process.
In addition to monitoring configuration changes, several team members were devoted to reviewing and approving or rejecting change control records for applications that required monitoring changes. The workload increased substantially as the number of monitored systems increased, requiring staff additions each year just to keep the backlog down to a two-week level.
Incident management staff concerns:
Once notified of an outage, the incident management team did what was needed to restore service. The complaints here were:
· Most calls came directly from the help-desks after customers began complaining, instead of beforehand from the command center.
· Delays in reaching the correct support personnel due to missing or incorrect contact information often added hours to the incident resolution process.
· The command center was slow to respond to requests for operational support.
Support staff concerns:
The application and system support teams both had the same complaints:
· The support teams were being flooded with alert-related e-mails and trouble tickets from the command center. Initially the support staff investigated each of those alerts, but over time they were staying up all night investigating alerts that belonged to someone else, were for development servers and applications, or could be ignored because of improper settings. With no way to tell which alerts were valid, and needing to get some sleep, the support teams began ignoring all alert-related trouble tickets and e-mail notifications, responding only when called directly.
· Requests to the alert management team for alert modifications took weeks to complete or were never made.
· Enabling or removing monitoring was a long, arduous process. For existing systems it was easier to leave monitoring as is and to instruct the monitoring staff to ignore alerts.
· Change requests were often delayed due to outstanding approvals from the alert management team. Numerous calls, e-mails, and escalations by the change creators were often required to get a change record approved before the cutoff window.
> Change control procedures required the alert management team to review and approve every change request where a system or application was being modified so that appropriate alerting modifications could be made.
Monitoring team concerns:
· Excessive number of alerts. Almost seven alerts for every monitored system, for a total of 55,000 alerts per month, were going to the command center monitoring screens. At any point in time there were at least 15 pages of alerts on the monitoring screens, making it impossible for the command center staff to do anything more than open a trouble ticket and make the initial notification.
· Missing or incorrect contact information made it difficult to meet the 15-minute requirement.
· Constant alerts meant the monitoring staff was busy from the moment their shift began until it ended. The environment had become like a factory assembly line, with staff constantly repeating the same steps over and over.
· About 90% of the trouble tickets were closed by support staff with instructions for the monitoring staff to ignore the alert.
· Hundreds of alerts would flood the consoles from scheduled changes, making it easy to miss any valid alerts that appeared during that timeframe.
The main problem identified was the number of alerts (more than 55,000 per month) going to the command center monitoring screens. With 15 pages of alerts on the monitoring screens, it was impossible to determine which alerts were nuisance alerts and which required follow-up actions to prevent an outage or to restore service. Anything that goes beyond the first page becomes hidden, and in all likelihood will lead to an extended outage. Monitoring screens should be empty or close to it at all times. The more than 55,000 alerts signified serious problems with alerting and monitoring.
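A rough calculation shows why the monitoring staff could do little more than open tickets. The sketch below uses the 55,000 alerts per month, the roughly seven alerts per system, and the 15-minute window quoted above. The 30-day month and the even spread of alerts across the day are simplifying assumptions, not figures from the case study.

```python
# Figures quoted in the case study.
ALERTS_PER_MONTH = 55_000
ALERTS_PER_SYSTEM = 7   # "almost seven", so the real system count is somewhat higher
SLA_MINUTES = 15

# Simplifying assumption: a 30-day month with alerts spread evenly around the clock.
DAYS_PER_MONTH = 30

alerts_per_day = ALERTS_PER_MONTH / DAYS_PER_MONTH
alerts_per_hour = alerts_per_day / 24
seconds_between_alerts = 3600 / alerts_per_hour
alerts_per_sla_window = alerts_per_hour * SLA_MINUTES / 60
approx_systems = ALERTS_PER_MONTH / ALERTS_PER_SYSTEM

print(f"{alerts_per_hour:.0f} alerts per hour")                      # about 76
print(f"one alert every {seconds_between_alerts:.0f} seconds")       # about 47
print(f"{alerts_per_sla_window:.0f} new alerts per 15-minute window")  # about 19
print(f"at least {approx_systems:.0f} monitored systems")            # about 7857
```

Under these assumptions a new alert arrives well under a minute after the previous one, and roughly 19 more arrive before the 15-minute window for the first has closed. Real traffic would be burstier, since scheduled changes produced hundreds of alerts at once.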
1. A look at alerts and event management revealed the following:
· An excessive number of non-actionable (false) alerts due to
o Improper thresholds
o Scheduled changes
o Decommissioned servers
o Test and development systems
o Pre-production testing
> Alerts during development testing were being flagged as production and showing up on the command center screens.
> Monitoring and alert generation were not implemented for many new production systems due to the alert management backlog.
2. No automation. All requests to add, remove, or modify monitoring and alert settings were manually performed and took over two weeks to complete. Simple threshold adjustments were treated as least critical and took longer than two weeks to complete or were completely ignored.
3. No linkage between the change, asset, and alert management applications.
4. No monitoring and alerting standards or best practices.
5. Monitoring was broken on 25% of the servers. As a result, no alerts were generated when any adverse conditions occurred on those servers or any applications hosted on them.
The majority of the workload for the alert management team was divided between reviewing change management records and adding or removing monitoring due to server changes.
Most of the alerts going to the monitoring screens were non-actionable alerts due to scheduled changes, improper asset management classification, or development system alerting.
Several courses of action were begun:
1. Create default best practice monitoring profiles for any new system. For existing systems with multiple profiles, a common default was created.
a. With default profiles in place, monitoring could be automated for any new systems or servers added to current applications.
2. Create automation to integrate the asset management systems with the alert management system.
a. Would automatically enable monitoring on new production systems.
b. Would automatically enable or remove monitoring on servers added to or removed from existing applications.
c. Would prevent alerts for decommissioned and non-production systems from going to the command center production screens.
3. Create automation to integrate the change management system with the alert management system.
a. Would prevent alerts caused by scheduled changes from appearing on the production monitoring screens between the change start and end times.
b. Change records that required alerts to be suppressed no longer required approval or any action by the alert management team, eliminating a large chunk of its weekly workload.
4. Implement a new policy requiring a response from the appropriate support team for every production alert sent to the command center monitoring screens.
a. The monitoring teams were no longer allowed to ignore alerts.
i. Responses such as “Notify me when it reaches 90%” were no longer accepted. Action had to be taken to correct the condition causing the alert or the alert thresholds had to be modified to alert at the appropriate condition.
b. Some of the alerts were caused by individuals implementing changes without an approved change record.
i. This policy helped tighten change management controls by highlighting unauthorized changes.
5. Automate trouble ticket creation.
a. Trouble tickets would automatically be created for every production alert sent to the command center monitoring screens.
6. Automate alert notification and escalation to the proper support teams.
a. Instant notification would be made by a Voice Response Unit to the on-call support member, with a call to the next person in line or to the team manager if there was no response. Final notification would go to the command center if no one from the support teams responded.
7. Integrate support contact and escalation information with the asset and HR management systems to ensure data is always current and accurate.
8. Create separate rules for production and non-production alerts.
a. Production alerts would go directly to the command center monitoring screens with a corresponding trouble ticket and support notification requiring an immediate response from the support teams.
b. Non-production alerts would go directly to the support teams. Alert processing rules would be configurable by the support teams to utilize the preferred notification methods and timeframes for each individual.
9. Create QA screens to prevent pre-production alerts from flooding production screens and to provide the opportunity to eliminate ambiguous and incoherent alerts.
a. This would be where monitoring and alerts get tested, tweaked, and adjusted before a system or application is moved into production.
10. Implement command center shift turnover meetings that include managers from the outgoing and incoming shifts as well as the senior command center manager to highlight current ongoing issues that may not seem significant at the time but may turn into major problems if not addressed in a timely manner.
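The routing and escalation rules in items 2, 3, 6, and 8 can be sketched in a few lines. This is a minimal illustration, not the system that was built: the class names, the environment labels, the returned route names, and the handling of unknown assets are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class Asset:
    name: str
    environment: str    # assumed labels: "production", "development", "decommissioned"
    on_call: List[str]  # escalation order: on-call member first, team manager last


@dataclass
class ChangeWindow:
    asset_name: str
    start: datetime
    end: datetime


@dataclass
class Alert:
    asset_name: str
    raised_at: datetime
    message: str


def route_alert(alert: Alert, assets: Dict[str, Asset],
                changes: List[ChangeWindow]) -> str:
    """Decide where an alert goes, using asset and change management data."""
    asset = assets.get(alert.asset_name)
    if asset is None:
        return "unclassified"    # assumption: unknown assets get reviewed, not dropped
    if asset.environment == "decommissioned":
        return "drop"
    if asset.environment != "production":
        return "support"         # non-production alerts bypass the command center
    in_change = any(
        c.asset_name == alert.asset_name and c.start <= alert.raised_at <= c.end
        for c in changes
    )
    if in_change:
        return "suppress"        # caused by a scheduled, approved change
    return "command_center"      # production alert: ticket plus notification


def escalate(asset: Asset, responded: Callable[[str], bool]) -> str:
    """Call each contact in order; fall back to the command center."""
    for person in asset.on_call:
        if responded(person):
            return person
    return "command_center"


assets = {
    "web01": Asset("web01", "production", ["alice", "bob", "manager"]),
    "dev07": Asset("dev07", "development", ["carol"]),
    "old02": Asset("old02", "decommissioned", []),
}
changes = [ChangeWindow("web01", datetime(2024, 1, 6, 1, 0), datetime(2024, 1, 6, 3, 0))]

during = Alert("web01", datetime(2024, 1, 6, 2, 0), "service restarted")
later = Alert("web01", datetime(2024, 1, 6, 9, 0), "disk 95% full")
print(route_alert(during, assets, changes))  # suppress
print(route_alert(later, assets, changes))   # command_center
print(escalate(assets["web01"], lambda person: person == "bob"))  # bob
```

The point of the design is that the routing decision depends only on data already held in the asset and change management systems, so no one on the alert management team has to act on each server addition or change record.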
As the above improvements were being implemented:
· The number and duration of business outages began going down dramatically.
· Avoidable business outages were virtually eliminated, and the duration for all other outages was cut dramatically.
· Almost overnight, requests for new monitoring decreased from several hundred per week with a two-week backlog to a dozen or so. Alerting and monitoring became an automatic process once a system was classified as production in the asset management system.
· New monitoring requests had previously required the requestor to submit a form using a complicated Excel spreadsheet. Those forms were extremely confusing and required the requestor to guess the appropriate threshold settings. The formation of best practice standard profiles allowed automation to be implemented, which alleviated the workload for the support teams as well as the alert management team.
· By the end of the first year, non-actionable alerts had virtually disappeared from the command center monitoring screens.
· Over a three-year period the number of monitored systems increased by 600%, and the size of the alert management and monitoring teams decreased by 25%.
· The alert management team had enough resources to create detailed training courses for the support teams and to expand their offering for business and database monitoring, helping to further reduce other problems that may affect customers.
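The staffing figures above imply a large change in how many systems each team member covered. The short calculation below assumes "increased by 600%" means seven times the original count. If it instead means six times, the result is 8 rather than about 9.3.

```python
# Assumption: "increased by 600%" means the original count plus six times that count.
systems_growth = 1 + 6.00   # 7x the original number of monitored systems
staff_change = 1 - 0.25     # teams shrank to 75% of their original size

systems_per_staff = systems_growth / staff_change
print(round(systems_per_staff, 2))  # 9.33
```

In other words, each remaining team member was covering roughly nine times as many monitored systems as before the changes.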