Guest Column | August 12, 2021

How MSPs Can Mitigate Microsoft 365 Outage Issues

By Rob Doucette, Martello Technologies

Automated alerts help minimize impact of IT outages

One trend that has emerged during the pandemic, and looks set to continue after it, is our ever-increasing dependency on cloud services like Microsoft 365. With many still working from home, we are more reliant on these services than ever. Any outage, regardless of its cause, results in disruption to business operations. When outages happen on a regional or global scale, they can severely impact the way we live and work.

Though periodic outages are inevitable, there are ways to mitigate them. Tools that provide ‘early warning’ of an outage and insight into what caused it can reduce the negative impact on customers. Taking this proactive stance, rather than following the traditional reactive model, will reduce downtime, help avert future outages, and increase customer satisfaction.

Ninety percent of performance issues are attributed to the affected enterprise’s own infrastructure and network. It is rarely an issue with Microsoft 365.

As a managed service provider (MSP), it is important not to simply chalk the problem up to a “Microsoft issue” – this leads to overlooking something or missing an instance where a Microsoft best practice has not been followed. Let’s explore the three types of outages, where they occur, their respective issue resolutions, and how to mitigate future problems.

Global outages are far-reaching and can have a larger ripple effect. They also are more challenging to remediate. Follow this three-step approach for faster resolution and a more proactive process to ward off future outages.

Identify the problem: In this scenario, all users in the organization are experiencing Microsoft 365 outage issues. Your customer’s global productivity is impacted. They are now vulnerable to data leaks and shadow IT (where you no longer have data visibility or control). Help desk tickets have increased.

Resolve the issue: In the typical scenario, help desks must rely solely on Microsoft data. Although that data is helpful and pertinent, by the time it arrives the outage has already occurred, disrupting productivity and costing revenue. Instead, MSPs should use a real user monitoring (RUM) solution, which provides real-time visibility into all Microsoft 365 issues. RUM provides a dashboard that highlights impacted areas, restoration times, and the degree of degradation of each service. RUM also enables the help desk to implement workarounds (temporary fixes that bypass a recognized problem or limitation in the system or policies) to prevent uncontrolled shadow IT (a data security risk or breach). In addition, active network path monitoring (ANPM) allows an MSP to determine right away where on the route to the cloud the issue is occurring. Together, these capabilities help the MSP pinpoint the source of the issue faster and identify whether it lies within the Microsoft global network or elsewhere. Consequently, remediation can be proactively communicated to users, help desk tickets are reduced, and productivity and revenue losses are mitigated.
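To make the idea of locating a problem "on the route to the cloud" concrete, here is a minimal sketch in Python. It is not how any particular ANPM product works; the hop names, the measurements, and the thresholds are all hypothetical, and the rule used (flag the first hop where latency jumps sharply or packet loss appears) is only one simple heuristic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    name: str          # hypothetical label for a point on the path
    latency_ms: float  # round-trip time measured to this hop
    loss_pct: float    # packet loss measured to this hop


def first_degraded_hop(
    path: List[Hop], max_jump_ms: float = 50.0, max_loss_pct: float = 2.0
) -> Optional[Hop]:
    """Return the first hop whose latency jumps or whose loss exceeds the limit."""
    previous_ms = 0.0
    for hop in path:
        jump = hop.latency_ms - previous_ms
        if jump > max_jump_ms or hop.loss_pct > max_loss_pct:
            return hop
        previous_ms = hop.latency_ms
    return None


# Hypothetical measurements from a user's office toward Microsoft 365.
path = [
    Hop("office-router", 2.0, 0.0),
    Hop("isp-edge", 12.0, 0.0),
    Hop("isp-peering", 140.0, 6.0),
    Hop("microsoft-edge", 150.0, 6.0),
]

culprit = first_degraded_hop(path)
print(culprit.name if culprit else "no degraded hop found")
```

In this made-up example the trouble starts at the ISP peering point rather than inside Microsoft's network, which is the kind of distinction that lets the help desk communicate the right remediation.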

Plan for future outage mitigation: Addressing outage issues should not stop at resolution. Proactive practices should be implemented and enforced to reduce future problems. The RUM dashboard does not just address an issue in real time. It can also serve as a valuable post-mortem assessment tool to head off future problems. All key metrics, including incidents and impacts, can be analyzed over specific periods. Service and incident automation can then be tuned to further mitigate issues.
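As a rough illustration of analyzing incidents over a specific period, the sketch below totals incident counts and downtime per service. The incident records are invented for the example, and the tuple layout is an assumption; a real dashboard would export its own format.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident records: (service, start, end).
incidents = [
    ("Teams", datetime(2021, 7, 1, 9, 0), datetime(2021, 7, 1, 9, 45)),
    ("Exchange", datetime(2021, 7, 3, 14, 0), datetime(2021, 7, 3, 14, 20)),
    ("Teams", datetime(2021, 7, 9, 16, 30), datetime(2021, 7, 9, 17, 0)),
]


def summarize(incidents, period_start, period_end):
    """Count incidents and total downtime minutes per service within a period."""
    summary = defaultdict(lambda: {"count": 0, "minutes": 0.0})
    for service, start, end in incidents:
        # Only incidents that began inside the period are counted.
        if period_start <= start < period_end:
            summary[service]["count"] += 1
            summary[service]["minutes"] += (end - start).total_seconds() / 60
    return dict(summary)


report = summarize(incidents, datetime(2021, 7, 1), datetime(2021, 8, 1))
for service, stats in sorted(report.items()):
    print(service, stats["count"], "incidents,", stats["minutes"], "minutes")
```

A summary like this makes it easier to see which service deserves tighter alerting or a documented workaround.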

Regional outages are limited to the user and their surrounding geographical area. Rather than complaining of a “Microsoft” issue, users will submit tickets indicating that “my Teams calls were bad.” Productivity is impacted. Users become frustrated by downtime and the help desk is left trying to identify and fix the problem.

Identify the problem: Most help desks will run through a series of user account tests (Wi-Fi, LAN, proxy, and ISP) as well as connectivity and networking assessments. They may also reach out to the network, server, and identity teams to pinpoint the cause.

Resolve the issue: These efforts are time-consuming, but they can largely be eliminated by deploying a Gizmo robot that regularly and automatically conducts user testing. Moreover, RUM and ANPM quickly provide user outage data and identify where on the route to the cloud the problem is occurring, so the issue at hand can be resolved with minimal finger-pointing and workarounds can be delivered to prevent uncontrolled shadow IT. Again, fast access to this data allows the help desk to update customers and their users frequently about issue resolution, decreasing the number of help desk tickets and reducing user frustration.
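The general technique behind automated, recurring user testing is often called synthetic monitoring. The sketch below is a generic illustration of it in Python and makes no claim about how the Gizmo robot is built. The class name, the `check` callable, and the "alert after three consecutive failures" rule are all assumptions chosen for the example.

```python
import time
from typing import Callable, List


class SyntheticProbe:
    """Runs a user-style check repeatedly and flags sustained failure."""

    def __init__(self, check: Callable[[], bool], failures_before_alert: int = 3):
        self.check = check
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0
        self.durations_s: List[float] = []

    def run_once(self) -> bool:
        """Run the check once; return True when an alert should be raised."""
        started = time.monotonic()
        try:
            ok = self.check()
        except Exception:
            ok = False  # a crashing check counts as a failed test
        self.durations_s.append(time.monotonic() - started)
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_before_alert


# Simulated check results stand in for a real sign-in or call test.
results = iter([True, False, False, False])
probe = SyntheticProbe(lambda: next(results))
alerts = [probe.run_once() for _ in range(4)]
print(alerts)  # [False, False, False, True]
```

In practice the check would be scheduled at a fixed interval and would exercise a real user action; requiring several consecutive failures before alerting is a simple way to avoid paging the help desk over a single blip.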

Plan for future outage mitigation: To circumvent future regional issues, the post-mortem dashboard analysis is extremely helpful. A global outage protocol that is well documented, in place, and used when necessary adds a further layer of protection. If multiple areas in the region experienced the outage, test accounts should be set up and periodically checked.

User-specific outages are isolated to a single end user. The user is having difficulty accessing or using Microsoft 365 applications such as Teams calling. This issue will generate a help desk ticket as well.

Identify the problem: Just as in the other scenarios, the help desk struggles to figure out the problem as user frustration mounts and productivity dips. However, because it is an individual user issue, there may be no clear outage pattern: multiple help desk tickets have not been generated. The help desk will run through the routine user account and network testing to try to determine the problem. If the complaint is regarding Teams calling, the help desk will check the Call Quality Dashboard for discrepancies.
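Telling a lone user issue apart from a wider outage is largely a matter of looking at who has reported trouble. The following sketch classifies an incident into the scopes discussed in this article from a set of affected users. The user names, region labels, and the extra "multi-region" bucket are hypothetical, and a real help desk would draw this data from its ticketing or monitoring system.

```python
from typing import Dict, Set


def classify_scope(affected: Set[str], region_of: Dict[str, str]) -> str:
    """Classify an incident by how widely its affected users are spread."""
    if not affected:
        return "none"
    if len(affected) == 1:
        return "user-specific"
    if affected == set(region_of):
        return "global"  # every known user is affected
    regions = {region_of[user] for user in affected}
    return "regional" if len(regions) == 1 else "multi-region"


# Hypothetical directory mapping each user to a region.
region_of = {"ana": "emea", "ben": "emea", "chen": "apac", "dee": "amer"}

print(classify_scope({"ana"}, region_of))
print(classify_scope({"ana", "ben"}, region_of))
print(classify_scope(set(region_of), region_of))
```

A single ticket lands in the user-specific bucket, which is a cue to start with that user's account, device, and network path before looking further afield.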

Resolve the issue: As the help desk works to identify the outage origin, the user will be updated on their help desk ticket status. If the organization had employed a Gizmo robot to test usage, the problem might have been avoided altogether. Regular monitoring and testing would have identified the problem – even before the user had an issue. What’s more, RUM would have eliminated the need for user testing by the help desk, since it already points to what the user experienced before the outage. It would also validate whether the service was being delivered to the user’s location.

Plan for future outage mitigation: To further enhance the mitigation of future user outages, specific user metrics can and should be updated. Outages and other hiccups are a good opportunity to improve and enforce best practices.

Conclusion

Planning today can prevent a problem tomorrow. As more of us work remotely and in hybrid workplaces, outage mitigation will take on even greater urgency. Take steps to ensure that your Microsoft 365 service offering provides your clients with the full scope of end-to-end service delivery performance capabilities. This is a valuable competitive differentiator that can drive new services revenue for MSPs.

Around-the-clock availability and performance monitoring will allow for service degradation pattern detection and immediate incident response. You will be aware of issues before your customers and their users, thereby mitigating potential outages, help desk tickets, customer dissatisfaction, reduced productivity, and lost revenue. Take full control of your Microsoft 365 service operations by proactively implementing, utilizing, and enforcing services and protocols that keep you one step ahead.

For more information on how to mitigate Microsoft 365 outage issues, visit martellotech.com for an outage preparedness checklist and infographic.

About The Author

Rob Doucette is Vice President, Product Management for Martello Technologies. Rob has more than 15 years of experience building market-driven solutions with a focus on real-time diagnostics, monitoring, and analytics. He has assembled multiple software development teams from scratch, managed executive-level partnerships and relationships, and helped to secure funding from Microsoft. Before joining Martello, Rob was the CTO of Savision, now a subsidiary of Martello, where he led software development teams and was responsible for building industry-leading products.
