The Scenario
Your business has constructed a variety of application environments designed to pursue and expand your mission-critical business plans and gain market share. Over time, security concerns, maintenance, and user convenience steered you to building a unified, shared authentication infrastructure such as Identity and Access Management (IAM).
In this situation, your financials system uses IAM, as do your time recording, expense, sales tracking, ticketing, and development systems. They all tie into a single system for authentication and data center asset access control. This has saved your business countless hours in maintenance, unified your account management efforts, and put a single, controllable, and effective system in place to manage this part of your business. So what’s wrong? Doesn’t this solve your problems?
Foulk Consulting can help you identify and avoid the type of problems that the following case study highlights. Here’s how we did it with a high-profile Internet E-Retailer.
Case Study: E-Retailer Preparing for Peak Usage Season
An E-Retailer built multiple external, customer-facing applications and invested a lot of time and effort in performance testing all of them. The testing goals were clear and refined: they were based on each application’s usage model during the preceding peak season, combined with other business requirements and usage projections that had to be met for the upcoming peak season. Each performance test focused on a specific customer-facing application, validating that application’s ability to sustain the expected load and perform within its defined success criteria.
Through their hard work, they built a case internally that they were very well prepared for the upcoming peak season. When the peak season hit, however, it became clear that something was terribly wrong: the systems could not handle the actual peak customer loads.
The IAM infrastructure was tested with each application independently and it performed as expected.
During their annual peak event, user sessions were dropped at an unexpected and unacceptable level. Most new users could not register accounts, and most returning customers could not log in to make their purchases. The failure to convert shopping activity into sales caused them to miss their projected revenue numbers by a large margin. In fact, they faced crippling losses for the year. It is a story that has played out too many times in recent years.
So how did this happen? Didn’t we say that they validated this load before the peak season? It sure seems that they did their due diligence and stressed the machines beyond the traffic levels that they expected, which were even higher than the year before. How is that possible? Did they test the wrong things? No. Did they fail to test dependent subsystems to the right levels? Not exactly. How can you create good tests, find defects and configuration problems, resolve them, validate your efforts, demonstrate that your testing exceeded the loads you were expecting, and still watch your systems fail under real user load?
As a result of this missed business opportunity and the public hit to customer confidence, Foulk Consulting was contacted to assist in analyzing the situation. We came in and worked with all the involved teams. We reviewed their testing practices, planning, and assumptions, and we dove into the peak load logs, analytics, and event data that had been collected. Our aim was to find out how this situation arose and, more importantly, to help this valued customer devise an approach to prevent it from happening again.
Critical Analysis & Planning
Coming out of the peak season with this black eye, the business hired Foulk Consulting to assist them in analyzing the planning and testing leading up to the event as well as the forensic data from the event itself. Our goal was not only to illuminate where the problem originated and how it manifested itself, but also to show how they could prevent it in the future: to help them address the actual reasons why they were losing customers and market share, building on all the hard work they had done before.
After thorough investigation and analysis, what emerged from the piles of data surprised everyone in the organization, because each of their applications had been tested thoroughly and each had met its performance objectives, maintaining stability and performance under load.
The root cause appeared to be a complete lack of preparation on the IAM tier and on the dependent login integrations across the various applications. When real customers loaded multiple applications at the same time during peak usage, it created a classic bottleneck: the request demand exceeded the system’s ability to respond in a performant manner.
When each application was tested in isolation, the IAM tier was capable of handling that application load without any problems. When the actual peak production load hit all the applications at the same time, however, the IAM infrastructure became a complete bottleneck and customers could not interact with the systems. This was found to be the root of the peak usage collapse.
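The arithmetic behind this failure mode can be sketched in a few lines of Python. Every figure and application name below is a hypothetical assumption chosen for illustration; none of them come from the customer's data. The sketch only shows how each isolated test can pass against a shared tier while the combined load exceeds that tier's capacity.

```python
# Hypothetical figures for illustration only; none come from the case study.
IAM_CAPACITY_RPS = 1000  # assumed max auth requests/sec the shared IAM tier sustains

# Assumed peak authentication demand per application (requests/sec).
peak_auth_rps = {
    "storefront": 600,
    "registration": 300,
    "order_tracking": 250,
    "loyalty": 150,
}

def passes(demand_rps, capacity_rps=IAM_CAPACITY_RPS):
    """Return True when the demand fits within the assumed IAM capacity."""
    return demand_rps <= capacity_rps

# Isolated tests: each application is checked alone against the IAM tier.
isolated_results = {app: passes(rps) for app, rps in peak_auth_rps.items()}

# Combined test: all applications hit the IAM tier at the same time.
combined_rps = sum(peak_auth_rps.values())
combined_result = passes(combined_rps)

print("isolated:", isolated_results)
print("combined demand:", combined_rps, "passes:", combined_result)
```

Under these assumed numbers every application passes on its own, yet the summed demand is above the assumed capacity, which is the pattern the isolated tests could not reveal.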
Finding the root cause was only part of the puzzle. Foulk then took that information and applied a performance engineering approach to testing that would ensure that the business was able to identify, isolate and test the systems that demanded attention. The plan was designed to focus the customer’s efforts and identify exactly who was responsible for resolving the issues during their next preparation and deployment cycle.
Ultimately we were able to work with the various application teams and construct a performance test that stressed the IAM tier realistically, reproducing the production peak load failure model.
Armed with this repeatable test structure, the customer was then able to work through the issues with the application support, operations, and vendor teams, building out the IAM infrastructure until it passed the load test criteria and removing the significant production performance bottleneck in the IAM tier.