Crunch Time Clarity

The final Push to Prod is always exciting. I will also say that Go Live week is always fun, since I only remember the good times, which is amazing given that the best deployments were typically followed by a pub crawl up Moody Street.

As I work through two “Go Live” weeks this week, Team A is coalescing into a tightly knit unit while Team B is bogged down in a swamp of minutiae, too far from their critical path.

Planning and preparation have been better on Team A, which is performing in tune and on time. The lack of planning and preparation on Team B has them learning new skills and techniques during crunch time.

Neither team has any show-stoppers in its application; both applications are of good quality. The deployments may go well for both groups. However, if any challenges arise during the deployment, Team A is relaxed enough to handle adversity with clarity. Team B is not.

The state of mind held by Team A begins with their CEO. He brought the leadership team together this week and said, in his Scottish brogue, “Isolated incidents are OK, we want to be on the watch for systemic problems. Remember gents, it’s just got to be good enough.”

This insight from leadership, at a critical moment in the project, was just what the team needed to release any undue pressure and settle onto solid footing for the final push to production.

Hopefully, everything will go perfectly for Team B and they won’t face any moments that require dynamic clarity.


Written by Steve McGinley

How New Jersey Prepared for a Flood of COVID-19 Loan and Grant Applications

Check Out Tricentis Flood’s New Blog Post:

https://www.flood.io/blog/how-new-jersey-prepared-for-a-flood-of-covid-19-loan-and-grant-applications

#Tricentis #FoulkConsulting

Filtering out Synthetic Monitoring Traffic from Google Analytics

We have been using automated scripts to monitor website performance and availability. These scripts usually launch a real browser and navigate through pre-defined steps to capture timings, reporting back at least one checkpoint showing a pass/fail. The issue that recently came up was that the Google Analytics reports showed a few spots on the map with high traffic, and those were a direct result of these scripts running every 15 minutes from 3 locations in the USA. The request was to find a way to filter out this monitoring traffic so that the reports would reflect real users only and exclude the monitors.
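
To make the pattern concrete, here is a minimal sketch of what such a monitor can look like, written in Python with Selenium purely for illustration (the real monitors in this case were TruClient scripts, and the URL, element locator, and timing budget below are hypothetical):

# Minimal synthetic-monitor sketch: drive a real browser through a
# pre-defined step, time it, and report a single pass/fail checkpoint.
# The URL, locator, and threshold are placeholder values.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

THRESHOLD_SECONDS = 10  # pass/fail budget for this workflow step

driver = webdriver.Chrome()
try:
    start = time.time()
    driver.get("https://www.example.com/")       # step 1: load the landing page
    driver.find_element(By.ID, "main-content")   # step 2: confirm a key element rendered
    elapsed = time.time() - start
    status = "PASS" if elapsed <= THRESHOLD_SECONDS else "FAIL"
    print(f"checkpoint landing_page: {status} ({elapsed:.2f}s)")
finally:
    driver.quit()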

We explored a few ideas based on what Google Analytics (GA) offered. The first idea was to modify the Accept-Language header to include a test value like zzzz. We tried both adding the value to the header and replacing the entire string with the new value, and neither approach worked – GA just reported the same language of en-us.

We also tried adding a custom header name/value pair. Custom headers conventionally start with an X- prefix followed by a name of our choosing, so we tried X-synthetic-monitor with a value like the name of the script. For some reason, these custom headers could not be viewed in the GA tool.

We then tried a custom cookie name/value pair, hoping it would be passed along to GA, but alas it was not.

We tried resizing the browser to a known value like 1234×987; however, the JavaScript running on the page was resizing the viewport based on its own algorithms. The viewport size was also changing during the life of the script as the code stepped through a workflow and the JS responded to those inputs. This is how most sites are built today, so we ended up with a range of width and height values for the browser instead of fixed values we could pick out of the noise. Therefore, resizing was not a good candidate for filtering traffic.

We also looked at filtering by the IP addresses of the machines that run the synthetic monitoring scripts, but those IPs were dynamic and we had no way of knowing what they were at any given time. Even if we looked up the current IP of a monitoring machine, it could change quickly due to a VM migration, a reboot, or load balancing to another VM. IPv4 addresses are an unreliable way to identify a particular user, even a synthetic one.

Ultimately we settled on blocking the domains for Google Analytics completely in the script’s runtime settings. GA depends on JavaScript code running in the browser, gathering up data, and then sending that data to Google. If we simply prevent those requests from being made by blacklisting the domains, the data never makes it to the GA system. This achieved the goal for the team: they no longer had to tweak the filters and could trust that the scripts we deployed were not included in the analytics data.

These are the domain suffixes we chose to block, or blacklist:
– google-analytics.com
– googletagmanager.com

The other approaches that we tried would probably work for web servers that the customer controls, especially the one at the origin. We could investigate those logs and see the custom headers and cookies being sent along. We chose to leave those in place just in case they wanted to analyze logs or traffic at other points in the system, outside of what Google Analytics shows.
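
As a sketch of what that server-side analysis might look like, assuming (hypothetically) that the origin’s access-log format were extended to record the X-synthetic-monitor header value, separating monitor traffic from real-user traffic becomes a simple filter over the log file:

# Split an access log into monitor and real-user traffic, assuming the
# (hypothetical) log format includes the X-synthetic-monitor header value.
# The marker value and file name are placeholders.
MONITOR_MARKER = "mymonitoringscript"

def split_log(path):
    monitors, real_users = [], []
    with open(path) as log:
        for line in log:
            (monitors if MONITOR_MARKER in line else real_users).append(line)
    return monitors, real_users

monitors, real_users = split_log("access.log")
print(f"monitor requests: {len(monitors)}, real-user requests: {len(real_users)}")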

Technical Implementation Details

Tool: LoadRunner 12.60
Protocol: TruClient Web
Domain Blacklisting: Runtime Settings > Download Filters > Exclude only addresses in list
HostSfx: google-analytics.com
HostSfx: googletagmanager.com
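
The download filter settings above are specific to TruClient. For a monitor driven by Selenium on Chrome instead, a comparable block could likely be achieved through the Chrome DevTools Protocol; the sketch below is not part of the original implementation, and the monitored URL is hypothetical:

# Block the analytics domains in a Selenium-driven Chrome session via the
# Chrome DevTools Protocol so the monitor's traffic never reaches Google Analytics.
from selenium import webdriver

driver = webdriver.Chrome()
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setBlockedURLs", {
    "urls": ["*google-analytics.com*", "*googletagmanager.com*"],
})
driver.get("https://www.example.com/")   # analytics beacons are now dropped
driver.quit()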

Headers

Modifying an existing header and merging in the new value:

truclient_step("1", "Execute Utils.addAutoHeader ( 'Accept-Language' , 'Zzzz' , true )", "snapshot=Action_1.inf");

Adding in a completely new custom header name and value:

truclient_step("2", "Execute Utils.addAutoHeader ( 'X-synthetic-monitor' , 'mymonitoringscript' , false )", "snapshot=Action_2.inf");

Resizing the browser to a specific width and height:

truclient_step("6", "Resize browser window to 1234 px / 987 px", "snapshot=Action_6.inf");


Written by JJ Welch

How Much Influence Should Developers Have Over the Tools They Use?

Overview

Excellent software has been produced by development teams that had total control over the tools they used, and also by teams that had their tools mandated by management. On the other hand, poor software has been delivered under both approaches.

Here are some thoughts to consider as you decide how your tooling choices are made.


Pros

Development teams are more autonomous and have more tool choice

  • The team becomes more invested in the process of building code.
  • The more invested your teams are, the better their product should be.
  • This reduces the need for a systems administration team and the number of employees in it.

The teams are less autonomous and have less tool choice

  • Fewer tools to rely on for producing output means there is less variance to consider in effort projections
  • Fewer pockets of tooling knowledge reduce knowledge fracturing
  • Less license sprawl reduces cost
  • Teams that use the same tools can cross-pollinate the lessons they learn

Cons

Development teams are more autonomous and have more tool choice

  • Tool sprawl, which leads to consolidated pockets of tool-centric knowledge and devalues remediation groups such as SREs.
  • Lack of standardization in the development process, which leads to difficult deconstruction when remediation is done by groups not involved with the initial development process.
  • An individual developer becomes a chokepoint, should they become unavailable.

The teams are less autonomous and have less tool choice

  • There may be better tools to use.
  • Your team may feel devalued and become disengaged.

Additional Influencers and Considerations

Who is responsible for the maintenance of the code, the Site Reliability Engineers or the development team?

The team responsible for the viability of the tools should have influence.

Who pays for the tool and its administrative upkeep?

The group fiscally responsible for the tool should have influence.

How well does a particular tool integrate with your existing tooling ecosystem?

Tools that integrate more easily and more completely should carry more weight than those that don’t.

Does a tool have a killer feature that your team MUST have?

This should be considered and weighted appropriately.

How comfortable are you with staff turnover?

If you have a well-defined requisition that is easily filled, then you’re in a position to accommodate a higher turnover rate than if you are trying to replace a specialist.  In other words, let specialists have more influence than less specialized roles.


The easier it is to collect the data required for your quality and delivery KPIs, the less time is spent on normalization, which reduces your data’s time to market.

Normalization is typically a highly manual effort.  Reducing the amount of manual intervention increases the accuracy of your governance metrics.

By applying control at the point where you perform evaluations, you provide the maximum opportunity for individual tool choice.

A happy developer makes useful software and happy users.  We don’t want your engineers feeling like Bill Parcells.

There will be a proper balance of accuracy, time to market, and staffing for your situation.


Written by Steve McGinley

Shared Application Infrastructure Can Be a Critical Performance Constraint and Blind Spot for Your Business

The Scenario

Your business has constructed a variety of application environments designed to pursue and expand your mission-critical business plans and gain market share. Over time, security concerns, maintenance and user convenience steered you to building a unified, shared authentication infrastructure such as Identity and Access Management (IAM). 

In this situation your financials system uses IAM, as do your time recording, expense, sales tracking, ticketing and development systems. They all tie in to a single system for authentication and datacenter asset access control. This has saved your business countless hours in maintenance, unified your account management efforts, and put a singular, controllable and effective system in place to manage this part of your business. So what’s wrong? Doesn’t this solve your problems?

Foulk Consulting can help you identify and avoid the type of problems that the following case study highlights. Here’s how we did it with a high-profile Internet E-Retailer.

Case Study: E-Retailer Preparing for Peak Usage Season

An E-Retailer built multiple external, customer-facing applications and invested a lot of time and effort to performance test them all. The testing goals were clear and refined: they were based on each application’s usage model during the preceding peak season, combined with other business requirements and usage projections that had to be met for the upcoming peak season. In the course of these performance testing efforts, each test was focused on a specific customer-facing application, testing and validating that application’s ability to sustain the expected load and perform within its defined success criteria.

Through their hard work they built a case internally that they were very well prepared for the upcoming peak season. However, when the peak season hit, the reality of the situation became clear: something was terribly wrong, and the systems could not handle the actual peak customer loads.

The IAM infrastructure was tested with each application independently and it performed as expected.  

During their annual peak event, user sessions were dropped at an unexpected and unacceptable level. Most new users could not even register accounts, nor could most returning customers log in to make their purchases. The failure to convert shopping activity into sales resulted in them missing their projected revenue numbers by a large margin; in fact, they faced crippling losses for the year. It is a story that has played out too many times in recent years.

So how did this happen? Didn’t we say that they validated this load before the peak season? It sure seems that they did their due diligence and stressed the machines beyond the traffic levels that they expected, which were even higher than the year before. How is that possible? Did they test the wrong things? No. Did they fail to test dependent subsystems to the right levels? Not exactly. How can you create good tests, find defects and configuration problems, resolve them, validate your efforts, demonstrate that your testing exceeded the loads you were expecting, and still watch your systems fail under real user load? 

As a result of this missed business opportunity and the public hit to customer confidence, Foulk Consulting was contacted to assist in analyzing the situation. We came in and worked with all the involved teams. We reviewed their various testing practices, their planning, and their assumptions; we dove into the peak load logs, analytics, and the event data that was collected, trying to find the keys to how this situation manifested itself and, more importantly, to help this valued customer devise an approach to prevent this sort of thing from ever happening again.

Critical Analysis & Planning

Coming out of the peak season with this black eye, the business hired Foulk Consulting to help analyze the planning and testing leading up to the event as well as the forensic data from the event itself. Our goal was not only to illuminate where the problem originated and manifested itself, but also to show how they could prevent it in the future – to help them address the actual reasons they were losing customers and market share, building on all the hard work they had done before.

After thorough investigation and analysis, we found that even though each application had been tested thoroughly, and each had proven to meet its performance objectives while maintaining stability and performance under load, what emerged from the piles of data was surprising to everyone in the organization.

The root cause appeared to be a complete lack of preparation on the IAM tier and the dependent login integrations across the various applications. When real customers loaded multiple applications at the same time during peak usage, it created a classic bottleneck: the request demand overloaded the system’s ability to respond in a performant manner.

 

When each application was tested in isolation, the IAM tier was capable of handling that application’s load without any problems. When the actual peak production load hit all the applications at the same time, however, the IAM infrastructure became a complete bottleneck and customers could not interact with the systems. This was found to be the root of the peak usage collapse.
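
The arithmetic behind this is worth spelling out with made-up numbers (the actual figures from this engagement are not public): if five applications each validated the shared IAM tier at 200 authentications per second in isolation, every single-application test can pass comfortably even if the IAM tier tops out around 400 authentications per second. A real peak that drives all five applications at once, however, asks for roughly 1,000 authentications per second, well beyond that ceiling, so every application appears to fail at the same time even though none of them changed.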

Finding the root cause was only part of the puzzle. Foulk then took that information and applied a performance engineering approach to testing that would ensure the business was able to identify, isolate and test the systems that demanded attention. The plan was designed to focus the customer’s efforts and identify exactly who was responsible for resolving the issues during their next preparation and deployment cycle.

Ultimately we were able to work with the various application teams and construct a performance test that stressed the IAM tier realistically, reproducing the production peak load failure model.
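
The specific tooling used for that test isn’t described here; purely to illustrate the shape of a combined workload, here is a minimal sketch in Python with Locust, where the hosts, endpoints, weights, and credentials are all hypothetical:

# Combined-workload sketch: several application user types that all
# authenticate against the shared IAM tier run in one scenario, so the
# aggregate login rate hits IAM the way real peak traffic does.
from locust import HttpUser, task, between

class StorefrontUser(HttpUser):
    host = "https://shop.example.com"
    weight = 3                      # the bulk of peak traffic
    wait_time = between(1, 5)

    def on_start(self):
        # Every user type funnels through the same IAM-backed login.
        self.client.post("/iam/login", json={"user": "shopper", "password": "x"})

    @task
    def browse_and_checkout(self):
        self.client.get("/catalog")
        self.client.post("/cart/checkout", json={})

class TicketingUser(HttpUser):
    host = "https://support.example.com"
    weight = 1
    wait_time = between(2, 8)

    def on_start(self):
        self.client.post("/iam/login", json={"user": "agent", "password": "x"})

    @task
    def open_ticket(self):
        self.client.post("/tickets", json={"subject": "order issue"})

Running all user types together in one scenario is what makes the difference: the IAM tier sees the summed authentication rate from every application at once instead of one application’s share at a time.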

Armed with this repeatable test structure, the customer was then able to work with the application support, operations, and vendor teams to resolve the issues, build out the IAM infrastructure to pass the load test criteria, and remove the significant production performance bottleneck in the IAM tier.


Written by Brian Brumfield