Performance Matters

How SOASTA Leverages Cloud APIs

Our position as one of the first enterprise cloud platforms has given us a great view of the evolution of the cloud. On any given day at SOASTA we may be looking for thousands of cloud servers to simulate millions of virtual web site visitors. As a result, we’ve done more cloud testing across more cloud providers than anyone else.

In doing so, SOASTA is a consumer of Infrastructure-as-a-Service (IaaS) and a provider of Software-as-a-Service. Amazon was among the first to make IaaS available as an elastic, pay-as-you-go cloud service and has since been joined by many other providers such as Rackspace, IBM, Microsoft and GoGrid. Along with these public providers, companies like Eucalyptus and Nimbula offer behind-the-firewall cloud platform solutions.

SOASTA’s CloudTest platform depends on the swift provisioning and releasing of servers, and we must quickly identify bad instances and bring up replacements, while ensuring that each server plays its role in a distributed, multi-vendor architecture. To do this, and best take advantage of the capability to easily and affordably start, stop and manage servers, SOASTA uses cloud vendor APIs for automation.

Using Cloud APIs

Elasticity is most commonly associated with changes in supply and demand based on price. The application of elasticity in the cloud is not much different. An elastic API refers to an infrastructure vendor’s ability to respond to demand by allowing customers to quickly and automatically spin up servers, and just as quickly take them down. For applications such as performance testing this is incredibly important.

SOASTA evaluates cloud vendors based on a number of criteria. Our initial experience is always through the provider’s web-based user interface. If servers come up fast, the configuration options are appropriate, the GUI is capable, and the business model fits, we switch our attention to the cloud vendor’s API.

SOASTA’s unique grid provisioning technology uses the APIs to automate test setup. CloudTest is a sophisticated, distributed test platform that is controlled through a browser. Testers use a wizard (shown below) to quickly select provider(s), location(s), number of servers and type of servers so that within minutes they can execute large-scale performance tests.

The API should support functions such as start, stop, reboot and resize. In addition, using an API highlights the importance of intelligently handling errors and providing clear notifications. As always, meaningful error messages go a long way toward troubleshooting when there’s a problem.

No cloud vendor is immune to instances occasionally failing upon launch. Some vendors have fewer issues than others, but when it does happen you want to delete that instance and automatically replace it through the API. You also want to be able to do so no matter what state the instance is in. An instance stuck in provisioning mode, or one that can’t be stopped at all, is inconvenient at best, and expensive if not caught.
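The replace-on-failure behavior described above can be sketched roughly as follows. This is a minimal illustration, not CloudTest's implementation: the CloudProvider interface, the FakeProvider stand-in, the state names and the provision function are all hypothetical, and a real vendor API would require polling for state changes rather than reading the state immediately after launch.

```python
from dataclasses import dataclass, field
from typing import Protocol
import itertools


class CloudProvider(Protocol):
    """Hypothetical minimal interface; real vendor APIs differ."""

    def launch(self) -> str: ...
    def state(self, instance_id: str) -> str: ...
    def terminate(self, instance_id: str) -> None: ...


@dataclass
class FakeProvider:
    """In-memory stand-in so the sketch runs without any cloud account."""

    # States handed out to successive launches; later launches are "running".
    scripted_states: list[str] = field(default_factory=list)
    instances: dict[str, str] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def launch(self) -> str:
        instance_id = f"i-{next(self._ids)}"
        state = self.scripted_states.pop(0) if self.scripted_states else "running"
        self.instances[instance_id] = state
        return instance_id

    def state(self, instance_id: str) -> str:
        return self.instances[instance_id]

    def terminate(self, instance_id: str) -> None:
        # Termination should succeed whatever state the instance is in.
        self.instances.pop(instance_id, None)


def provision(provider: CloudProvider, wanted: int, max_attempts: int = 10) -> list[str]:
    """Launch `wanted` healthy instances, replacing any that fail to come up."""
    healthy: list[str] = []
    attempts = 0
    while len(healthy) < wanted:
        if attempts >= max_attempts:
            raise RuntimeError("gave up: too many failed launches")
        attempts += 1
        instance_id = provider.launch()
        if provider.state(instance_id) == "running":
            healthy.append(instance_id)
        else:
            provider.terminate(instance_id)  # delete the bad instance, then retry
    return healthy


provider = FakeProvider(scripted_states=["running", "error", "stuck-provisioning", "running"])
print(provision(provider, wanted=3))
```

The attempt cap is a design choice: without it, a provider that is failing every launch would keep the loop (and the bill) running indefinitely.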

If, like SOASTA, you have a requirement for, or can benefit from, going cross-cloud, you’ll have to deal with the fact that every API is still quite different. Many vendors will tell you they have an AWS-like or AWS-compatible API, but that doesn’t mean you can just unplug from one and plug into another. We take care of this behind the scenes for our customers using CloudTest.

In addition to the proprietary APIs, there are ongoing collaborative efforts at creating standards within the cloud infrastructure community. Efforts like libCloud, DMTF’s Open Cloud Standards and OpenStack have gotten traction, the latter probably more so than any other to date. While still not ready for primetime, OpenStack, initiated by Rackspace and NASA Ames, is supported by players such as Dell, Cisco and Citrix, among others, and serves as the basis for the upcoming HP cloud offering. It’s also garnering interest from enterprises as the basis for internal deployments.

Cloud-based infrastructure services have matured dramatically in the last three years, with greatly increased reliability and capacity. Today, there are dozens of providers with locations around the world providing access to hundreds of thousands of affordable server resources. This access allows individual developers to exercise their creativity, and companies like SOASTA, using the available APIs, to provide services at a speed and cost that was impossible just a few short years ago.

Don’t Neglect Your Mobile Customers

Errors that only your mobile or tablet customers may see

Every day, more and more of your customers move off their standard desktop computers in favor of smaller and lighter alternatives. Today, many people do much of their day-to-day Internet activities exclusively on their smartphones and tablets. One can no longer assume that the majority of traffic coming to your web application will come from the traditional desktop computer; this is especially relevant for e-commerce sites.

According to data collected by Digby, this holiday season, more than ten percent of Cyber Monday shoppers used their mobile devices to make a purchase, and total mobile purchases accounted for more than 6.5% of all sales made by consumers. Both of these statistics are up significantly from last year, and the growth shows no signs of slowing down. In fact, 67% of consumers reportedly plan to make at least one purchase from a mobile device this year. As this trend continues, companies must consider how they plan and structure their applications, as well as the implications for performance and functional testing needs.

Recently, SOASTA was involved in a project for a major retailer preparing to launch a new and highly anticipated product. This retailer estimated that 30% of the launch-time traffic would originate from smartphones and tablets. Because of the unique recording technology built into CloudTest, we were able to construct a test that ultimately exposed several bottlenecks in the mobile order paths, which were not present in the standard desktop browser flows.

Under load, both the tablet and smartphone paths showed a significant number of errors when adding products to a shopping cart:

In the chart above, not only do we see a large number of HTTP 500 Internal Errors (which occurred whenever the response time for a request exceeded 10 seconds), but there is also a substantial number of asset requests that returned ‘Not found’ during the test (for an asset which happened to be vital to the completion of an order). Every one of these errors represented a sale that did not complete. In just 15 minutes, more than 7,000 orders failed due to these errors. To make matters worse, the results described above were produced by a test that was configured to run at only 10% of the volume and concurrent connections expected when the new product launched!

As a result of this test, the customer discovered a problem with their application code that caused these errors and changed it prior to launch.

Errors are bad enough, but what about response times?

This particular test exposed some critical scalability issues that only showed up under load – and only showed up on mobile and tablet devices. Unfortunately for our customer, it wasn’t the only issue observed during this test. Certain page and transaction times were extremely high as well – as shown in the chart below.

Here we see the response times per transaction experienced by mobile users, again at 10% anticipated launch traffic. The final step in the user experience (and in this case, arguably the most important: when the user’s payment is processed) had an average time to complete of more than 30 seconds on the mobile and tablet sites! This is in contrast to the desktop browser test scenarios which showed virtually no slowdown or issues. This helped the customer isolate a specific database contention issue that was unique to the mobile and tablet flows.

Mobile Implications

Fortunately, these issues were detected early, which allowed our customer to address them. Had these issues not been resolved, a failure of their site would have had a substantial impact on revenue.

Keep in mind, however, that your visitors are utilizing your site for more than direct purchases; they’re also looking for promotions, product descriptions, and locations or directions to brick and mortar stores. This type of usage requires additional design considerations for the layout of your site’s pages. Besides the obvious considerations for the physical differences of the device itself (screen size, touch screen input, etc.) we must also consider bandwidth availability and transfer speeds of mobile devices, and keep those constraints in mind when executing our tests and analyzing results.

Because mobile devices typically take longer to complete a transaction (especially on 3G and Edge networks), the resulting connections from those devices are kept open longer. This has the potential to increase the total number of active connections when user traffic is high. Failure to tune your application and network to properly handle these longer connections may result in missed opportunities. Even if the application servers have ample system resources to handle more traffic, potential customers will be unable to access the application because of insufficient connection availability.
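The relationship between connection duration and open connections can be made concrete with Little's Law (average number in the system = arrival rate x average time in the system). The sketch below is only an illustration: the request rate and the hold times for desktop and mobile are hypothetical figures, not measurements from the test described here.

```python
def concurrent_connections(requests_per_second: float, seconds_held: float) -> float:
    """Little's Law: average connections open = arrival rate x time each stays open."""
    return requests_per_second * seconds_held


# Hypothetical figures, chosen only for illustration.
desktop = concurrent_connections(requests_per_second=200, seconds_held=1.5)
mobile = concurrent_connections(requests_per_second=200, seconds_held=6.0)

print(f"desktop: {desktop:.0f} open connections")
print(f"mobile:  {mobile:.0f} open connections")
```

At the same request rate, connections held four times longer require four times as many connection slots, even though CPU and memory demand per request may be unchanged.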

Whether it’s a mobile site or a mobile application, SOASTA CloudTest is able to record directly from mobile devices, allowing anyone to create robust test scenarios without the guesswork involved with user-agent spoofing or other artificial means of deceiving your servers. CloudTest provides the analysis tools required to determine what users would experience when they visit your site, regardless of where they are and what type of device they are using.

The Importance of “Bursting” Static Assets

When load testing at the HTTP level, it’s important to emulate the browser as closely as possible. In a browser, the HTML page is loaded first. Then, in parallel, the links for the static assets on that page are parsed and requested from the target web server. CloudTest, like most contemporary browsers, defaults to six parallel threads to download the static assets on a page. This provides a realistic method for testing a web application. Within CloudTest, the ability to download assets in parallel is referred to as a “burst”.

Running in sequence would be an alternative to “burst” mode. This means the test application requests the HTML page and, subsequently, all of the static assets are requested in sequence (no parallelism). Unfortunately, this mode only uses one connection, so the test must wait for the response of one request before it sends the next request. What we find is that the time it takes to load a page in a browser is much faster than the time it takes to load a sequenced page. The question is how much faster?
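A rough model of the two modes helps show where the difference comes from. The sketch below is an assumption-laden simplification, not how CloudTest schedules requests: it ignores connection setup, bandwidth sharing and server-side queuing, and the timings passed in are hypothetical.

```python
import heapq


def sequenced_load_time(html_time: float, asset_times: list[float]) -> float:
    """One connection: every request waits for the previous response."""
    return html_time + sum(asset_times)


def burst_load_time(html_time: float, asset_times: list[float], connections: int = 6) -> float:
    """HTML first, then assets spread across parallel connections."""
    if not asset_times:
        return html_time
    # Each heap entry is the time at which one connection becomes free.
    free_at = [html_time] * min(connections, len(asset_times))
    heapq.heapify(free_at)
    finish = html_time
    for duration in asset_times:
        start = heapq.heappop(free_at)  # earliest available connection
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical page: 0.2 s for the HTML, then twelve assets of 0.1 s each.
assets = [0.1] * 12
print(f"sequenced: {sequenced_load_time(0.2, assets):.2f} s")
print(f"burst:     {burst_load_time(0.2, assets):.2f} s")
```

With these made-up numbers the sequenced page takes about 1.4 seconds and the burst page about 0.4 seconds, because the twelve assets complete in two rounds of six rather than twelve rounds of one.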

I set up a simple test to show page load times using the CloudTest “burst” method (set to the default of six parallel threads, which is configurable) vs. the “sequenced” method. The test has one user for each method repeatedly hitting the SOASTA homepage. The screenshots below show that by “bursting”, as a browser does, we are generating 4x more traffic per user. In one minute a user hit the home page 82 times when configured with multiple connections, and only 21 times without. The average time to load the burst page was 0.75 seconds, and the sequenced page was 2.85 seconds.

The inaccurate page load times are the most obvious issue. Performance with Sequenced users was roughly 4x slower than with Burst users. The less obvious issue is throughput. Each virtual user generates less traffic per second than a real user would, so the site serves fewer HTTP requests per second and less bandwidth. Fewer hits per second also mean less CPU usage on the servers. The danger here is that a site could potentially crash with fewer users than were actually tested. The site might hit a bandwidth limit or a CPU bottleneck in the real world that it never saw in testing.

To show the impact on throughput, I set up a second test using CloudTest Lite. In this test, I have one user hitting the page every five seconds. This should take away any concerns with think time or pacing. Every five seconds we hit the home page. Simple enough.

The charts below report the “Send Rate” (HTTP requests/second) and Bandwidth (Megabits/second). The first chart shows the traffic patterns of a Sequenced user. Since the page takes longer than a second, the HTTP requests are always spread over more than one second. This results in the website never serving more than 23 HTTP requests/second, or 1.6 Megabits/sec. The test is artificially throttling the traffic.

The second chart shows a “Burst” user. The multiple connections used in burst mode allow for a faster and more realistic download of the page. In this example, the website is serving 32 HTTP requests/second, or 2.3 Megabits/sec. This equates to 40% more HTTP requests/second, and 43% more bandwidth per virtual user.

At SOASTA, our methodology with HTTP load testing is to accurately simulate the traffic of a real user. Our goal is to accurately predict how well a site can handle expected traffic. In our experience, doing so is much more than just having the right “Think Times”. Accurate simulation of the way a browser loads a page can have a huge impact on the accuracy of the test. In this simple example we found response times with 4x variability and throughput with 40% variability. When using “sequenced” loads, you could be simulating much lower traffic than you think and you could be overestimating the number of users your site can successfully handle.

 

Finding Bottlenecks Through Monitoring and Iterative Load Tests

A medium-size web development shop engaged us for its first experience with external, scale testing on a brand new application scheduled to roll out a few months after we started. The application is written in Ruby on Rails (ROR) with a MySQL backend, and serves as an internal social networking hub for a company. Through this application, users can create projects and tasks that are both work related and personal, invite others to join them in completing these tasks, share their calendars, and exchange instant messages with other online members. The front end is entirely JavaScript driven and consists of a number of widgets, all of which are updated through AJAX calls to the server. In addition, the client intends to host this application in the cloud, which would be a first for them.

The test consisted of 10 scenarios including simple browsing of the site, creating projects and tasks, and randomly sending instant messages to other online members. The most challenging part of test creation was dealing with the asynchronous nature of the application. There were no clear-cut events that broke the script into URLs or pages. There was only one URL followed by a number of AJAX calls to the server. CloudTest’s nested transaction grouping allowed us to put these requests into logical groups that represented all of the main business processes to be tested.

The first test was against a single server that contained all of the moving parts of the application and was hosted onsite. The application performed well until we reached 200 simultaneous virtual users. At that point, things got really slow. At about 300 users, the site was unusable and every request came back with an HTTP 500 error.

Our second test was against a more distributed environment. The application server and the database were split between two different machines, which were hosted in Amazon EC2. The site did slightly better, topping out at 300 users. Using CloudTest’s monitoring, we noticed that the database was not working very hard and that the web server was at nearly 100% CPU utilization for most of the test.

During subsequent tests we tried a variety of configurations. CloudTest makes it very easy to do iterative load testing. We spread the load even more by putting several application servers behind an EC2 load balancer and increasing the horsepower of the database server. Although most of these changes made a visible improvement, we were still faced with an obvious bottleneck. At roughly 800 simultaneous users, the application started to slow down, and at 1200 users it became almost unusable. At this point all of the usual suspects looked innocent. The app servers’ CPU never got higher than 50%, the database had plenty of breathing room and we were not limited by bandwidth.

When we looked at the “Collection Summary” report from CloudTest we realized that one particular transaction, which was taking longer to complete than others, needed more attention during the test. As you can see from the chart below, the “CREATE TASK” transactions took significantly more time to complete, even when there was little load on the system.

The ops team monitored the app server with New Relic, which has a ROR profiling module and is integrated into the SOASTA dashboards. We found that a CREATE TASK business process was using a ROR ORM Engine called Active Record to perform MySQL queries. A particular query in the CREATE TASK process was not “tuned”, and instead of running one or two SQL statements to retrieve the data, hundreds of queries were run each time a list of tasks was requested by the application. Although the queries themselves were very lightweight, running hundreds of them in sequence for each process slowed things down considerably.

Our findings were brought to the attention of the development team. The team was able to revise the process to access the same data with just one query to the database. As a result, in our subsequent tests we saw a great improvement in the CREATE TASK transaction. Monitoring the infrastructure during a load test once again proved invaluable, and having the results on the same timeline helped surface the issues much more quickly.
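The pattern described here, one query for a list followed by one more query per row, is often called the "N+1 query" problem. The sketch below reproduces the shape of it with an in-memory SQLite database. It is only an illustration: the tasks and users tables, their columns and the row counts are hypothetical, and the customer's actual schema and query are not known. In Active Record this class of problem is commonly addressed with eager loading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, owner_id INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(1, 4)])
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [(i, f"task{i}", (i % 3) + 1) for i in range(1, 201)],
)


def list_tasks_n_plus_one(conn):
    """One query for the list, then one more per row: 1 + N round trips."""
    queries = 1
    rows = conn.execute("SELECT id, title, owner_id FROM tasks ORDER BY id").fetchall()
    result = []
    for _task_id, title, owner_id in rows:
        owner = conn.execute("SELECT name FROM users WHERE id = ?", (owner_id,)).fetchone()[0]
        queries += 1
        result.append((title, owner))
    return result, queries


def list_tasks_joined(conn):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        "SELECT t.title, u.name FROM tasks t JOIN users u ON u.id = t.owner_id ORDER BY t.id"
    ).fetchall()
    return rows, 1


slow_rows, slow_queries = list_tasks_n_plus_one(conn)
fast_rows, fast_queries = list_tasks_joined(conn)
print(slow_queries, "queries vs", fast_queries)
```

Each individual lookup is cheap, which is why the database looked idle in monitoring; the cost is in the accumulated round trips per request.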

 

The Effects of Using Think Time to Adjust Level of Load

Some load tests are run with restricted resources, either because of deficiency in load generation muscle or from licensing constraints with the load-testing tool being used. Sometimes the performance engineer wants to quickly start a test and ramp up to peak traffic (however that may be defined) very quickly. In these situations, it is tempting to use the think times in your test case scenarios as a variable for adjusting the “level of load” that you drive. This “level of load” may be defined by “number of HTTP requests per second”.

To see where I’m going with this, suppose you have set up your test cases and workload to generate your target load with 500 virtual users and 15 second think times between pages. Now, suppose you need to generate the same target load, but with only 100 users. It might be tempting to assume that you can divide the think times by 5 and thus increase the throughput by a factor of 5, thereby allowing you to divide the number of virtual users by 5. The assumption can be stated as follows:

Assumption: If 500 virtual users with 15 sec think times yields x requests/sec, then 100 virtual users with 3 sec think times should also yield x requests/sec.

This assumption is flawed for a couple of big reasons:

  1. The response times are not being taken into account – both the think times and response times are factors in determining how much throughput your virtual users will get.
  2. This assumption requires making another assumption – your application will respond linearly with additional load, regardless of think time. That is, you are assuming that response times will be the same between the 500-user and 100-user workload. Additionally this assumes the application only cares about the average request rate and nothing else.

Point #1 above can be illustrated with basic math. A calculation of “average HTTP requests per second” can be made with the following equation:

We have two workloads in question. For now, let’s just go with the assumption that average response times should be the same for each workload. Let’s say we are dealing with 2.5 sec average response times and that each user is making 10 requests per page, on average (the HTML document + 9 page assets). Then we have the following:

500-user workload with 15 sec think times

100-user workload with 3 sec think times

Clearly the two workloads above are not equivalent in terms of requests/sec being generated. Now that we are taking the response times into account (and assuming they will be the same in each workload), we can set x = think time/page and then solve the resulting equation. Indeed, if you do this you will find that you need to use about a 0.1 second think time for the 100-user workload to generate the same throughput as the 500-user workload. How unnatural is it to have users with 0.1 sec think times? At the very least, it’s significantly different user behavior from the original workload.
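The comparison of the two workloads can be sketched numerically. The model below is an assumption, since the exact equation is not reproduced in this text: it treats each user as completing one page every (think time + response time) seconds and issuing a fixed number of requests per page. The function name and the reading of "response time" as a per-page figure are illustrative choices, so the absolute numbers may differ from the author's own calculation.

```python
def requests_per_second(users: int, requests_per_page: int,
                        think_time: float, response_time: float) -> float:
    """Assumed closed-workload model: each user finishes one page every
    (think_time + response_time) seconds and sends requests_per_page requests for it."""
    return users * requests_per_page / (think_time + response_time)


# Inputs taken from the text: 2.5 s average response time, 10 requests per page.
big = requests_per_second(users=500, requests_per_page=10, think_time=15.0, response_time=2.5)
small = requests_per_second(users=100, requests_per_page=10, think_time=3.0, response_time=2.5)

print(f"500 users, 15 s think time: {big:.1f} requests/sec")
print(f"100 users,  3 s think time: {small:.1f} requests/sec")
```

Under this model the 100-user workload produces noticeably less traffic than the 500-user workload, because the response time does not shrink when the think time is divided by 5.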

The second point above can be illustrated with a real-world experience of mine before I had the benefit of leveraging CloudTest Pro’s ability to scale to enormous numbers of users. I was working with a tool that is licensed by the number of concurrent virtual users in use. There were periods of time when multiple teams within the company were simultaneously making use of that license, thus constraining the number of users that each team could use.

I was testing a back office application, which essentially had 2 different app layers and a database layer. The test case was simple. It merely consisted of 2 web service calls – one to get a sales tax calculation and one to do an order placement, with think time in between. The sales tax request was made directly to the 1st app layer, which made several 3rd party calls outside, then forwarded the request to the 2nd app layer, which managed connections to the database. The 2nd app layer would then make a JDBC connection to the database and read the corresponding sales tax data. My standard workload consisted of 800 virtual users with a think time modeled after data from production logs. Through a little bit of calibration testing I figured out that I could reduce the number of virtual users to 160, reduce the think times significantly, and have a workload that still generated the same number of sales tax and order placement requests per second.

After multiple rounds of testing, with both workloads, I had a discrepancy. Both the 800-user and the 160-user workload generated the same transaction rate against the application, but only the former hit a bottleneck in the app. With the 160-user workload, the application ran flawlessly. With the 800-user workload the database CPU pegged at 100% and the connection pool became completely saturated. It turns out that longer think times can consume more resources than expected. In this case, the longer think times caused the apps to maintain and manage more concurrent connections, specifically database connections. With longer think times, enough concurrent connections were created to max out the database connection pool, thus causing a bottleneck and forcing the queuing of some incoming requests.
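One way to see how two workloads with the same request rate can hold very different numbers of connections is a simple model in which a connection stays open while its user is thinking, up to an idle timeout. This is a sketch under stated assumptions, not a description of the application tested here: the request rate, response time, idle timeout and pool size are hypothetical, and real connection-pool behavior depends on the server and driver configuration.

```python
def think_time_for_rate(users: int, target_rate: float, response_time: float) -> float:
    """Think time each user needs so the population produces target_rate requests/sec."""
    return users / target_rate - response_time


def connections_held(users: int, think_time: float,
                     response_time: float, idle_timeout: float) -> float:
    """Average open connections if each one lingers up to idle_timeout after a response."""
    cycle = think_time + response_time
    open_per_cycle = response_time + min(think_time, idle_timeout)
    return users * open_per_cycle / cycle


# Hypothetical settings, for illustration only.
TARGET_RATE = 40.0    # requests/sec, identical for both workloads
RESPONSE_TIME = 0.5   # seconds
IDLE_TIMEOUT = 30.0   # seconds a connection is kept after its last use
POOL_SIZE = 500       # maximum pooled connections

for users in (800, 160):
    think = think_time_for_rate(users, TARGET_RATE, RESPONSE_TIME)
    held = connections_held(users, think, RESPONSE_TIME, IDLE_TIMEOUT)
    status = "saturated" if held > POOL_SIZE else "ok"
    print(f"{users} users, think {think:.1f} s -> about {held:.0f} connections ({status})")
```

In this model both workloads send the same number of requests per second, but only the larger population exceeds the pool. Shortening the idle timeout changes the picture, which is consistent with timeout settings being one of the hidden variables.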

The underlying theme here is that there are so many variables involved in any web architecture that it’s best to model your virtual users as closely as possible to real-world users – model your think times carefully. In my case, the JBoss connection pooling configuration and connection timeout values were the unseen variables that responded to changing think times. Who knows what it will be in your case?

Stop Cheating on your Tests!

I suppose we could have used a less inflammatory title for our recent webinar. It makes it sound like testers have been purposely doing something wrong. Perhaps we could have titled the webinar “Now you can execute more accurate and informative tests!” But the folks in marketing were right, and the intriguing title attracted our largest group of attendees ever. For those of you who didn’t attend, or if you did and would like to review the messages, you can watch the webinar here. This was the first in SOASTA’s latest webinar series, “Cloud Testing – Rewriting the Rules of Performance Testing”. Future webinars include “Run More Tests and Find More Issues” on October 27th and “Test On Your Schedule across the Lifecycle” on November 15th.

In this webinar, Scott Barber, President and CTO of PerfTestPlus, joins SOASTA’s VP of Performance Engineering, Rob Holcomb, to discuss what performance engineers have done in the past to measure performance and find and fix issues, and why some of those techniques no longer reflect best practices. The focus is on web and mobile testing and why the higher scale, more distributed and often complex nature of that traffic is not well served by traditional testing tools or techniques.

After an introduction by SOASTA’s Brad Johnson, Scott, in his inimitable style, speaks to great effect about the four most common ‘cheats’ that performance testers have leveraged to overcome the constraints of inflexible test hardware, poor tool scalability, expensive pricing models and the lack of real-time information while testing. Scott begins his presentation by talking about the practice of modifying think times, typically to overcome licensing and/or hardware limitations imposed by the high cost of traditional load testing. His primary assertion: the only way to simulate production…is to simulate production. Interestingly, during the webinar a question came up suggesting that we’re testing computers, not humans, so why is accurately simulating user activity so important? In response, it was noted that variance in use clearly has an impact on what happens to the infrastructure.

The second point discussed is the common practice of extrapolating results from a staging environment to predict what will happen in production. Architectures can be complicated, and the impact of those differences along with the additional complexities of ‘the real world’ make extrapolation problematic, at best. The best way to validate that your production environment will handle expected load is to test in production as part of your overall test strategy. (For more on testing in production check out this SOASTA webinar). Modeling user flows incorrectly is the third point addressed by Scott, reinforcing the notion that we’re not functionally testing the application, but need to make sure we’re putting a realistic load on the back end.

Finally, Scott presents a very interesting problem to illustrate the challenges associated with measuring performance, and how it can be as much an art as a science. Rob follows Scott’s presentation and, using SOASTA’s CloudTest, illustrates how we can use modern tools to, well, stop cheating. We hope you enjoy(ed) the webinar.

Implementing 2-Legged OAuth in Javascript (and CloudTest)

Introduction

If you’re reading this you are probably looking for information on how to implement 2-Legged OAuth in Javascript. I recently had to implement 2-legged OAuth in a CloudTest performance test for one of our customers. Because 2-legged OAuth is not part of the official OAuth spec yet (as of 6/15/2011), there is relatively little information out there about how to make this all work. Where there is information, unfortunately it doesn’t work universally for all implementations, since there isn’t a specification for it. I hope this saves you some time… it definitely would have helped me out. You will need a working knowledge of Javascript to find the implementation details in this article useful. Without an understanding of Javascript you may find just the general OAuth overview interesting.

High Level OAuth Overview

OAuth is a way for applications to authenticate with one another. In essence a client application encrypts a string of values and passes that encrypted string, along with the values it used to encrypt it (except one, your secret key), to the server. The server then uses the values you sent across to look up your secret key and attempt to generate the same encrypted string you did. The server then compares the two encrypted strings. If they match, it’s a success. If not, it’s a failure.

The difference between 3-Legged OAuth and 2-Legged OAuth is that in the 3-Legged variant, the client first passes some credentials to the server and gets an access token back if authentication is successful.  Then this token is passed along in subsequent requests.  This is commonly called ‘the dance’ in OAuth developer circles.  When you authenticate with Netflix through various platforms (AppleTV, iPhone, Netflix.com), you do a 3-Legged OAuth dance.  This allows for users, applications, and authentication to be abstracted out into separate tiers.

Some other types of applications may be better suited for the authentication and message passing to happen in 1 request and 1 request only. This is where 2-legged comes in. In 2-legged OAuth you pass the encrypted string, the values used for encryption, and the message payload in 1 GET or POST. If it is rejected, the message fails. If it’s accepted then the message is processed. This particular app that I was working on testing was a central logging system. Every message was a log event. There was no time (or functional need) for a three-way handshake in this app and also no notion of a maintained state. 2-Legged OAuth cuts out the middleman. If authentication is successful the message is processed, no dancing around.


The Fragility of Web Applications

Most web applications have some code in them that if utilized even slightly more than expected could send the whole stack toppling over. Web apps and their associated infrastructure are fragile and they must be performance tested thoroughly across all key functional areas. This testing should be done above expected traffic levels… the following is an example of why this is so important.

I was recently testing for a very large online retailer. Their site has the typical shop, buy, and self service functional areas. On this particular application there is an Ajax call to create an empty shopping cart (sometimes referred to as a transient session) as soon as you start browsing the catalog. It’s a lightweight and seemingly harmless GET request that passes through to the app tier to initialize the empty shopping cart. This cart is an in-memory object at the application tier.

What we discovered through testing was that if we generated the exact profile of traffic they were expecting with people browsing the catalog and creating empty shopping carts, along with customers adding products to the cart, then there was enough capacity to perform well. However, if we adjusted the load mix just slightly to have either more empty carts, or more carts with items in them, then the entire application slowed down and ultimately fell completely over. This affected not only people in the shopping experience but everywhere… the entire site went down.

This really got me thinking about how fragile apps really are unless you test all of the different components past their expected load levels, and assess not only their performance, but also the performance of the components around them.

Every web application has a weak link somewhere. Do you know where yours is? I bet that a very small load test that makes one particular type of request directed at your application could have catastrophic results. It could be 10% more users logging in than normal or, in the worst case, more people trying to check out than you had planned for. I’m amazed at how many people market a flash sale and don’t change anything on their application to account for a totally different load profile than normal. We need to find those weak links and build them out to be more resilient.

Web Performance Optimization – Fun for Performance Engineers!

The Velocity conference is underway (SOASTA is one of the sponsors). Looking at the schedule, you can see that the topics cover all the critical factors to consider when building scalable, fast and reliable websites and services.

  • Operations: Deploying large and scalable cloud infrastructure, infrastructure automation, real-time monitoring, distributed systems, NoSQL databases, etc.
  • Mobile performance: Analyzing mobile performance, optimization, writing fast client-side code, device comparison, etc.
  • Web performance: Optimizing server-side scripting with Node.js, client-side optimization (images, JavaScript, browser-specific optimization, HTML5, etc.), automated web performance testing, etc.

Today’s performance engineers have the opportunity to get involved in all these areas, which is why this is such a great time to be in performance engineering.

Want proof? Take a look at the following project some of our performance engineers tackled some months ago.

The web application to be tested was developed by the government of one of the largest countries in the world. It was a form-based application that all adult citizens in the country had to use. We’re talking about a massive number of users! The application was globally distributed across multiple private datacenters to provide failover as well as the best possible user experience. The technical environment was a typical Struts application framework with an Oracle database as the back end. The target was to reach a clean test at 172,000 concurrent users using the application in the production environment. Yes, you read that right. In a production environment! Our performance engineers run a fair portion of their tests on production systems, especially at the end of the application development life cycle. This is the environment the application will actually be accessed from, and some problems found at that stage (load balancer problems, CDNs, bandwidth, etc.) can’t be found when the application is installed in a lab or a staging environment. If you want to learn more about performance testing in production, there is a webinar available.

This was a two-month project, and our engineers had time to perform state-of-the-art performance testing: starting with a low number of concurrent users (500) and fixing issues at this low level before moving on to higher levels, i.e. 1000, 5000, 10k, 50k, etc. That’s a fairly typical approach, but one sometimes overlooked by our customers. We usually educate them while going through our performance testing methodology.
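The stepped approach can be sketched in a few lines. This is a minimal illustration, not CloudTest code: `run_test` stands for whatever drives the load tool, `fake_run_test` is a made-up stand-in that returns invented error rates, and the 1% error budget is an assumed threshold.

```python
def run_stepped_ramp(levels, run_test, max_error_rate=0.01):
    """Run a test at each load level in order.

    Stops at the first level whose error rate exceeds `max_error_rate`,
    so issues get fixed at that level before moving higher.
    Returns the list of (users, error_rate) pairs that were run.
    """
    results = []
    for users in levels:
        error_rate = run_test(users)
        results.append((users, error_rate))
        if error_rate > max_error_rate:
            break
    return results


def fake_run_test(users):
    # Hypothetical stand-in for a real test run: pretends the system
    # is clean below 50,000 users and fails at or above that level.
    return 0.0 if users < 50_000 else 0.065


# Load levels taken from the steps described above.
LEVELS = [500, 1_000, 5_000, 10_000, 50_000]

for users, error_rate in run_stepped_ramp(LEVELS, fake_run_test):
    print(f"{users:>6} users: error rate {error_rate:.1%}")
```

The point of stopping early is that a problem visible at 500 users is far cheaper to diagnose there than when it is buried in the noise of a 50k test.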

Many problems were found and fixed before reaching 100k: server misconfiguration, oversized pages, poor client-side caching, SQL optimization, and login process optimization (one of the typical optimizations engineers deal with regularly, as companies tend to retrieve all kinds of heavy information during the login procedure).

One problem really drove our engineers crazy for a few weeks (the type of problem they love, as it is really challenging and rewarding to get fixed!). At a fairly substantial level of concurrent users (110k+), they were observing the following real-time chart:

They ran hundreds of tests at this level and were plagued by the same issue: as soon as the first users finished filling out their form and submitted it, overall response time went through the roof, throughput dropped significantly and the error rate reached 6.5%! Not a pretty picture, and a really bad user experience: spending 15 minutes filling out government forms and getting an error on submission is not what most users want to go through.

After many days of head scratching, the problem was finally identified. This application was distributed across the world, with a global load balancer taking the requests and routing them to local load balancers closer to the user. As it was a highly secured application, the global load balancer was responsible for serving a strongly encrypted (2048-bit) certificate during the submit process. When it was serving this certificate, CPU usage on the load balancer would go sky high and throughput would drop. Our engineers had the idea of reducing the encryption level to 128 bits as a test and, sure enough, CPU usage on the load balancer was normal and it was able to serve the certificate as expected. But as soon as the encryption level went back to 2048 bits, the problems returned. We ended up contacting the load balancer manufacturer, and the data collected during the tests was all the information they needed to provide a firmware fix.

These are the final test results with the firmware fix.

A very clean test! Very low average response time (440ms), an extremely low error rate (0.001%!) and 172,000 virtual users getting through the process without a hitch!

That’s what performance engineering is all about these days! Getting involved with large scale projects, learning the full scope of performance optimization, teaching customers performance best practices. And best of all, there is product innovation to bring engineers all this fun! CLOUDTEST! And soon a very big surprise for all performance engineers in the world … !

Now is the Time to be in Performance Engineering!

I’ve always considered performance engineering the most rewarding discipline in software testing. In my opinion, this is where you have the most opportunity to learn, especially technically. Great performance engineers follow Cem Kaner’s principles, described in his Bug Advocacy paper, and especially this one:

The best tester isn’t the one who finds the most bugs or who embarrasses the most programmers. The best tester is the one who gets the most bugs fixed.

It’s about finding the right ways to communicate problems and giving as much useful information as possible to the developers, DBAs and IT staff responsible for the infrastructure where the application under test resides. It’s about dealing with objections from these people, motivating them to take the problem seriously and to start investigating it. It’s also about pointing the investigation in the right direction. Great Performance Engineers need to be good salesmen and need an amazing amount of knowledge to get the issues they’ve found fixed, whether they are in the application code, the infrastructure in which the application resides or elsewhere in the overall architecture!

Great Performance Engineers get to learn about:

  • The intricacies of load balancers, especially since they’re one of the primary sources of contention when dealing with high volume applications. A lot of companies take load balancer configuration for granted and don’t bother testing their algorithm under load. A BIG mistake!
  • CDN configuration. Again one of the top problems our Performance Engineers find when testing applications from outside the firewall.
  • Bandwidth usage and its implication on the overall performance of the application.
  • Auto-scaling mechanisms.
  • Garbage collection, memory leaks, unoptimized database schema and queries, optimizing CPU consumption, etc.
  • Everything about front-end optimization: browser caching, Expires headers, cache busters, image optimization, lazy loading, progressive rendering, etc.

Performance Engineers are able to test today at a scale they couldn’t dream about 4 years ago. Look at the test below: a 58 min test with 7 Terabytes of data received! A “big data” problem Performance Engineers can have fun with these days.
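For a sense of scale, the figures quoted above can be turned into an average data rate. This is a back-of-the-envelope sketch that assumes decimal terabytes (10^12 bytes) and a constant rate over the whole 58 minutes, neither of which is stated in the test report.

```python
def average_throughput_gbps(total_bytes, duration_seconds):
    """Average data rate in gigabits per second (1 Gbit = 10**9 bits)."""
    return total_bytes * 8 / duration_seconds / 1e9


# Assumption: "7 Terabytes" means 7 * 10**12 bytes (decimal units).
TOTAL_BYTES = 7 * 10**12
DURATION_S = 58 * 60  # a 58-minute test

rate = average_throughput_gbps(TOTAL_BYTES, DURATION_S)
print(f"{rate:.1f} Gbit/s")  # prints "16.1 Gbit/s"
```

That is roughly 16 Gbit/s sustained for almost an hour, which is why bandwidth shows up in the list of things performance engineers need to understand.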

They’re able to test from inside and outside the firewall, providing coverage for problems they couldn’t previously replicate. With CloudTest, they can get performance results in real time and have conversations with developers, DBAs, Ops and other IT constituents during the test, increasing their chances of solving problems quickly, and of learning. A recent engagement with a large telecommunications company in the US brought 90 people together during the two-hour test. A great learning opportunity!

If you’re eager to learn, and help companies get the best performance from their application, this is the best time to be in performance engineering. Best of all, SOASTA is hiring!
