The Velocity conference is underway (SOASTA is one of the sponsors). A look at the schedule shows that the topics cover all the critical factors to consider when building scalable, fast, and reliable websites and services:
- Operations: deploying large, scalable cloud infrastructure, infrastructure automation, real-time monitoring, distributed systems, NoSQL databases, etc.
- Mobile performance: analyzing mobile performance, optimization, writing fast client-side code, device comparison, etc.
Today’s performance engineers have the opportunity to get involved in all of these areas, which is why this is such a great time to be in performance engineering.
Want proof? Take a look at a project that some of our performance engineers tackled a few months ago.
The web application to be tested was developed by the government of one of the largest countries in the world. It was a form-based application that all adult citizens in the country had to use. We’re talking about a massive number of users! The application was globally distributed across multiple private datacenters to offer failover options and to provide the best possible user experience. The technical environment was a typical Struts application framework with an Oracle database as the back end. The target was a clean test at 172,000 concurrent users using the application in the production environment. Yes, you read that right. In a production environment! Our performance engineers run a fair portion of their tests on production systems, especially at the end of the application development life cycle. This is the environment the application will actually be accessed from, and some of the problems you find at that stage (load balancer problems, CDNs, bandwidth, etc.) can’t be found when the application is installed in a lab or in a staging environment. If you want to learn more about performance testing in production, there is a webinar available.
This was a two-month project, so our engineers had time to do state-of-the-art performance testing: starting with a low number of concurrent users (500), fixing issues at that level, and then moving on to higher levels (1,000, 5,000, 10k, 50k, and so on). That’s a fairly typical approach, but one that some of our customers overlook. We usually educate them as we walk through our performance testing methodology.
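The stepwise ramp described above can be sketched as a simple loop. This is only an illustration, not SOASTA’s actual methodology or any CloudTest API: `run_test`, `fake_system`, the `max_error_rate` threshold, and the 100k step are hypothetical names and values chosen for the example.

```python
from typing import Callable, Optional, Sequence

# Step levels: 500 to 50k come from the post, 172,000 is the stated target.
# The 100k step is an assumption added to bridge the gap.
LEVELS: Sequence[int] = (500, 1_000, 5_000, 10_000, 50_000, 100_000, 172_000)


def ramp(run_test: Callable[[int], float],
         levels: Sequence[int] = LEVELS,
         max_error_rate: float = 0.01) -> Optional[int]:
    """Run each load level in order and stop at the first one that fails.

    run_test(users) is a stand-in for a real load test; it returns the
    observed error rate (0.0 to 1.0) at that number of concurrent users.
    Returns the first failing level, or None if every level was clean.
    """
    for users in levels:
        error_rate = run_test(users)
        if error_rate > max_error_rate:
            return users  # fix issues at this level before going higher
    return None


def fake_system(users: int) -> float:
    """Toy system under test that starts failing above 110,000 users."""
    return 0.065 if users > 110_000 else 0.0


print(ramp(fake_system))  # prints 172000
```

The point of gating each step on a clean result is that problems are isolated at the lowest load that triggers them, which keeps each investigation small.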
Many problems were found and fixed before reaching 100k: server misconfiguration, oversized pages, poor client-side caching, SQL queries in need of optimization, and a login process in need of optimization (a typical fix that engineers deal with regularly, as companies tend to retrieve all kinds of heavy information during the login procedure).
One problem really drove our engineers crazy for a few weeks (the type of problem they love, as it is really challenging and rewarding to fix). At a fairly substantial level of concurrent users (110k+), they were observing the following real-time chart:
They ran hundreds of tests at this level and were plagued by the same issue: as soon as the first users finished filling out their forms and submitted them, overall response time went through the roof, throughput dropped significantly, and the error rate reached 6.5%! Not a pretty picture, and a really bad user experience: spending 15 minutes filling out government forms and then getting an error on submission is not what most users want to go through.
After many days of head scratching, the problem was finally identified. The application was distributed across the world, with a global load balancer taking the requests and routing them to local load balancers closer to the user. As it was a highly secured application, the global load balancer was responsible for serving a strongly encrypted (2048-bit) certificate during the submit process. When it was serving this certificate, CPU usage on the load balancer would go sky high and throughput would drop. Our engineers had the idea of reducing the encryption level to 128 bits as a test, and sure enough, CPU usage on the load balancer stayed normal and it was able to serve the certificate as expected. But as soon as the encryption level went back to 2048 bits, the problems returned. We ended up contacting the load balancer manufacturer, and the data collected during the tests was all the information they needed to provide a firmware fix.
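To get a rough feel for why key size can matter so much for CPU load, the sketch below times modular exponentiation, the core arithmetic behind an RSA private-key operation, at several operand sizes. It is an illustration only: it uses random odd numbers rather than real keys, it runs in plain Python rather than on load balancer hardware, and the `time_modexp` helper and the chosen sizes are assumptions made for the example. It says nothing about the specific issue the manufacturer fixed in firmware.

```python
import random
import time


def time_modexp(bits: int, rounds: int = 20, seed: int = 0) -> float:
    """Return the average seconds for one pow(base, exp, mod) with `bits`-bit operands."""
    rng = random.Random(seed)
    # Force the top bit so operands really have `bits` bits; force the modulus odd.
    mod = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
    exp = rng.getrandbits(bits) | (1 << (bits - 1))
    base = rng.getrandbits(bits) % mod
    start = time.perf_counter()
    for _ in range(rounds):
        pow(base, exp, mod)  # three-argument pow is modular exponentiation
    return (time.perf_counter() - start) / rounds


for bits in (128, 1024, 2048):
    micros = time_modexp(bits) * 1e6
    print(f"{bits:>4}-bit operands: {micros:.1f} microseconds per operation")
```

The absolute numbers depend entirely on the machine, but the cost grows much faster than the operand size, which is consistent with a device coping at a low setting and saturating at 2048 bits.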
These are the final test results with the firmware fix.
Very clean test! Very low average response time (440 ms), extremely low error rate (0.001%!), and 172,000 virtual users getting through the process without a hitch!
That’s what performance engineering is all about these days! Getting involved with large-scale projects, learning the full scope of performance optimization, and teaching customers performance best practices. And best of all, there is product innovation to bring engineers all this fun! CLOUDTEST! And soon a very big surprise for all performance engineers in the world … !
About the Author