Finding Bottlenecks Through Monitoring and Iterative Load Tests

A medium-sized web development shop engaged us to run its first external, at-scale load test against a brand-new application scheduled to roll out a few months after we started. The application is written in Ruby on Rails (RoR) with a MySQL backend and serves as an internal social networking hub for a company. Through it, users can create projects and tasks, both work-related and personal, invite others to help complete them, share their calendars, and exchange instant messages with other online members. The front end is entirely JavaScript-driven and consists of a number of widgets, all of which are updated through AJAX calls to the server. In addition, the client intends to host this application in the cloud, which would be a first for them.

The test consisted of 10 scenarios, including simple browsing of the site, creating projects and tasks, and randomly sending instant messages to other online members. The most challenging part of test creation was the asynchronous nature of the application: there were no clear-cut events that broke the script into discrete URLs or pages, only a single page load followed by a stream of AJAX calls to the server. CloudTest's nested transaction grouping let us collect these requests into logical groups representing each of the main business processes under test.
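
To make that concrete, here is a rough sketch (this is not CloudTest syntax, and the host and endpoints are hypothetical) of what one such business process looked like at the HTTP level: a single page load followed by a burst of AJAX calls, grouped and timed as one unit:

    require "net/http"
    require "json"
    require "uri"

    BASE = URI("http://app.example.com")   # hypothetical host

    # Time a logical group of requests, mirroring a nested transaction.
    def timed(label)
      start = Time.now
      yield
      puts format("%-15s %.3fs", label, Time.now - start)
    end

    Net::HTTP.start(BASE.host, BASE.port) do |http|
      timed("CREATE TASK") do
        http.get("/")                                      # initial page load
        http.post("/ajax/tasks",                           # create the task
                  { name: "demo task" }.to_json,
                  "Content-Type" => "application/json")
        http.get("/ajax/tasks?view=list")                  # refresh the widget
      end
    end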

The first test ran against a single server, hosted onsite, that contained all of the moving parts of the application. The application performed well until we reached 200 simultaneous virtual users, at which point response times degraded sharply. At about 300 users the site was unusable, and every request came back with an HTTP 500 error.

Our second test ran against a more distributed environment: the application server and the database were split across two machines, both hosted in Amazon EC2. The site did slightly better, topping out at 300 users. Using CloudTest's monitoring, we noticed that the database was not working very hard while the application server sat at nearly 100% CPU utilization for most of the test.
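
CloudTest charted these host metrics alongside the test results, but the metric itself is simple. A minimal sketch, assuming a Linux host, derives CPU utilization from two /proc/stat samples:

    # The first line of /proc/stat holds cumulative CPU jiffies:
    # user, nice, system, idle, iowait, ...
    def cpu_times
      File.read("/proc/stat").lines.first.split.drop(1).map(&:to_i)
    end

    before = cpu_times
    sleep 5
    after = cpu_times

    deltas = after.zip(before).map { |a, b| a - b }
    idle   = deltas[3]                    # the fourth field is idle time
    busy   = deltas.sum - idle
    puts format("CPU utilization: %.1f%%", 100.0 * busy / deltas.sum)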

During subsequent tests we tried a variety of configurations; CloudTest makes iterative load testing easy. We spread the load further by putting several application servers behind an EC2 load balancer and increasing the horsepower of the database server. Although most of these changes brought a visible improvement, we still faced an obvious bottleneck: at roughly 800 simultaneous users the application started to slow down, and at 1,200 users it became almost unusable. At this point all of the usual suspects looked innocent. The app servers' CPU never rose above 50%, the database had plenty of breathing room, and we were not limited by bandwidth.

When we looked at the "Collection Summary" report from CloudTest, one transaction stood out as needing more attention during the test: the "CREATE TASK" transactions took significantly more time to complete than the others, even when there was little load on the system.

The ops team monitored the app server with New Relic, which has a Rails profiling module and integrates with the SOASTA dashboards. We found that the CREATE TASK business process used Active Record, the Rails ORM, to perform its MySQL queries. One query in the process was not tuned: instead of running one or two SQL statements to retrieve the data, the application ran hundreds of queries each time it requested a list of tasks. Although the queries themselves were very lightweight, running hundreds of them in sequence for each request slowed things down considerably.
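
This is the classic "N+1 query" pattern, and it is easy to reproduce. The sketch below uses hypothetical models (the client's real schema was more involved) and an in-memory SQLite database standing in for MySQL; with SQL logging on, Active Record prints one query for the task list plus one per row:

    # gems: activerecord, sqlite3
    require "active_record"
    require "logger"

    ActiveRecord::Base.establish_connection(adapter: "sqlite3",
                                            database: ":memory:")
    ActiveRecord::Base.logger = Logger.new($stdout)  # print every SQL statement

    ActiveRecord::Base.connection.create_table(:tasks) { |t| t.string :name }
    ActiveRecord::Base.connection.create_table(:assignments) do |t|
      t.integer :task_id
      t.integer :user_id
    end

    class Task < ActiveRecord::Base
      has_many :assignments
    end

    class Assignment < ActiveRecord::Base
      belongs_to :task
    end

    100.times do |i|
      Task.create!(name: "task #{i}").assignments.create!(user_id: i)
    end

    # The N+1 pattern: one SELECT for the task list, then one more per task.
    Task.all.each { |task| task.assignments.to_a }   # 101 queries in the log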

We brought our findings to the development team, which revised the process to fetch the same data with a single query. As a result, subsequent tests showed a dramatic improvement in the CREATE TASK transaction. Monitoring the infrastructure during a load test once again proved invaluable, and having the monitoring results on the same timeline as the test results helped surface the issue much more quickly.
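
As an aside for Rails readers: we can't reproduce the team's exact rewrite here, but the standard Active Record remedies for this pattern are eager loading or a single joined query. Continuing the sketch above:

    # Eager loading: one SELECT for the tasks and a second for all of their
    # assignments, two queries total regardless of the number of tasks.
    Task.includes(:assignments).each { |task| task.assignments.to_a }

    # Or one joined SELECT when only specific columns are needed.
    Task.joins(:assignments).pluck("tasks.name", "assignments.user_id")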
