This was surprising to us, because we had certainly tested the ability to order our product, and customers were able to order other products without fail. It was also frustrating, because it was happening sporadically, and we weren't able to reproduce the failures on our own machines.
We found a stack trace in the logs:
Unexpected error: java.lang.NumberFormatException: For input string: "undefined"
at java.lang.Integer.parseInt(Integer.java:492)
at [redacted].getRotation
Weird. Somehow the rotation of certain images was being set to "undefined". The error itself is easy enough to suppress: wrap our Integer.parseInt(...) call in a try/catch and fall back to a default value when an invalid number is passed. But why was the error happening in the first place?

We traced some code and saw that the rotation value is set when images are first imported into our application. The initial import doesn't include all of the info we need (rotation among it), so we make a service call to fetch the rotation along with other important image information. Was that service call failing somehow? We checked the code, and everything looked good: no missing null checks, all of the lookups appeared to be properly formatted, no apparent race conditions, and we were using methods that are used all over our website without fail.
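As a minimal sketch of the defensive fix described above -- the class and method names here are ours for illustration, not the redacted production code:

```java
// Hypothetical sketch: RotationParser and safeParseRotation are illustrative
// names, not the actual (redacted) production class.
public class RotationParser {
    static final int DEFAULT_ROTATION = 0;

    // Fall back to a sane default when the upstream value is missing or
    // malformed (e.g. the literal string "undefined" from the failed fetch).
    public static int safeParseRotation(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return DEFAULT_ROTATION;
        }
    }
}
```

This silences the symptom, but as the rest of the post explains, the real question is why "undefined" was showing up at all.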
Ok, so the code looks good. Could there be a server issue? Let's look at some error logs. Sure enough, our top error (by far) is a timeout issue on our fetchImageInfo call. In fact, 435 of the 652 errors we have seen over the past 24 hours -- that's 66 percent -- are related to that service call.
But how could our service be failing? Our Load and Performance tests pass, verifying that the app can handle the expected load, and we aren't exceeding that load -- we aren't even meeting it, since the app is in A/B testing and only being presented to 10% of our users. Yet the errors we're seeing indicate that we're putting too much pressure on our service.
After digging into the Load and Performance tests, we found our problem. As you might have guessed from the title of this post, the problem ended up being that our L&P tests were not actually emulating our real world usage of the service.
Our service was designed to take an array of ImageIds in JSON format, fetch the appropriate image information, and return an array of objects containing that info. Our L&P tests grabbed 100 random test users, selected a random image album from each, constructed an array of those images' IDs, sent it to the service, and verified that proper data came back. These test calls were run simultaneously, and the tests had thresholds for response time, throughput, and failure rate (<1).
The problem was that, instead of constructing an array of ImageIds, our application was sending each imageId to the service one at a time, as each image was imported. So instead of making one call to the service, the app was making n nearly simultaneous calls, which quickly overloaded the system when a user imported a large collection of images. A service that was expected to support fewer than 20 requests per second was sometimes being forced to handle hundreds of requests per second.
The solution, again, is fairly simple: we construct a bundle of ImageIds as images are imported, send that bundle to the service in a single call, and then inject the response data into the imported images.
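The batching approach can be sketched roughly as follows. Everything here is illustrative -- ImageInfo, fetchImageInfo, and the field names are stand-ins for the real (redacted) service API, and the stubbed fetch just echoes ids back:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the batched import. ImageInfo and fetchImageInfo
// are illustrative stand-ins, not the production API.
public class ImageImporter {
    static class ImageInfo {
        final String imageId;
        final int rotation;
        ImageInfo(String imageId, int rotation) {
            this.imageId = imageId;
            this.rotation = rotation;
        }
    }

    // Stand-in for the real service: one call, many ids. The real service
    // would return actual rotations; this stub returns 0 for each id.
    static List<ImageInfo> fetchImageInfo(List<String> imageIds) {
        List<ImageInfo> out = new ArrayList<>();
        for (String id : imageIds) {
            out.add(new ImageInfo(id, 0));
        }
        return out;
    }

    // Collect every id from the import, make ONE service call, then map
    // the response back to the requesting images by id.
    public static Map<String, ImageInfo> importImages(List<String> imageIds) {
        Map<String, ImageInfo> byId = new HashMap<>();
        for (ImageInfo info : fetchImageInfo(imageIds)) {
            byId.put(info.imageId, info);
        }
        return byId;
    }
}
```

The id-to-info map is the small extra bookkeeping the post mentions: with one batched response, each result has to be looked up by id rather than being trivially paired with the single image that requested it.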
The lessons learned are twofold: L&P tests are only useful if they emulate real-world usage, and developers need to make sure they're optimizing their network traffic. In our case, we wrote our L&P tests to exercise the service as it was intended to be used: processing bundles of imageIds. But when the service was integrated into our application, laziness and a desire for simplicity led us to request info for each image, one at a time, as it was loaded. After all, making individual calls meant we didn't need to uniquely track each image, since each response could be trivially mapped to the requesting image. The proper solution meant spending time looking up each item in the response, which required some sort of map of IDs to images. The result of this misstep was a service that, based on our tests, we thought could handle our expected load, but that buckled under the pressure of actual users.