Ran into an interesting problem recently.
My team created an HTML-based book preview application, called Slideshow, about two years ago. In one of its incarnations, our app is embedded in a div on another page on our site. It looks like this:
This all worked well for many, many months. Then, about two weeks ago, a bug is assigned to us saying that the navigation buttons are displaying as blank squares. Odd -- I jump onto our site to see if I can reproduce it, and, sure enough, I see this:
The navigation buttons still behave properly; they just don't render on the page. I think maybe the assets are failing to load, but it appears that we're fetching, and successfully receiving, the proper graphic assets. So I inspect the elements, and there we find our problem:
The CSS for Slideshow's navigation buttons is being overridden by the host page's CSS, which has a global style giving all <a> tags white backgrounds and grey borders.
I check the revision history for the host page's CSS, thinking someone must've recently added this global anchor-tag style, but the CSS file hasn't been changed in months. What about Slideshow's CSS? Same story: it hasn't been changed in months. If none of the relevant CSS has changed, how could this have broken?
I search through some code, and I see that Slideshow's CSS is programmatically added to the end of the <head> element via JavaScript. Viewing the source of the host page, I see the offending CSS file is included within the <body> of the page. That sounds like it might be our problem: I'm betting that, even though Slideshow's CSS is added to the page last via a function call, it's being stomped on by the host page's CSS because it resides in the <body> of the page.
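Slideshow's actual injection code isn't shown here, but the general pattern is only a few lines of DOM work. A minimal sketch, with a made-up function name and file path, looks something like this:
// Hypothetical sketch: append a stylesheet to the end of the <head> at runtime.
function addSlideshowCss(href) {
    var link = document.createElement('link');
    link.rel = 'stylesheet';
    link.type = 'text/css';
    link.href = href;
    // This lands after everything already in the <head>, but anything linked
    // inside the <body> still comes later in document order.
    document.getElementsByTagName('head')[0].appendChild(link);
}
addSlideshowCss('/slideshow/slideshow.css'); // path is invented for this example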
I investigate the host JSP: It includes a file called commonHeader.jspf, which basically looks like this:
<head>
<content tag="css">
<link rel="stylesheet" type="text/css" href="{our offending css}" >
<link rel="stylesheet" type="text/css" href="{some other css}" >
<link rel="stylesheet" type="text/css" href="{some other css}" >
</content>
<content tag="text/javascript">
{some javascript imports}
</content>
</head>
Strange -- it looks like the offending CSS *is* within the <head> element as defined by the JSP, but, when it is compiled into the served page, it ends up in the <body>. How does that happen? Well, it turns out we use SiteMesh to decorate our pages, and there is a totally separate header file that is automagically included in all of our site's rendered pages. That header file invokes a special tag to load content into itself (note the <content> tags in the above .jspf code), which means that the <head> and </head> tags in commonHeader.jspf are completely ignored by our site. Fun!
I investigate this newly found header file, and I see that someone made a change about a month ago that moved the tag that includes the "css" content from the <head> element to the <body> element. I swapped it back, and our Slideshow looked beautiful again. Hooray, our assumptions were correct! Loading the host page's CSS from within the <body> was causing it to override Slideshow's CSS, even though Slideshow's CSS was loaded later. The two rules have the same specificity, so the cascade falls back to document order: a stylesheet sitting in the <body> comes after anything appended to the end of the <head>, no matter when it was added.
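Here's a tiny stand-alone repro of that ordering rule -- not our actual code, just two placeholder rules standing in for the host page's global anchor style and Slideshow's button style -- that you can paste into a browser console on a blank page:
// The <body> stylesheet exists first, mimicking the host page's static markup.
var hostStyle = document.createElement('style');
hostStyle.textContent = 'a { background: white; }';
document.body.appendChild(hostStyle);

// "Slideshow's" rule is appended to the <head> afterwards, via JavaScript.
var slideshowStyle = document.createElement('style');
slideshowStyle.textContent = 'a { background: transparent; }';
document.head.appendChild(slideshowStyle);

var anchor = document.createElement('a');
document.body.appendChild(anchor);

// Logs "rgb(255, 255, 255)": the <body> rule wins because it comes later in
// document order, even though the <head> rule was added later in time.
console.log(getComputedStyle(anchor).backgroundColor);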
So what did we learn from this adventure? First, we learned that our JSP rendering frameworks are not easy to trace; it took an entire day's worth of searching -- and a little bit of luck -- to find the offending code. More importantly, though, we learned that it's a good idea to stick to the accepted standard of loading all external CSS files within the <head> of your page, even though most, if not all, browsers don't strictly enforce it.
I did some Googling to see what other developers thought about loading CSS in the <head> vs. the <body> of a document, and most of the results spoke about best practices and page load time (it turns out the browser is forced to re-render the page when it finds CSS in the middle of the document), which are great reasons in themselves. However, the issue we ran into shows that there can be actual rendering problems when CSS is included in the <body>, especially if other code on the page assumes that all stylesheets live in the <head>, as Slideshow did when it appended its own CSS to the DOM.
What are your thoughts? Have you run into similar issues? Or, from the other side, have you ever found cause to include CSS, or JavaScript, within the <body> of your page? Tell me about it in a comment below!
Monday, June 15, 2015
Load and Performance Tests Should Emulate Real-World Situations! Also, Don't Be Dumb.
We've been working on a new application to allow customers to order prints of photos, and recently released it to production. After a week or so, we began getting notifications from Customer Service saying that customers were having trouble ordering our product. They were able to create the product in our application and add the product to the cart, but were seeing errors when they tried to check out.
This was surprising to us, because we had certainly tested the ability to order our product, and customers were able to order other products without fail. It was also frustrating, because it was happening sporadically, and we weren't able to reproduce the failures on our own machines.
We found a stack trace in the logs:
Unexpected error: java.lang.NumberFormatException: For input string: "undefined"
at java.lang.Integer.parseInt(Integer.java:492)
at [redacted].getRotation
Weird. Somehow the rotation of certain images is being set as "undefined". It's easy enough to solve this error -- just throw a try/catch around our Integer.parseInt(...) call, and set default values if an invalid number is passed. But why is this error happening in the first place?
We trace some code, and see that the value for rotation is set when images are first imported into our application. The initial import doesn't have all of the info we need (including rotation), so we make a service call to fetch rotation, along with other important image information. Is that service call failing somehow? We check the code -- everything looks good: no missing null checks, all of the lookups appear to be properly formatted, no apparent race conditions, and we're using methods that are used all over our website without fail.
Ok, so the code looks good. Could there be a server issue? Let's look at some error logs. Sure enough, our top error (by far) is a timeout issue on our fetchImageInfo call. In fact, 435 of the 652 errors we have seen over the past 24 hours -- about 67 percent -- are related to that service call.
But how could our service be failing? We have passing Load and Performance tests that verify the app can handle our expected load, and we're not exceeding that expected load (we're not even reaching it, as the app is in A/B testing and only being presented to 10% of our users), yet the errors we're seeing indicate that we're putting too much pressure on our service.
After digging into the Load and Performance tests, we found our problem. As you might have guessed from the title of this post, our L&P tests were not actually emulating our real-world usage of the service.
Our service was designed to take an array of ImageIds in JSON format; it then fetched the appropriate image information and returned an array of objects containing that info. Our L&P tests grabbed 100 random test users, selected a random image album from each, constructed an array of those images' IDs to send to the service, and then verified that the proper data came back. These test calls were run simultaneously, and the tests had thresholds for response time, throughput, and failure rate (< 1%).
The problem was that, instead of constructing an array of ImageIds, our application was sending each imageId to the service one at a time, as each image was imported. So instead of making one call to the service, the app was making n nearly simultaneous calls, which quickly overloaded the system when a user imported a large collection of images. A service that was expected to support fewer than 20 requests per second was sometimes being forced to handle hundreds of requests per second.
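Concretely, the import code was doing something like this sketch (the endpoint, payload shape, and field names are assumptions, not our actual code) -- one request per image, instead of the single batched call per user that our tests simulated:
// Hypothetical sketch of the anti-pattern: one fetchImageInfo request per image.
function fetchInfoForOneImage(imageId) {
    return fetch('/services/fetchImageInfo', {            // assumed endpoint
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify([imageId])                    // an "array" of one
    }).then(function (response) { return response.json(); });
}

function onImageImported(image) {
    // Called once per imported image, so a 200-image album means
    // roughly 200 nearly simultaneous requests to the service.
    fetchInfoForOneImage(image.id).then(function (infos) {
        image.rotation = infos[0].rotation;                // assumed response shape
    });
}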
The solution, again, is fairly simple: we collect a bundle of ImageIds as images are imported, send that bundle to the service in a single call, and then inject the response data into the imported images.
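A sketch of that fix, under the same assumptions about the endpoint and response shape, with a simple map of IDs to images so each result can find its way back to the right image:
// Hypothetical sketch of the batched version: one request per import, not per image.
function importImages(images) {
    // Map ids to images so each response object can be matched back up later.
    var imagesById = {};
    images.forEach(function (image) { imagesById[image.id] = image; });

    return fetch('/services/fetchImageInfo', {             // assumed endpoint
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(Object.keys(imagesById))      // one array of all ids
    })
    .then(function (response) { return response.json(); })
    .then(function (infos) {
        // Inject the fetched data (rotation, etc.) into the imported images.
        infos.forEach(function (info) {
            imagesById[info.imageId].rotation = info.rotation;  // assumed field names
        });
        return images;
    });
}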
The lessons learned are twofold: L&P tests are only useful if they emulate real-world situations, and developers need to ensure that they're optimizing their network traffic. In our case, we wrote our L&P tests to test the service as it was intended: processing bundles of imageIds. However, when the service was integrated into our application, laziness and our desire for simplicity caused us to request info for each image, one at a time, as it was loaded. After all, making individual calls meant we didn't need to uniquely track each image, as each response could be easily mapped to the requesting image. The proper solution meant we'd need to spend time looking up each item in the response, which meant we'd need some sort of map of IDs to images. The result of this misstep was that we had a service that, based on our tests, we thought could handle our expected load, but that buckled under the pressure of actual users.