Unfair comparisons

The following are my personal opinions. I am not speaking for my company.

When we posted the first results from our framework benchmarks project, we anticipated indifference and silence. We were pleasantly surprised—blown away really—by the response.

The response has been predominantly positive (thank goodness), with some of it inquisitive and a smattering negative.

Among the inquisitive responses, the most common question concerns mismatched or unfair comparisons, sometimes posed as a statement that the project's results are therefore incorrect or invalid.

Why have you compared MySQL results to Mongo results?
Why have you compared full-stack frameworks to platforms?
Why have you compared EC2 m1.large to i7?
Why have you compared Rails to Servlets?
ASP.NET to Go?
Django to Express?
Or any other seemingly-crazy pairing?

Why not?

I don't mean this flippantly. But if you ask, "why do you compare X and Y?" you should consider the opposite question: "why not?"

Your opinion may be that a particular comparison is unfair or irrational, but opinions and needs vary. We've seen too many readers delighted by discovering unexpected comparisons to suppress them. For example, some readers consider this project a comparison of physical hardware versus AWS. That wasn't our intent, but why not allow them to make those points?

I've also seen many conclusions I personally disagree with, but I don't want to change the way the results data is presented just to stop people from reaching them.

A while back, a reader asked if a test with an embedded database would be accepted. That's an interesting idea, I said. Our scripts aren't ready to conduct such a test and the results user interface doesn't have the proper filters, but I don't see why we couldn't do that in the future.

I personally wouldn't run a web application with an embedded database. Or at least today I can't think of the use-case. Maybe I would some day. The point is: who am I to say it's absolutely invalid?

If you are going to give me data and I can choose to read it or ignore it, why would I say you should not have collected the data at all? Data is not invalid simply because I don't want or need it.

For the sake of argument, imagine the resulting performance data from an embedded database test was ultra high. That would surprise me and I'd be thankful that I now know. I'm still not necessarily moved to use an embedded database, but it's good to know how it performs. I can leverage that later if a fitting use-case lands on my desk. In what universe would it be better to simply not have the data at all?

Yes, there is the matter of setting priorities, and we haven't made testing with embedded databases a priority. But if all the necessary pieces—including the results rendering—were magically done and all I had to do was hit the GitHub "merge" button, why would I refrain?

This is not one-on-one

I now embrace the cliché "apples versus oranges," because that is what we're doing. Not a single apple versus a single orange, though. We are comparing a bunch of apples and a bunch of oranges. In fact, we've got a veritable fruit stand here. Some bananas, pears, and apricots have been contributed by local farmers.

Again, without the fruits metaphor: we're comparing a bunch of platforms and full-stack frameworks. In fact, we've got all types of HTTP servers. Some micro-frameworks, some general purpose network platforms (e.g., Netty), and some special-purpose frameworks (e.g., content management systems) shoe-horned to run our test types. At this point, most have been contributed by subject matter experts or fans of each framework via pull requests.

It's not just Netty versus Sinatra. It's not just Tornado versus Servlet. It's the whole spectrum, each on their own. You can compare any subset you want.

Not all readers are the same

If you'll humor me while I provide some advice to those seeking to protect the minds of newbies:

Do you want to remove unfair comparisons because you feel an uninformed reader may not be savvy enough to interpret the results, to understand why the numeric results alone are not enough to form a conclusion? You may be trying to save novice readers from the potential mistake of comparing two (or more) things that, at their skill level, they lack the background to compare.

If that's the case, I'd ask that you change your tone to cautionary but not preventative. Preventing the comparison narrows perspective. Novices will listen to you, narrow their perspective, and it may take them considerable time and an unnecessary barrier of hesitation to later re-open their perspective when the time comes. Instead, simply caution that there is more to the results than numbers. Novices should take a look at the source code and configuration to get a taste of what's behind the scenes. They should evaluate the developer community for each framework and consider their comfort level.

With respect to this matter of comparing dissimilar things, we've identified at least three categories of consumers of the results data:

  1. Novice readers who may reach an (obviously?) silly conclusion. From now on, I'll use Servlet for my web applications! This is the type of response the next category is trying to suppress. Note that a novice is also likely to conclude: interesting data, I'll dig into that more later. Novice doesn't mean dumb, and we should stop pretending that it does.
  2. Advanced readers who perhaps remember the pain caused by badly interpreting a mismatched comparison earlier in their career. They want to provide a teaching moment but do so with sweeping generalizations such as "you can't compare full-stack frameworks to platforms! GTFO!"
  3. Advanced or expert readers who find genuine value in comparing diverse results. A framework author probably will want to target the high-water mark set by the underlying platform. A developer of a small service may be open to considering anything (a framework or platform) in order to achieve maximum throughput and minimum latency. A savvy reader may have already worked with MySQL, Mongo, and Postgres and wants to see a comparison of real-world peak throughput. Bottom line: some readers have wanted to make cross-cutting comparisons like these but are busy people, so they appreciate receiving data from others.

For our project, the default view is wide-open, allowing all reader types to digest the data as they see fit. Yes, some novices will make funny conclusions and other readers can caution them. But suppressing that discovery not only hurts their ability to learn precisely how dissimilar comparisons can lead to expensive mistakes, but it also lowers the value for other advanced readers.

We're not going to change our results view to disallow wide comparisons. Maybe some day we will change the default view to be a narrower subset, but that remains an open question. I fear it may taint the reader's interpretation, predisposing them to a popular point of view, and as much as possible I'd prefer the results view to be just-the-facts.

When instructing novices, instead of "you may not," or "you cannot," use phrasing such as "be careful when," or "look beyond just the numbers." Maybe go so far as bluntly asking them a leading question such as, "Do you understand the difference between Postgres and Mongo?"

This isn't academic; it's not science

We're not academics. We didn't publish the results in a journal. We don't receive funding to do this.

We wanted actionable data about the real-world production-grade performance of a wide variety of web platforms and frameworks.

We've tried to be disciplined in our approach and procedures, but we don't have the time to set up control groups, create and audit a robust taxonomy of software attributes, and then divide the data into processed slices for static consumption. To date, we have captured the data, done some sanity checks, fixed some tests, re-captured, and, once we feel we've caught enough of the stupid mistakes, put the data out for others to view. The round-over-round progression makes the data provisional: errors can and will be corrected in subsequent rounds, most often by community contribution.

I'm not saying that is a good or better approach, just that it is what it is.

A desire to segregate dissimilar inputs (e.g., platforms from full-stack frameworks) may in part come from an academic predisposition of some readers.

If this were an academic project, I cringe at the challenge of taxonomy alone.

Consider Express, for which we figured the canonical data store is Mongo. Our options were:

  1. Exclude Express because all other implementations at that time used MySQL. This would have sucked if you want to see how Express performs.
  2. Include Express with Mongo only. This would have sucked if you think MySQL should not be compared to Mongo.
  3. Include Express with MySQL. This would have sucked if you are a fan of Express and feel MySQL is not canonical.
  4. Include Express with both Mongo and MySQL tests, allowing the reader to decide what to look at.

We went with the last option. Given infinite time, the ideal is full coverage of all permutations, with the necessary user-interface tools to filter accordingly.
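To illustrate, the full coverage that last option points toward is just a permutation matrix plus reader-side filtering. This is a hypothetical sketch, not the project's actual tooling; the framework and database names are only the ones discussed above:

```python
from itertools import product

# Hypothetical test matrix: pair every framework with every database.
frameworks = ["Express", "Django", "Rails"]
databases = ["MySQL", "Mongo", "Postgres"]

# Full coverage of all permutations, as option 4 suggests.
test_matrix = list(product(frameworks, databases))

# A reader-side filter then narrows the view to whatever subset
# a given reader cares about -- e.g., only the Express results.
express_only = [(fw, db) for fw, db in test_matrix if fw == "Express"]
```

The point of the sketch is that the data collection and the comparison are separable: collect everything once, and let each reader apply their own filter.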

Web development is too opinionated a field to academically specify right and wrong.

Web-app taxonomy is a continuum

We've attempted to slot each implementation into various pigeon-holes: classification (full-stack, micro-framework, platform), ORM type (full, micro, raw database connectivity), and so on. But this process assumes there are no gray areas. In reality, each framework is its own snowflake, so errors and disagreements are expected.

Rigidly disallowing comparisons would make the inevitable classification failures more than just disappointing. They would be frustrating: You've categorized node.js as a platform so therefore I can't even compare it to Flask?
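One way to see why rigid categories are the wrong gatekeeper: treat classification as filterable attributes rather than walls. A minimal sketch, with made-up attribute values that only echo the examples above (not the project's real taxonomy):

```python
# Hypothetical attribute tags; the values are illustrative only.
implementations = [
    {"name": "node.js", "classification": "platform", "orm": "raw"},
    {"name": "Flask", "classification": "micro-framework", "orm": "raw"},
    {"name": "Rails", "classification": "full-stack", "orm": "full"},
]

def subset(impls, **criteria):
    """Return implementations matching every given attribute,
    letting the reader choose any cross-cutting slice."""
    return [i for i in impls
            if all(i.get(k) == v for k, v in criteria.items())]

# A reader who wants raw-database implementations gets node.js and
# Flask side by side, even though they sit in different pigeon-holes.
raw_db = subset(implementations, orm="raw")
```

With attributes instead of hard walls, a misclassification is a one-field correction rather than a comparison the reader was never allowed to make.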

Aside: nothing beats testing your app

Some dismiss benchmarks because nothing beats testing your application. That's true. If you could implement your application on dozens of frameworks and platforms and then test it, you'd be in a great position to make an expertly-informed decision.

However, I don't have the time to implement my application on a bunch of frameworks to see what performs well and what does not.

Instead, I must select a platform and framework based on a combination of factors, one of which is proxy data: performance measurements of work that may be similar to the work done by my application.

We've intentionally crafted tests that are simple. That simplicity has two benefits:

  1. The tests are relatively easy to implement across a wide spectrum.
  2. The tests set what can be considered high-water marks for performance.

If you are doing one query per request, it's unlikely your query will be simpler than what we've tested, so your app is not going to perform better than our high-water mark. If you are going to render a response using a server-side template, it is unlikely to be smaller than our test. With such high-water marks, you know the performance wall built by your platform and framework. It's the very maximum you could theoretically achieve in your application after aggressive optimization. Real-world applications are likely to perform at 10% of that theoretical maximum, or even much less, because no one optimizes out of the gate.
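As a back-of-the-envelope sketch of that rule of thumb (the numbers, function name, and 10% default are hypothetical planning inputs, not results from the project):

```python
def estimated_real_world_rps(benchmark_peak_rps: float,
                             efficiency: float = 0.10) -> float:
    """Rough planning estimate: real applications often run at ~10%
    (or much less) of the high-water mark set by simple benchmark tests."""
    return benchmark_peak_rps * efficiency

# If a framework's single-query test peaks at a hypothetical
# 100,000 requests/sec, a realistic planning figure is far lower:
estimate = estimated_real_world_rps(100_000)
```

The estimate is crude by design; its value is as a ceiling check (can this stack possibly reach my target?) rather than a prediction.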

That may be useful information to you. It may not, of course. But why dismiss its value for others?