There are two common metrics for web server performance:
- Requests processed per second or "RPS."
- Concurrent connected sockets or "Connections."
Generally, a high-performance server will process many requests per second and perform well when many connections are live, so the two metrics are definitely not mutually exclusive. However, they are measuring two different things.
Requests per second (RPS)
In today's web, the vast majority of interactions between clients and servers take place via a traditional HTTP dialogue composed of requests and associated responses. While momentum is happily building behind WebSockets, adoption remains relatively low.
A canonical HTTP dialogue involves the creation of a TCP socket, the client sending a request to the server, the server doing some work in order to furnish data to send as a response, and finally tear-down of the socket. Today it is conventional to use HTTP "keep-alives," wherein the client and server agree not to tear down the socket immediately, allowing subsequent requests to be sent on the same socket (with some constraints).
And in some cases requests are pipelined—later requests are sent even before the server has responded to prior requests, thereby avoiding some round-trip latency. Pipelining is used in scenarios where a set of requests is known to be necessary regardless of the server's response, such as collecting a bunch of static assets.
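As a rough sketch of keep-alive reuse (not pipelining, which node's HTTP client does not support), the example below sends two sequential requests over a single persistent socket. The host, port, and paths are hypothetical placeholders.

```ts
import http from "node:http";

// A keep-alive agent asks the server to leave the TCP socket open after each
// response so later requests can reuse it and skip a fresh TCP handshake.
const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });

function get(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    http
      .get({ host: "localhost", port: 8080, path, agent }, (res) => {
        let body = "";
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", reject);
  });
}

async function main() {
  // Both requests travel over the same socket, assuming the server honors
  // keep-alive; a hypothetical server on localhost:8080 is assumed here.
  const first = await get("/assets/app.css");
  const second = await get("/assets/app.js");
  console.log(first.length, second.length);
}

main().catch(console.error);
```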
Theoretical ideal for RPS: 0 response time
From a requests-per-second perspective, the theoretical ideal that all servers aim for is zero response time. Of course, zero is unachievable, so the goal is simply to minimize the duration of each request.
Therefore, when building a dynamic web application—be it something with server-side composition of HTML or a JSON/REST API—a consistent goal of the application developer is to minimize the time spent processing each request. Again, the theoretical ideal is to process each request in 0 milliseconds. The closer to instantaneous, the better. A developer might be happy with 20 ms, satisfied with 200 ms, troubled by 2,000 ms, and panicked about 20,000 ms.
After all, from a reductionist point of view, a request that requires twenty seconds to process might be fully consuming one of the server's CPU cores for that entire time, meaning other users' requests are getting queued waiting for service. This type of situation tends to snowball, leading to widespread request timeouts or errors.
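To make the reductionist point concrete, here is a back-of-the-envelope sketch (the numbers are hypothetical) of how per-request processing time caps CPU-bound throughput:

```ts
// Rough upper bound on CPU-bound throughput: each core can complete at most
// (1000 ms / per-request ms) requests per second, and capacity scales with
// core count. Real servers lose some of this to I/O waits and overhead.
function maxRps(perRequestMs: number, cores: number): number {
  return (1000 / perRequestMs) * cores;
}

console.log(maxRps(20, 8));    // ~400 RPS on 8 cores at 20 ms per request
console.log(maxRps(200, 8));   // ~40 RPS
console.log(maxRps(20000, 8)); // ~0.4 RPS: requests pile up and time out
```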
Concurrent connections
Thanks to the popularity of a scalability challenge called the "C10k problem" (or, more verbosely, the "10,000 concurrent connections problem"), a variety of servers and platforms have provided mechanisms for establishing and maintaining a large number of live HTTP sockets. Among these, popular wisdom points to nginx and node.js as two platforms particularly well-suited to large numbers of concurrent connections. It turns out there are several such platforms.
Benchmark exercises have been crafted to simulate extremely large numbers of concurrent live HTTP sockets. Node.js has famously been shown to service at least 250,000 sockets, while http-kit on Clojure was recently shown to service 600,000 live HTTP sockets.
That is a ton of sockets. The exercise even requires some configuration trickery to get past the roughly 65,536-port limit on sockets between a single client IP address and the server. Whoever designed TCP sockets clearly didn't anticipate http-kit.
Theoretical ideal for concurrency: idle
Ask yourself: why would it be necessary to service 10,000, 100,000, or 600,000 connected sockets concurrently? Presumably to provide some level of real-time delivery of data from the server without a prior request from the client. That's a WebSocket use-case. Whereas traditional HTTP requests can only originate on the client, WebSockets allow data to be sent from the server to a connected client without the client knowing to ask for the data (that is, without polling).
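As an illustration of that server-push use-case, here is a minimal sketch; it assumes the third-party ws package for node.js, and the port, interval, and payload are arbitrary.

```ts
import { WebSocketServer, WebSocket } from "ws";

// Accept and hold many mostly-idle WebSocket connections.
const wss = new WebSocketServer({ port: 8080 });
const clients = new Set<WebSocket>();

wss.on("connection", (socket) => {
  clients.add(socket);
  socket.on("close", () => clients.delete(socket));
});

// Server-initiated delivery: push a message to every connected client on a
// timer. No client polled for this data; between pushes the sockets sit idle.
setInterval(() => {
  const message = JSON.stringify({ serverTime: Date.now() });
  for (const socket of clients) {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(message);
    }
  }
}, 5000);
```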
If you're maintaining that many concurrent connections and you're not using WebSockets, then at the very least you're doing something that approximates WebSockets by way of polling. You're leaving the connection idle for a period of time and periodically sending an HTTP request (on a keep-alive socket) from the client to ask for data.
If you're not doing that either, then concurrent connected sockets may not be relevant to you at all. You're probably better off resolving each client's request with a response as fast as possible so that you can release the socket and be prepared to handle another client's subsequent request.
All of this is because the theoretical ideal for maximizing concurrent connected sockets is to have all of these sockets be completely idle. As with the theoretical ideal for RPS (0 time), this is unrealistic. In the real world, you have these connections live because you have to do something with them. Once you add the work that must be done into the mix, the problem reduces back down to making that work happen as quickly as possible.
In other words, in both cases, it's optimal to be able to execute the hard work of providing data to clients as quickly as possible. If you have 600,000 connected sockets but your CPU cores are all saturated doing work, all of those connections will starve for data and it will all have been for naught.
Is concurrency of mostly-idle connections hard?
Well, it's not trivial. But demonstrating extremely high concurrency of mostly-idle connections is not particularly interesting unless you are building a high-client low-computation WebSocket-enabled system (or can conceive of some other use-case for keeping sockets alive despite their low utilization).
Risk of incorrect priorities
If we emphasize concurrent connections over RPS today, when WebSockets remain (sadly) quite uncommon, we may be unwittingly focusing on the less important metric.
The hard part for servers remains resolving the data needed by clients as quickly as possible.
Servicing a large number of mostly-idle connections is valuable for some use-cases. If that is your use-case and you're seeking a server that can handle a high number of concurrent connections, you're on the right track. But that remains an uncommon use-case versus the much simpler matter of responding to traditional HTTP requests as quickly as possible.
If you are building a (still) traditional web application but seeking a server that handles a high number of concurrent idle connections, you may be looking in the wrong places.
However, to be clear: the platforms that succeed at one of the two metrics are often incidentally quite capable at the other. But there may be a better fit if you know which metric is actually meaningful for you.
Our recent benchmarks
In the recent web application framework benchmarks that my colleagues and I put together at TechEmpower, we were measuring HTTP requests per second. (We'd like to add a WebSocket suite of tests, but haven't done so yet.) Measuring RPS explains why the concurrency levels we're using in our load simulation tool (originally WeigHTTP, but as of this week, Wrk, so that we can get latency measurements) are such apparently low numbers as 64, 128, and 256.
It's easy to wonder why we didn't test at high concurrency numbers like 8K or 100K. The short answer is that the HTTP servers would simply queue up the requests and quickly begin dropping requests on the floor with 500-series errors. In (nearly) all cases, the load-generating client can fully saturate the server's CPU cores with 256 or fewer concurrent requests.
A load tool such as Wrk ensures the target concurrency level (say, 256) by creating 256 requests, and then from that point forward, creating exactly one new request as soon as any one of the current requests completes.
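A simplified sketch of that closed-loop behavior follows (this is not Wrk itself; the URL, request count, and use of node 18's global fetch are assumptions):

```ts
// Closed-loop load generation: keep exactly `concurrency` requests in flight.
// Each worker has one request outstanding at a time and issues a replacement
// the moment its current request completes.
async function run(url: string, concurrency: number, total: number) {
  let started = 0;
  let completed = 0;

  async function worker() {
    while (started < total) {
      started++;
      try {
        const res = await fetch(url); // global fetch, node 18+
        await res.arrayBuffer();      // drain the body so the socket can be reused
      } catch {
        // a real tool would tally errors separately
      }
      completed++;
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  console.log(`completed ${completed} requests`);
}

run("http://localhost:8080/", 256, 100_000);
```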
Again, once the client's request load is sufficient to saturate the server's CPU cores, there's nothing more to gain by increasing client-side concurrency except to exercise the server's small request queue and then laugh at its inevitable misery and failure.
Server-side concurrency fan-outs
As a related matter, modern web application frameworks are rapidly adopting design patterns that emphasize server-side fan-out of concurrency. I want to briefly discuss this to at least establish some reasonable expectations for anyone who has not observed web application behavior under high load first-hand.
Because they are used by multiple users, web applications are natively—without additional work—concurrent. (There are some exceptions, but those are generally egregious design failures and not typical.)
If your server has eight CPU cores and you receive one request each from eight users at precisely the same moment, each request will be handled in parallel—one on each core. That is the natural concurrency of web applications. Furthermore, thanks to some aspects of your web application going idle for moments of time (for example, as it waits to receive results from a database query), and also by the fairness of the operating system's thread scheduler, a single CPU core will service many requests concurrently.
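On a platform whose processes are single-threaded, such as node.js, that one-request-per-core parallelism typically comes from running one process per core. Here is a minimal sketch using node's built-in cluster module (the port is arbitrary):

```ts
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

if (cluster.isPrimary) {
  // One worker process per CPU core; the operating system schedules them in
  // parallel, so eight simultaneous requests can each land on a different core.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  // Workers share the listening socket; incoming connections are distributed
  // among them.
  http
    .createServer((req, res) => {
      res.end(`handled by pid ${process.pid}\n`);
    })
    .listen(8080);
}
```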
As the computational complexity and the frequency/concurrency of requests go up, the CPU utilization across all cores will increase until ultimately all cores are fully saturated.
At this point, if additional requests arrive, the web server will typically place them in a small request queue, in an attempt to give currently-processing requests some breathing room to complete. But if the request frequency doesn't taper off, even that queue will fill up and ultimately 500-series HTTP errors will be returned to clients.
In their effort to scale an application that is routinely saturating its CPU cores, some developers will be swayed by currently-popular architectural approaches such as making their application more "asynchronous," which usually means either fanning out work to additional threads on the server or attempting to replace threads with single-threaded event loops.
Unfortunately, when your server's CPU cores are already fully saturated by the natural concurrency that comes from being a web application, fanning out the computational complexity of each request using additional thread pools will not increase performance. In fact, performance may be slightly diminished due to the increased overhead.
Single-threaded event loops may give some very small benefit in throughput versus threading but the fundamental computational cost of the work will not get smaller.
When designing a web application to maximize the utility of your server's CPU power, the best thing you can do is simply make your code fast. That might sound absurdly reductionist, but making code fast is often what's secretly at play when people are happily surprised by a major leap forward in performance after switching their application architecture.
Happily surprised by increased performance after moving from Rails to node.js? Thank Google's V8 JavaScript engine for simply being wickedly fast compared to Ruby. The asynchronous stuff is simply icing on the cake.
Caveat: low-utilization
An important caveat to the above is that for servers with low utilization, server-side fan-outs are in fact beneficial. The reason is that in this case—where you have a request running on CPU Core 1 while CPU Cores 2 through 8 are idle—if you can break the work of resolving the request into eight parts, fanning it out means you can resolve the request in (ideally) one eighth the total elapsed time.
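A sketch of that kind of fan-out, assuming the request's work really does split into independent CPU-bound parts (the part count and the placeholder computation inside each worker are hypothetical):

```ts
import { Worker } from "node:worker_threads";

// Run one CPU-bound slice of the request's work on its own thread.
function runPart(part: number): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(
      `
      const { parentPort, workerData } = require("node:worker_threads");
      // Placeholder for the real computation assigned to this slice.
      let sum = 0;
      for (let i = 0; i < 1e8; i++) sum += i % (workerData + 1);
      parentPort.postMessage(sum);
      `,
      { eval: true, workerData: part }
    );
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

// On an otherwise idle 8-core server, the eight slices run in parallel and the
// request completes in roughly one eighth of the sequential elapsed time
// (minus the overhead of spinning up the threads).
const parts = [0, 1, 2, 3, 4, 5, 6, 7];
Promise.all(parts.map(runPart)).then((results) => {
  console.log(`combined ${results.length} partial results`);
});
```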
In low-utilization scenarios, you are not fighting the built-in concurrency of web applications. So adding a second tier of concurrency is generally a good thing since it can improve your ability to respond to users' requests quickly.
The danger is in not realizing when the balance has tipped and you may be incurring additional cost and reduced throughput because of the fan-out. When the balance tips this way, you may achieve higher overall throughput if you reduce the architectural complexity and handle the work of resolving each request sequentially as fast as possible.