I’ve been spending some more time today load-testing webserver software in preparation for the online release of the next Harvey Danger album. When I “wrote last”:http://www.geekfun.com/archives/000612.html I’d just finished looking into resource utilization of “Apache2”:http://httpd.apache.org/docs/2.0/ (with both the prefork and worker multiprocessing modules), “thttpd”:http://www.acme.com/software/thttpd/, and “lighttpd”:http://www.lighttpd.org/.
Today I’ve been taking a closer look at lighttpd and thttpd. What I’ve found has been sort of a mixed bag.
After my first round of tests, I was favoring lighttpd, but still considering Apache2 with the prefork MPM because it was “good enough,” and probably better understood. As you can see from “the comments”:http://www.geekfun.com/archives/000612.html#comments on that post, it didn’t take me long to doubt my judgment about Apache being good enough for this application. This doubt came largely from a shift in my thinking about what a worst-case scenario would be. One of the comments on my blog pointed out that too many users with slow connections could require high connection counts even before the server’s available bandwidth became a limiting factor. I’m not really comfortable with the idea of Apache running 2000 processes or threads if we get a spike of 2000 concurrent users, nor do I like the thought of queuing up 1000 connections and leaving them idle to keep the thread pool size down. This might just be superstition, but I don’t like it.
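To put rough numbers on the slow-connection worry, here is a back-of-the-envelope sketch. The 100Mbps link is the figure used elsewhere in these posts; the per-client speeds are assumptions picked for illustration, not measurements of our audience.

```python
def connections_to_saturate(link_bps: int, client_bps: int) -> int:
    """How many clients, each downloading at client_bps, fit into link_bps."""
    return link_bps // client_bps


LINK_BPS = 100_000_000  # 100Mbps server uplink

# Assumed client speeds, for illustration only.
CLIENT_RATES = {
    "56k modem": 56_000,
    "768k DSL": 768_000,
    "3M cable": 3_000_000,
}

for name, rate in CLIENT_RATES.items():
    count = connections_to_saturate(LINK_BPS, rate)
    print(f"{name}: {count} concurrent downloads before the link is full")
```

With mostly dial-up users the connection count gets well into four digits before bandwidth runs out, which is exactly the case where a process or thread per connection starts to look expensive.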
So, my worry has now turned to lighttpd, and whether it is really up to the task. In my earlier tests, memory utilization with lots of connections was exceedingly reasonable, but I worried about the possibility of a memory leak. One or two people on the lighttpd mailing list had complained that their processes would swell to 500MB or more after a week or so under heavy load, and they didn’t seem to be getting any help with the problem. I set out to see what would happen to memory over longer periods than I’d used in my earlier tests.
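One way to watch for a leak over a long run on Linux is to sample the process's entry under /proc at intervals. This is only a sketch of that idea, not the method used for the numbers in this post; the function names are made up for illustration.

```python
import re


def parse_proc_status(text: str) -> dict:
    """Pull the Vm* fields (reported in kB) out of /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(Vm\w+):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def sample_memory(pid: int) -> dict:
    """Read the current memory figures for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_proc_status(status_file.read())
```

Calling `sample_memory` on the server's pid every minute or so and logging `VmRSS` and `VmSize` would show steady growth long before a process reached the 500MB people reported.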
For these tests, I used one of the actual servers we’ll be hosting on. Jeff ordered it last weekend and it was set up in a few hours. I spent some time upgrading it to Debian 3.1 in order to get the 2.6 kernel. Today I downloaded the latest release of lighttpd (1.3.16), “Siege”:http://www.joedog.org/siege/ (2.61) and thttpd (2.25b) and compiled them all.
First up, I ran tests on lighttpd. Siege, the load testing software, and lighttpd were both running on the same system to maximize throughput. The results were interesting.
It was hard to get really high levels of concurrency. Running the tests through the loopback interface was so fast that even with a 60MB file, a large number of the virtual clients were in refractory mode between requests at any given time. Add to that some weird and worrying behavior with as few as 80 virtual users where some connections would seem to be ignored for long periods of time, even as other connections were being serviced.
The tests were still interesting. I tried lots of different variations over the course of seven hours or so without killing the lighttpd process. Over that period, 5.28TB of data was transferred in over 1M requests (if the math doesn’t work out, it’s because I also threw some small files into the mix along with the 60MB file). That’s almost 3x the traffic we plan on pushing a month with a single server (the cost per MB goes up after we go through the initial 2TB transfer allotment that comes with the server). Even with all that data and all those connections, lighttpd didn’t grow to more than 6000KB resident, 1500KB shared and 6900KB of virtual. Very good news.
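For anyone who wants to check the arithmetic, here is a small sketch using the figures above. It assumes decimal units (1TB = 10^12 bytes), exactly 1M requests (the real count was somewhat higher) and a flat seven hours, so treat the results as approximate.

```python
TB = 10**12  # assuming decimal terabytes
MB = 10**6   # assuming decimal megabytes


def summarize(total_tb: float, requests: int, hours: float, allotment_tb: float) -> dict:
    """Derive per-request size, average throughput and allotment multiple."""
    total_bytes = total_tb * TB
    return {
        "avg_request_mb": total_bytes / requests / MB,
        "avg_rate_mb_s": total_bytes / (hours * 3600) / MB,
        "allotment_multiple": total_tb / allotment_tb,
    }


stats = summarize(total_tb=5.28, requests=1_000_000, hours=7, allotment_tb=2)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

The average request comes out at about 5.28MB, far less than 60MB, which is the small files showing up in the count. The average rate works out to roughly 210MB/s sustained, and the total is about 2.6 times the 2TB monthly allotment.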
Less good were the problems I mentioned a bit earlier. If I had anything above 50 or so virtual users downloading the 60MB file, there was a good chance that new connections for files (small or large) would end up in limbo indefinitely, never being serviced with data. Stranger still, Siege was able to start new downloads of files. Most would be serviced, but some would sit idle for too long and get timed out after 3+ minutes. At the same time, I was running JMeter on my home machine with 10 threads requesting a single small file, and they would all stop, waiting for data for long periods of time. The only thing that would clear things up and get new connections working again was to cease all downloads of large files.
I’m not sure what to make of this behavior. I had lighttpd configured to use the Linux 2.6 kernel-specific epoll IO event handler for most of the tests. Suspecting this might be the problem, I used poll instead. I didn’t test it as extensively, but this seemed to work better for the most part. There were fewer timeouts and no connection seemed to be entirely neglected, but it still seemed twitchy. I also wonder if this might have been influenced by running everything on one machine with client and server sharing the same TCP/IP stack, and with incredible effective bandwidth (well over 1Gbps). On the other hand, I saw some odd behavior with the tests I was running on my home network that might be another manifestation of the same problem. I’m going to have to try those tests again.
I also ran some tests against thttpd. It doesn’t seem to be able to push data over the loopback interface as fast as lighttpd, topping out at maybe 150MB/s, while I was seeing as much as 250MB/s with lighttpd. Of course, I’ve already established that both of them can saturate a 100Mbps connection. I haven’t run thttpd as long, but it looks like its memory consumption is pretty steady too, which I’d expect from a piece of high-performance software that’s been around as long as it has.
I’m also realizing that I’d probably misinterpreted its memory consumption on my earlier outing. Virtual and shared memory seem to go up in proportion to the size of each file opened. Resident memory is related to the size of the files being actively downloaded. Downloading a 60MB file via DSL won’t have a big impact on the resident size, but multiple clients downloading that file over a really fast interface, or the loopback address, will drive the resident size of thttpd up to the file size plus a little overhead. This is apparently due to thttpd’s use of memory-mapped IO, and the way the Linux kernel is managing the disk cache and virtual memory. When a single client is reading a file, it’s not enough to cause the whole thing to be loaded into RAM, but as concurrency and download rate go up, more is kept in real memory. In my LAN tests I think it was about 45MB, probably reflecting a combination of free memory in the system (only 512MB of RAM compared to a GB in our webserver), and the rate at which it was being fed to clients.
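The memory-mapped IO behavior is easier to see in a toy example. This is a sketch of the general technique in Python, not thttpd's actual C code, and the helper names are made up. Mapping a file adds its full size to the process's virtual size right away, but pages only become resident as they are actually read.

```python
import mmap
import os
import tempfile


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Map the whole file, but touch only the pages covering one slice."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def make_test_file(size: int) -> str:
    """Create a throwaway file with a repeating 0..255 byte pattern."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size)))
    return path


path = make_test_file(1_000_000)
try:
    chunk = read_slice(path, 4096, 16)
    print(len(chunk), "bytes read from a", os.path.getsize(path), "byte mapping")
finally:
    os.remove(path)
```

A slow client walks through the mapping a few pages at a time, so little of it needs to stay in RAM at once; many fast clients touch most of the file in short order, and the resident figure climbs toward the file size.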
Thttpd is tempting for its maturity, but from what I’ve been able to read, it doesn’t support HTTP resume, which allows people to restart partial downloads without downloading the whole file again. Resume would be a very good thing for us, since it would suck to end up resending big parts of files to people if there are problems with the download. Without it, any issues we run into with the server despite our best efforts would be compounded, and the repeated transfers would eat into our bandwidth allotment. Also, from poking at the source code, it doesn’t use the Linux sendfile API. There is a sendfile patch available, but it’s for an older version.
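For reference, resume works through HTTP range requests: the client sends a `Range` header naming the first byte it still needs, and a server that supports it answers with `206 Partial Content` and a `Content-Range` header. A minimal sketch of the client side of that exchange follows; the function names are hypothetical, and no real download client is being quoted.

```python
import re


def resume_headers(bytes_already_received: int) -> dict:
    """Build the request headers needed to continue a partial download."""
    if bytes_already_received < 0:
        raise ValueError("byte count cannot be negative")
    if bytes_already_received == 0:
        return {}  # nothing saved yet, so request the whole file
    return {"Range": f"bytes={bytes_already_received}-"}


def parse_content_range(value: str) -> tuple:
    """Parse 'bytes START-END/TOTAL' into (start, end, total or None)."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if not match:
        raise ValueError(f"unrecognized Content-Range: {value!r}")
    total = None if match.group(3) == "*" else int(match.group(3))
    return int(match.group(1)), int(match.group(2)), total


print(resume_headers(45_000_000))
print(parse_content_range("bytes 45000000-59999999/60000000"))
```

In that example a client that already has 45MB of a 60MB file only costs us the remaining 15MB. A server without range support would ignore the header and send all 60MB again with a plain `200`.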
In any case, I’m going to run the stock version overnight and see how things look in the morning.