December 13, 2005 - 06:00 UTC
Okay - we're out of the woods as far as the current server issues go. As with most things around here, the actual problem was well disguised and the eventual solution simple in essence.
Early Monday morning, December 5, we started dropping connections to our upload/download server (kryten). See the posts below for more exposition. We shuffled services around, added a web server to the fray, tuned file systems, and tweaked Apache settings, all to no avail. We checked if we were being DoS'ed - we were not. Was our database a bottleneck? No.
Progress was slow because we were also fighting with the master database merge. And every fix would require a reboot or a lengthy waiting period to see if positive progress had been made.
By Friday we were out of smoking guns. At this point kryten was only doing uploads and nothing else - reading from sockets and writing files to local disk. What was its problem? We decided to convert the file_upload_handler into a FastCGI process. I (Matt) applied the conversions and Jeff figured out how to compile it, but it wasn't working. We left it for the weekend, and shut off workunit downloads to prevent aggravating the upload problem with more results in the mix.
When we all returned on Monday, David made some minor optimizations to the backend server code (removing a couple of excess fstat() calls) and I finally remembered that printf(x) and fprintf(stdout,x) are two very different things according to FastCGI. We got the file_upload_handler working as a FastCGI process this afternoon.
We weren't expecting very much, since the file_upload_handler doesn't access the database. It basically just reads a file from a socket and writes it to disk. So the FastCGI version would only save us process spawning overhead and that's it.
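To illustrate where that spawning overhead comes from, here is a rough sketch in C. It is not the actual file_upload_handler code: handle_upload, serve_spawn_per_request, serve_persistent and compare are hypothetical names, the "upload" is just a checksum over a small buffer, and plain fork() stands in for the full fork-and-exec a web server does for each CGI request (so it likely understates the real cost). It assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for handling one upload: checksum a buffer. */
unsigned handle_upload(const char *data, size_t len) {
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + (unsigned char)data[i];
    return sum;
}

/* CGI-style: a fresh process per request. Returns requests completed. */
int serve_spawn_per_request(int n, const char *data, size_t len) {
    assert(n >= 0);
    int done = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* could not spawn: stop early */
        if (pid == 0) {                 /* child: do the work, then exit */
            volatile unsigned sink = handle_upload(data, len);
            (void)sink;
            _exit(0);
        }
        int status = 0;                 /* parent: wait for the child */
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            done++;
    }
    return done;
}

/* FastCGI-style: one long-lived process handles every request in a loop. */
int serve_persistent(int n, const char *data, size_t len) {
    assert(n >= 0);
    volatile unsigned sink = 0;
    int done = 0;
    for (int i = 0; i < n; i++) {
        sink = handle_upload(data, len);
        done++;
    }
    (void)sink;
    return done;
}

/* Wall-clock time in seconds from a monotonic clock. */
double seconds_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Time both models on n identical requests and print the results. */
void compare(int n) {
    const char payload[] = "example result file contents";
    size_t len = sizeof payload - 1;

    double t0 = seconds_now();
    int spawned = serve_spawn_per_request(n, payload, len);
    double t1 = seconds_now();
    int looped = serve_persistent(n, payload, len);
    double t2 = seconds_now();

    printf("process per request: %d requests in %.6f s\n", spawned, t1 - t0);
    printf("persistent process:  %d requests in %.6f s\n", looped, t2 - t1);
}
```

Calling compare(1000) from a main() prints the two timings. The absolute numbers depend entirely on the machine, but when the per-request work is this small, process creation is nearly all of the cost in the first case.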
But that was more than enough. We were handling only a few uploads a second before, but the FastCGI version handled over 90 per second right out of the box. Within a couple of hours we caught up from a week of backlog. Of course, this put new pressure on the scheduler, as clients with uploaded results want more work. We imagine everything will be back to normal come morning. We're leaving several back-end processes off overnight just to make sure.
Meanwhile the master database merge is successfully chugging along, albeit slowly, in the background.