Thursday, May 24, 2007

Server Postmortem (plus graphs)

Upon solving a clue, teams would call in the answer to an automated server, which would tell them where to go next, assuming they were correct. During the course of the game, the server received almost 2000 calls from 66 different phones. Here are some graphs of interesting data:


The phone load graph is particularly interesting: you can see when the phone server blew up on the first clue, as well as when teams started to hit the Virus clue or left a bonus site.

Note: I am a complete gnuplot n00b, and so if you have suggestions as to how to make the graphs look better or suggestions on other data that might be interesting to graph, drop me an email at offpath@gmail.com.

The System

Well in advance of the game, we purchased VoIP service from VoicePulse Connect. They have a deal that's really nice for planning a game, in that the first 4 channels are only $11/month. This allowed us to work on and test the server for a period of 4-5 months for very little money at all. Then, a week in advance of the game, we upgraded to 8 channels, just in case we got a lot of teams on the Virus clue at the same time.

On our end, I ran a Linux box with Ubuntu, Asterisk, Apache and MySQL. When someone called in, a Python script triggered by Asterisk looked up their phone number in the MySQL database, associating it with their team and the clues they were currently on. All guesses and advancements were logged through the database.
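The lookup-and-log step described above can be sketched roughly as follows. This is a minimal illustration, not the code that ran during the game: the table and column names, the phone number, and the helper functions are all hypothetical, it tracks a single current clue per team for simplicity, and it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema (an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT, current_clue TEXT);
        CREATE TABLE phones (number TEXT PRIMARY KEY, team_id INTEGER);
        CREATE TABLE guesses (id INTEGER PRIMARY KEY, team_id INTEGER,
                              clue TEXT, guess TEXT, correct INTEGER);
    """)
    conn.execute("INSERT INTO teams VALUES (1, 'Example Team', 'clue-1')")
    conn.execute("INSERT INTO phones VALUES ('6505550100', 1)")
    return conn


def lookup_caller(conn, caller_id):
    """Map a caller's phone number to (team id, team name, current clue).

    Returns None when the number is not registered to any team.
    """
    return conn.execute(
        "SELECT t.id, t.name, t.current_clue "
        "FROM phones p JOIN teams t ON t.id = p.team_id "
        "WHERE p.number = ?",
        (caller_id,),
    ).fetchone()


def log_guess(conn, team_id, clue, guess, answer):
    """Record every guess, right or wrong, and report whether it was correct."""
    correct = guess.strip().lower() == answer.strip().lower()
    conn.execute(
        "INSERT INTO guesses (team_id, clue, guess, correct) VALUES (?, ?, ?, ?)",
        (team_id, clue, guess, int(correct)),
    )
    return correct
```

Keying everything off the caller's number means a team never has to identify itself on the phone, and logging every guess in one table is what makes a central view of the game possible.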

Upon solving a clue, the server would look up the next site on the route that was neither closed nor marked compromised and move the team there. At two points in our game, we had branches where teams followed different routes to reduce the load on certain sites. The server was able to dynamically route teams based upon how many teams it had sent down which route.
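Here is one way the two routing rules above could look in Python. It is a sketch under assumptions: the Site class, the function names, and the "fewest teams so far" tie-breaking rule are all hypothetical, and the real server may well have balanced branches differently.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    closed: bool = False
    compromised: bool = False


def next_open_site(route, current_index):
    """Return the first site after current_index that is neither closed
    nor compromised, or None if the rest of the route is unavailable."""
    for site in route[current_index + 1:]:
        if not site.closed and not site.compromised:
            return site
    return None


def send_team(branch_counts):
    """Send a team down the branch with the fewest teams so far.

    branch_counts maps branch name -> number of teams already sent.
    Ties go to the branch listed first (dicts keep insertion order).
    """
    branch = min(branch_counts, key=branch_counts.get)
    branch_counts[branch] += 1
    return branch
```

For example, with three branches that all start at zero, four calls to send_team hand out the first, second, and third branch in turn and then return to the first.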

Since we were able to log everything in a central place, unlike a Palm-based system, GC could tell where all teams were headed at any given time. We were also able to change the route at a moment's notice if necessary. This allowed us to have backup sites in case of rain and to mark sites as compromised if something went wrong. Corey (of The Burninators) had told me how useful this would be, and I don't think it was until we were actually running the game that I realized it.

On the GC end of things, we had a set of mod_python PSP scripts running on my Apache server. This let us look up the location of teams and add notes every time they called. We also had a giant leaderboard, which, for a giant table of very slowly changing numbers, was amazingly interesting to watch.

What Worked
  • As I said before, the server answered almost 2000 calls. That is 2000 calls that GC did not have to answer manually, which gave GC a surprising amount of downtime.
  • Both Twisters Gym and the Bank Heist were restricted in terms of the number of people we could have on the clue at a given time. In this case, we split teams across 3 and 2 sites respectively. Handling this sort of routing would have been very difficult manually, but it happened seamlessly through the server.
  • At several points, we had to have backup sites in case it rained or in case we couldn't use a building at Stanford. We actually had to use the backup site, and we had to change site closing times in a few other cases. Each of these actions was pretty easy to do over a web interface.
  • As with most recent games, having an automated system allowed us to use arbitrary words as answers, making it easier to use various encodings, and making clue writing mostly independent of route.
  • Because the leaderboard was a website, it was accessible over the internet, and all GC members out in the field with an internet-enabled cellphone could see where all of the teams were and how long they'd been there.
What Didn't
  • The server on the first clue. The last 5 teams to leave Plaza Del Sol had to be manually routed because a horde of rabid squirrels attacked my server. I've pored over the logs generated by Asterisk, my Python scripts, and MySQL, and for the life of me, I can't figure out what happened. Somehow, a runaway MySQL process began eating 100% CPU, and for lack of a quicker fix, I had to restart the whole server. After that, it worked fine--go figure. Then I had the fun task of cleaning the bad data that got entered and manually fixing things to route the teams on the server where we had told them to go over the phone.
  • The server basically tied me to my apartment. We had generally planned to have me around GC for most of the game, but after that first snafu, it became clear that I really couldn't leave. Despite being a team of too many coders, I was the only one familiar enough with my code to fix it if it broke. If we had it to do over again, I'd have more actively tried to distribute the server knowledge.
  • There's nothing like 20 teams running through the game to test the code. Obviously, I should have written more tests, but I write code all day, and as much as I love coding, I don't always get home itching to do more of it. Other than the big crash, Here Be Dragons was incorrectly skipped over charades. Fortunately, the leaderboard acts as a manual double-check. I actually got calls from 2 GC members out in the field before I was able to fix this. All I have to say for myself is that 3-value logic is a scourge that should be cleansed from the land.
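The post doesn't say exactly how three-valued logic caused the skipped clue, so the following is only a generic illustration of the trap, not a reconstruction of the actual bug. In SQL, a comparison against NULL is neither true nor false, so a filter like `compromised != 1` silently drops rows where the flag was never set. The table, column, and site names here are hypothetical, and sqlite3 again stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (name TEXT, compromised INTEGER)")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?)",
    [("site-a", 0), ("site-b", 1), ("site-c", None)],  # site-c: flag never set
)


def names(query):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(query)]


# NULL != 1 evaluates to NULL, not true, so site-c is filtered out.
naive = names("SELECT name FROM sites WHERE compromised != 1 ORDER BY name")

# Treating an unset flag as 0 keeps site-c in the result.
null_safe = names(
    "SELECT name FROM sites WHERE COALESCE(compromised, 0) != 1 ORDER BY name"
)

print(naive)      # ['site-a']
print(null_safe)  # ['site-a', 'site-c']
```

Declaring such flag columns NOT NULL with a default is another way to keep the third value out of the picture entirely.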
Wrapup

We really liked the server. It was a lot of work beforehand to put it together, but it really paid off on the day of by allowing us to do creative re-routing on the spot and by taking some phone load off of GC. It had its bugs, but none of them were fatal. I'd highly recommend a centralized server system to other new teams who have a coder or two on their team. It takes a good amount of uncertainty and guesswork out of the route.

1 comment:

Darcy said...

I certainly had no idea that a backup site was used, so from the end-user perspective it was clearly a seamless transfer.

The you-being-the-only-one-who-knew-the-server-software sounds very much like our problem in TAZ of Ian-being-the-only-one-who-really-understood-the-puzzles such that people would phone me with questions and I'd be like, "uhh, I actually have no idea how that's supposed to work."