After the initial introductory speakers, the conference quickly became very technical with a preso by Chris Olston of Yahoo! on Pig, the system for performing data analysis on large quantities of data distributed over a Hadoop cluster. It’s basically the equivalent of Sawzall, if you’re at all familiar with Google’s technologies. One of the instigators for a system such as Pig is that many times developers are writing the same sorts of joins, sorts, and merges over and over again. Chris says there’s even a mailing list at Yahoo for sharing M-R snippets for such tasks. Pig makes it very easy to describe what data you want, and what you want to do with it. It compiles your Pig Latin script into a set of M-R applications and runs them on the cluster. A point that Chris stressed is that Pig Latin is not a query language, but rather a data flow description. Because it has an imperative style, the order of actions is more well-defined than say SQL.
I found Chris’ talk so engrossing that I wanted to join Yahoo! just to work on Pig. He’s a very good speaker and Pig strikes me as a very practical, and yet easily overlooked, part of the Hadoop system. For some reason that I have yet to understand, I fancy working on systems like this, the underlying infrastructure that most people (as in users, not developers) never see or give any thought to. Sure, working on GMail would be cool and all, but I’d rather develop something like Pig.
Next up was possibly the most “interesting” talk, by Michael Isard from Microsoft. Yeah, that Microsoft, the one that tried (and is still trying) to buy Yahoo! one way or another. He described his research project Dryad LINQ, or at least I think that’s what he works on. His description of Dryad was that of a system that analyzes a graph and performs a set of tasks over a distributed system. The graph describes the tasks, the data flow, and dependencies. It’s a very different approach than M-R, more general purpose. Naturally Dryad made a few trade-offs to improve overall performance. For instance, he believes in general Dryad performs very well, but it’s failure handling is rather inefficient. Hmm, interesting approach. I believe Google made the realization that in a large enough cluster, you are always going to have failures, so you had better deal with them gracefully and efficiently. And naturally these points came up during the Q&A section, to which he responded that he needs to come up with some comparison numbers. When asked if he’d looked at Hadoop in terms of performance, he flat out said “no”. Not a surprise there; frankly I’d be surprised if he even ran Hadoop once, let alone read any of the Google papers.
It goes without saying (but I’m saying it anyway) that all of this is implemented in, and on top of, Microsoft technologies (e.g. .NET, Windows). And you can surely bet that because it’s still in the research group, it will be a while before it sees the light of day, and it will most certainly not be open source in any reasonable way. One really funny part was some shill in the front row said “Well, I think judging by the reaction in the room, you’re kicking everybody’s butt, congratulations.” Um, yeah, I don’t think it was at all obvious that Dryad was better than Hadoop. They made certain choices and ended up with a very different system, with very different performance characteristics and features. It seemed to me that each node in Dryad was some arbitrary program, so they forfeited all of the advantages that M-R provides. Also, there’s no distributed file system (his actual response to a question from the audience). Presumably everything is stored in an SQLServer instance. Like I said, I really don’t see how that’s better than Hadoop.
For a pleasant change of pace, the next talk was about X-Trace, given by Andy Konwinski from UC Berkeley. He and Matei Zaharia (presumably) created hooks in Hadoop to enable monitoring events in the system as they occur within the cluster. Andy had a very appealing self-deprecating style, and made a few jokes about pretty graphs and dumb programmer mistakes, which warmed up the room after the rather dry and strange talk about Dryad. For instance, he and his colleagues used X-Trace to identify a silly configuration mistake they had made in Hadoop. They had set up 30 map workers but left the default number of reducers to 1, which caused their sample job to run for hours longer than it should have. This became blindingly obvious when they rendered a few graphs to show what was going on in the cluster. Clearly if you’re running into problems with Hadoop, X-Trace would be an excellent debugging tool. Andy gave another example in which a graph made very clear that one machine in particular was having disk problems, of a sort that impacted performance without necessarily taking the machine out of working order (and thus out of the cluster).
For the last talk before lunch, Ben Reed presented on the ZooKeeper project, which is both a distributed lock manager of sorts, and a distributed file system for very small files (everything is kept in main memory). It’s purpose is to facilitate configuration of the nodes in a cluster, enabling them to elect leaders and define membership, as well as serving as a name server. It’s actually very similar to Google’s Chubby lock service, just written in Java. Everything is stored in memory for fast response times, with a disk-based log, I assume for reconstructing the data if the node is restarted. The ZooKeeper team found that a system consisting of about three to five nodes works best. Fewer and reliability goes down; more and performance becomes an issue as the leader tries to keep all of the servers up-to-date.
Then there was lunch, which I’ll continue with in the next installment.