Last week, members of the SEASR team attended the Hadoop Summit & Big Data Computing Study Group. Bernie Acs and Xavier Llorà share the following report of the meeting:

The two events, hosted by Yahoo! and sponsored by the Computing Community Consortium, shared the common theme of data-intensive computing. The first focused on the current state of Hadoop [1], an open source Java implementation of Google’s MapReduce [2] programming model together with a supporting infrastructure that includes the Hadoop Distributed File System (HDFS). Modeled on Google’s GFS [3], HDFS is highly distributable and fault-tolerant. The second event was a symposium featuring talks by academic leaders and industry representatives, who presented their work and perspectives on the kinds of computing resources that will be needed to cope with rapidly growing data volumes.
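For readers unfamiliar with the MapReduce model, the sketch below shows the canonical word-count example against the Hadoop Java API of that era (org.apache.hadoop.mapred): the map function emits a (word, 1) pair for each token, and the reduce function sums the counts per word. This is a minimal illustration for orientation, not code taken from either event.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Map phase: for every input line, emit (word, 1) for each token.
    public static class WordMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue; // skip artifacts of leading whitespace
                word.set(token);
                out.collect(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together; sum them.
    public static class WordReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            out.collect(word, new IntWritable(sum));
        }
    }
}
```

The framework handles the rest: HDFS splits the input across the cluster, the shuffle stage groups intermediate pairs by key, and failed tasks are transparently re-executed, which is what makes the model attractive for data-intensive workloads.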

A pervasive idea carried through both sessions was that major internet companies like Google, Yahoo!, Amazon, and Microsoft are driving innovations and resources that will likely force new paradigms for how large computational environments are developed and used. Below, we note the influence these companies are having on HPC:

* Google is making available to the educational community an HPC cluster that will be managed by IBM and allocated by NSF (Christophe Bisciglia of Google was present at both sessions to serve as a speaker’s guest and to introduce this partnership).

* Yahoo! has implemented the largest production Hadoop infrastructure and has open-sourced the contributions developed at Yahoo! Research to enhance it. One is ZooKeeper [4], a highly available and reliable coordination system (see the Java sketch after this list). Another is the Pig [5] scripting package.

* Amazon’s Elastic Compute Cloud (EC2) [6] service provides virtual machine instances and virtualized clusters ready to run Hadoop applications.

* Microsoft Research is developing an infrastructure called “Dryad” [7], built using the .NET Framework, SQL Server 2005, and SQL Server Integration Services (SSIS) [8]. Dryad offers a highly scalable computing environment with fault tolerance, scheduling, and job management.
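As a concrete illustration of the coordination primitives ZooKeeper provides, here is a minimal, hypothetical group-membership sketch using ZooKeeper’s Java client: each worker registers an ephemeral znode under a /workers parent (assumed to already exist), and the live membership is simply the list of that node’s children. The class and path names are our own, illustrative choices.

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical group-membership helper. Ephemeral znodes are deleted by the
// server when the creating client's session ends, so the children of
// /workers always reflect the workers that are currently alive.
public class GroupMembership implements Watcher {

    private final ZooKeeper zk;

    public GroupMembership(String hosts) throws Exception {
        // hosts, e.g. "zk1:2181,zk2:2181"; 3000 ms session timeout.
        this.zk = new ZooKeeper(hosts, 3000, this);
    }

    public void join(String memberName) throws KeeperException, InterruptedException {
        // Ephemeral: removed automatically when this session dies.
        zk.create("/workers/" + memberName, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    public List<String> listMembers() throws KeeperException, InterruptedException {
        // The boolean flag also sets a watch, so the process() callback
        // below fires on the next membership change.
        return zk.getChildren("/workers", true);
    }

    public void process(WatchedEvent event) {
        // Connection and watch notifications arrive here.
        System.out.println("ZooKeeper event: " + event);
    }
}
```

The same ephemeral-node-plus-watch pattern underlies the leader-election and configuration-maintenance uses mentioned in [4].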

Data-intensive computing software architecture paradigms are shifting rapidly with the concepts of cloud computing and the virtualization of computing nodes, or even of entire computing clusters. These factors are influencing the future of HPC and will potentially help define new areas of focus that can gain traction with a variety of research communities suffering from data overflow, such as biology, genomics, astronomy, and other sciences.

Hadoop Summit (March 25th, 2008)

Approximately 400 attendees were present to hear about the current status of the Hadoop infrastructure project, learn about its future directions, and see real-world applications that use Hadoop showcased. A detailed agenda of speakers can be found here. Some of the most interesting application talks were those presented by Kevin Beyer (IBM) on the JAQL JSON query language, Andy Konwinski (UC Berkeley) on the X-Trace debugging framework, Steve Schlosser (Intel) on a ground-modeling application, Mike Haley (Autodesk) on object library processing, and Dr. Jimmy Lin (U. Maryland) on a cluster computing course for natural language processing.

Big Data Computing Study Group (March 26th, 2008)

The symposium was well attended by many distinguished participants. Speakers presented impressive examples of research areas that are swimming in oceans of ever-expanding data. Two presenters, Dr. Jill Mesirov (MIT & Harvard) and Dr. Alex Szalay (Johns Hopkins), talked about their respective science areas, how these fields are evolving with new technology, and how that evolution presents new challenges for managing exponential increases in data volumes. Jeannette Wing (NSF) introduced the Google/IBM/NSF partnership as a new and welcome innovation initiated by industry. Dr. Randy Bryant closed the session with the sentiment that services like Amazon’s EC2/S3 could lead academia to rethink HPC ownership. A complete speaker agenda along with abstracts can be found here, and detailed summaries of the talks can be found here.

References

[1] Hadoop infrastructure project pages.

[2] Google’s MapReduce was presented at OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. Paper and slides by Jeffrey Dean and Sanjay Ghemawat are available here.

[3] “Google and the Wisdom of Clouds,” an article dated December 13, 2007 and printed in the December 24, 2007 issue of Business Week under the subtitle “Google’s Next Big Dream”; article viewable here.

[4] Distributed applications use ZooKeeper to store and mediate updates to key configuration information; it can also be used for leader election, group membership, configuration maintenance, etc. Additional information available here.

[5] Pig is a language for processing data, together with evaluation mechanisms for local execution or for translation into a series of MapReduce operations executed on a Hadoop cluster. Additional information available here.

[6] The highlights page, service description, and pricing can be viewed here.

[7] Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting. Additional information and links are available here.

[8] SQL Server Integration Services in SQL Server 2005 provides a scripting capability that allows developers to create complex data transformations, data mining, and manipulations using graphical interfaces and/or .NET scripting. Additional information is available here.
