Archive for the ‘SEASR Team’ Category

A new version (0.3.3) of the SEASR Analytics for Zotero Firefox plugin was recently released. The new version adds support for the beta release of the upcoming Zotero 1.5. The new release can be downloaded here. This release is backward compatible with earlier Zotero versions.

SEASR co-PI Loretta Auvil will participate in the Mellon-funded Project Bamboo Workshop. Along with leaders from higher education; museums and libraries; and organizations, societies, and agencies across the U.S., she will attend the second session of The Planning Process & Understanding Arts and Humanities Scholarship workshop, to be held May 15-17, 2008, at the University of Chicago.

SEASR is twice mentioned in the Project Bamboo proposal, which sets as its goal formulating a strategic plan for enhancing the arts and humanities through the “development of shared technology services” (3). As one possible approach, the proposal recommends service-oriented architectures, such as SEASR’s, which emphasize “being able to re-use and weave together loosely-coupled, discrete, specialized technology services that come from other providers and projects rather than building and managing all on one’s own.” The proposal goes on to say that “Critical to such an approach is the implementation of a web services framework. Such a framework is not a vertical application that focuses on a single in-depth function or a self-contained software tool used directly by a user, but rather a horizontally integrating set of technologies and set of core shared capabilities that enable the creation, aggregation, and reuse of services and resources among scholars, projects, and institutions” (15-16). The passage notes SEASR’s special strength in data analysis and mining tools.

In imagining the humanities researcher of the future and her work process, the Bamboo proposal turns to SEASR once again, envisioning a synthetic Bamboo composer that uses a visual programming environment similar to the one SEASR uses today in its workbench (20).

Loretta Auvil was invited to present the keynote address at the Text Mining Workshop 2008, held in conjunction with the Eighth SIAM International Conference on Data Mining (SDM 2008) in Atlanta, GA, on April 26, 2008. Her presentation title echoes SEASR’s identifying phrase, “Engineering Knowledge for the Humanities.”

Presentation

Abstract

Over the last decade, NCSA’s Automated Learning Group has innovated data mining technologies for industry, government, and the sciences. In the past few years, we have broadened our focus to include knowledge discovery in the humanities. My presentation will focus on how we are negotiating humanities computing’s special challenges for data mining and analysis. I will discuss our early collaborative projects, FeatureLens and Nora, and SEASR (Software Environment for the Advancement of Scholarly Research), the Andrew W. Mellon Foundation-funded project we are now leading. Each of these projects has developed technologies customized to meet specific needs of the digital humanities community. FeatureLens, an early MONK (Metadata Offer New Knowledge) application, uses the machine learning approach of frequent pattern mining to identify fuzzy repetition patterns in a data collection without initial human input. Nora, a case study for eighteenth- and nineteenth-century British and American literature, uses predictive modeling techniques to classify documents, even given complex and notoriously indistinct expert classes such as sentimental fiction. SEASR is our most ambitious project yet, employing a semantic-based, service-oriented architecture to build software bridges that allow users to access data stored in disparate formats and on incompatible platforms, and to provide an enhanced environment for workflow and data sharing. The essential infrastructure SEASR provides will advance the capabilities of projects like our partner, MONK, a digital environment designed to help humanities scholars discover and analyze patterns.
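To give non-specialists a flavor of the frequent pattern mining approach FeatureLens builds on, here is a minimal, purely illustrative Python sketch that finds word n-grams recurring across a minimum number of documents. The function name, parameters, and toy data are our own inventions; FeatureLens’s actual algorithm mines fuzzy (gapped) repetition patterns and is not reproduced here.

```python
# Toy frequent pattern mining over text: find word n-grams that occur
# in at least `min_support` documents. Illustrative only; FeatureLens's
# real fuzzy-pattern algorithm is not reproduced here.
from collections import defaultdict

def frequent_ngrams(documents, n=3, min_support=2):
    """Return n-grams appearing in at least min_support documents."""
    support = defaultdict(set)          # n-gram -> ids of docs containing it
    for doc_id, text in enumerate(documents):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            support[tuple(words[i:i + n])].add(doc_id)
    return {gram: len(docs) for gram, docs in support.items()
            if len(docs) >= min_support}

docs = [
    "the quality of mercy is not strained",
    "the quality of mercy droppeth as the gentle rain",
    "mercy is not strained nor forced",
]
for gram, count in sorted(frequent_ngrams(docs).items()):
    print(" ".join(gram), "->", count, "documents")
```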

Last week, members of the SEASR team attended the Hadoop Summit & Big Data Computing Study Group. Bernie Acs and Xavier Llorà share the following report of the meeting:

The two events, hosted by Yahoo! and sponsored by the Computing Community Consortium, shared the common theme of data-intensive computing. The first focused on the current state of Hadoop [1], an open source Java implementation of Google’s MapReduce [2] programming model and a supporting infrastructure that includes a distributed file system (HDFS). Based on GFS [3], HDFS is highly distributable and fault-tolerant. The second event was a symposium featuring talks by academic leaders and industry representatives, who presented their work and perspectives on the computing resources that will be needed to handle ever-increasing data volumes.
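As a concrete illustration of the MapReduce model that Hadoop implements, the sketch below is a word-count job written in Python for Hadoop Streaming, which pipes records through a mapper and a reducer via stdin/stdout. The file name and invocation are illustrative assumptions, not material from the summit.

```python
# wordcount.py (hypothetical): a word-count job for Hadoop Streaming.
# Run the same file as mapper ("map" argument) and as reducer (any
# other argument); Hadoop sorts mapper output by key between phases.
import sys

def mapper():
    # Emit one tab-separated (word, 1) pair per word.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # After the sort phase, counts for each word arrive contiguously,
    # so a running total per key suffices.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Such a job is typically launched with the streaming jar that ships with Hadoop, along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"` (the jar’s exact path varies by installation).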

A pervasive idea at both sessions was that major internet companies like Google, Yahoo!, Amazon, and Microsoft are driving innovations and resources that will likely force new paradigms for how large computational environments are developed and used. Below, we note the influence these companies are having on HPC:

* Google is making available to the educational community an HPC cluster that will be managed by IBM and allocated by NSF (Christophe Bisciglia of Google was present at both sessions to serve as a speaker’s guest and to introduce this partnership).

* Yahoo! has deployed the largest production Hadoop infrastructure and has open-sourced contributions developed at Yahoo! Research to enhance it. One is ZooKeeper [4], a highly available and reliable coordination system. Another is the Pig [5] scripting package.

* Amazon’s Elastic Compute Cloud (EC2) [6] service provides virtual machine instances and virtualized clusters ready to run Hadoop applications.

* Microsoft Research is developing an infrastructure called “Dryad” [7], built on the .NET Framework, SQL Server 2005, and SQL Server Integration Services (SSIS) [8]. Dryad offers a highly scalable computing environment with fault tolerance, scheduling, and job management.

Data-intensive computing software architecture paradigms are shifting rapidly with the rise of cloud computing and the virtualization of computing nodes, or even of entire computing clusters. These factors are influencing the future of HPC and may help define new areas of focus, gaining traction with a variety of research communities suffering from data overflow, such as biology, genomics, astronomy, and other sciences.

Hadoop Summit (March 25, 2008)

Approximately 400 attendees were present to hear about the current status of the Hadoop infrastructure project, learn about its future directions, and see showcases of real-world applications that use Hadoop. A detailed agenda of speakers can be found here. Some of the most interesting application talks were those presented by Kevin Beyer (IBM) on the JAQL JSON query language, Andy Konwinski (UC Berkeley) on the X-Trace debugger, Steve Schlosser (Intel) on a ground-modeling application, Mike Haley (Autodesk) on object library processing, and Dr. Jimmy Lin (U. Maryland) on a cluster computing course for natural language processing.

Big Data Computing Study Group (March 26, 2008)

The symposium was well attended by many distinguished participants. Speakers presented impressive examples of research areas awash in ever-expanding oceans of data. Two presenters, Dr. Jill Mesirov (MIT and Harvard) and Dr. Alex Szalay (Johns Hopkins), talked about their respective science areas, how these fields are evolving with new technology, and how that evolution presents new issues for managing exponential increases in data volumes. Jeannette Wing (NSF) introduced the Google/IBM/NSF partnership as a new and welcome innovation initiated by industry. Dr. Randy Bryant closed the session with the sentiment that services like Amazon’s EC2/S3 could lead academia to rethink HPC ownership. A complete speaker agenda along with abstracts can be found here, and detailed summaries of the talks can be found here.

References

[1] Hadoop infrastructure project pages.

[2] Google’s MapReduce was presented at OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. Paper and slides by Jeffrey Dean and Sanjay Ghemawat available here.

[3] “Google and the Wisdom of Clouds,” article dated December 13, 2007, printed in the December 24, 2007 issue of BusinessWeek and subtitled “Google’s Next Big Dream”; article viewable here.

[4] Distributed applications use ZooKeeper to store and mediate updates to key configuration information; it can also be used for leader election, group membership, configuration maintenance, etc. Additional information available here.

[5] Pig is a language for processing data, with evaluation mechanisms for local execution or for translation into a series of MapReduce operations executed on a Hadoop cluster. Additional information available here.

[6] Highlight page, service description, and pricing can be viewed here.

[7] Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting. Additional information and links are available here.

[8] SQL Server Integration Services in SQL Server 2005 provides a scripting capability that allows developers to create complex data transformations, data mining, and manipulations using graphical interfaces and/or .NET scripting. Additional information is available here.

During Summer 2007, SEASR team member Xavier Llorà, a Research Scientist with NCSA’s Automated Learning Group and the Illinois Genetic Algorithms Laboratory, received two Bronze Humies at GECCO 2007 (the Genetic and Evolutionary Computation Conference). Dr. Llorà also received two Best Paper awards at international conferences within a month of one another, a signal accomplishment.

First, the Humies: Dr. Llorà and NCSA faculty fellow Rohit Bhargava (Bioengineering and Beckman, UIUC), with the support of students Rohith Reddy (Bioengineering and Beckman, UIUC) and Brian Matesic (Bioengineering, UIUC), received a Bronze Humie for “Towards Better than Human Capability in Diagnosing Prostate Cancer Using Infrared Spectroscopic Imaging.” The team used a novel genetics-based machine learning technique (NAX) to diagnose prostate cancer. Their innovative data handling and analysis strategies demonstrate fast learning and accurate classification that scale well with parallelization. For the first time, an automated discovery method has performed as accurately as human experts in predicting prostate cancer.

Along with co-authors Jaume Bacardit (ASAP research group, School of Computer Science and IT, U. Nottingham), Michael Stout (ASAP research group, School of Computer Science and IT, U. Nottingham), Jonathan D. Hirst (School of Chemistry, U. Nottingham), Kumara Sastry (Illinois Genetic Algorithms Laboratory, UIUC), and Natalio Krasnogor (ASAP research group, School of Computer Science and IT, U. Nottingham), Dr. Llorà was also awarded a Bronze Humie for “Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein Structure Prediction.” The paper demonstrates that automated procedures can reduce the size of the amino acid alphabet used for protein structure prediction from twenty letters to just three with no significant loss of accuracy. This discovery has the potential to enable a faster and easier learning process, as well as to generate more compact and human-readable classifiers.

The Humies are awarded annually to recognize human-competitive results produced by genetic and evolutionary computation. Each Bronze award carries a $1000 prize.

____________

Next, the Best Papers: Dr. Llorà’s award-winning papers were “Toward Billion Bit Optimization via Parallel Estimation of Distribution Algorithm,” which won the Estimation of Distribution Algorithms track award at GECCO 2007 (the Genetic and Evolutionary Computation Conference) in London, England, this July, and “Delineating Topic and Discussant Transitions in Online Collaborative Environments,” which won the overall conference award at ICEIS 2007 (International Conference on Enterprise Information Systems) in Funchal, Madeira, Portugal, this June. Dr. Llorà co-authored the first paper with NCSA Research Scientist David E. Goldberg (who also serves as Jerry S. Dobrovolny Distinguished Professor in Entrepreneurial Engineering and Director of the Illinois Genetic Algorithms Laboratory, UIUC) and Kumara Sastry (Illinois Genetic Algorithms Laboratory, UIUC). The second paper was co-authored by Noriko Imafuji Yasui, a postdoctoral fellow at the Illinois Genetic Algorithms Laboratory; Professor Goldberg; and marketing researchers Yuichi Washida and Hiroshi Tamura.

“Toward Billion Bit Optimization via Parallel Estimation of Distribution Algorithm” takes on a major open problem in the field of genetic algorithms, which is devoted to search procedures based on the mechanics of natural selection and genetics and which, since the mid-1980s, has increasingly been used to find answers to important scientific problems. Until now, genetic algorithms have been criticized as slow and suitable only for optimizing problems with a few variables; experts have believed genetic algorithms could not scale to larger, more complex problems. However, Dr. Llorà and his co-authors show that genetic algorithms, by exploiting a number of memory and computational efficiencies, can be scaled to provide principled solutions to boundedly difficult, large-scale problems with millions to billions of binary variables. Moreover, they showed that their fully parallelized, highly efficient compact genetic algorithm could solve a class of additively separable problems even under additive noise, where local search methods fail in the presence of even a modest amount of noise.
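To make the underlying mechanic concrete, the sketch below is a toy, serial compact genetic algorithm (cGA) solving the OneMax problem (maximize the number of 1 bits) in Python. It shows only the core idea the paper scales up, a probability vector nudged toward the winner of pairwise tournaments; the authors’ fully parallelized, memory-efficient variant and its noise handling are not reproduced, and all names and parameters here are illustrative.

```python
# Toy serial compact genetic algorithm (cGA) on OneMax. Illustrative
# only; the award-winning paper's parallel, billion-variable machinery
# is not reproduced here.
import random

def onemax(bits):
    return sum(bits)                        # fitness: number of 1 bits

def compact_ga(n_bits=64, pop_size=50, max_iters=20000):
    p = [0.5] * n_bits                      # probability each bit is 1
    for _ in range(max_iters):
        a = [int(random.random() < pi) for pi in p]
        b = [int(random.random() < pi) for pi in p]
        winner, loser = (a, b) if onemax(a) >= onemax(b) else (b, a)
        for i in range(n_bits):             # shift p toward the winner
            if winner[i] != loser[i]:
                p[i] += 1.0 / pop_size if winner[i] else -1.0 / pop_size
                p[i] = min(1.0, max(0.0, p[i]))
        if all(pi in (0.0, 1.0) for pi in p):
            break                           # the model has converged
    return [int(pi > 0.5) for pi in p]

print(sum(compact_ga()), "of 64 bits set")  # typically close to 64
```

Because the cGA stores one probability per variable rather than a whole population, its memory footprint is what makes very large problem sizes thinkable in the first place.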

“Delineating Topic and Discussant Transitions in Online Collaborative Environments” details a new algorithmic method for analyzing discussion dynamics and social networking in online collaborative environments (in this case, focus group discussions for product conceptualization), a relatively new and important domain of social and consumer communications research. The team developed an algorithm named KEE (Key Elements Extraction), which applies the HITS (Hyperlink-Induced Topic Search) algorithm (Kleinberg, 1999) in a novel way: to text mining rather than to dividing web pages into hubs and authorities. The KEE algorithm assumes a mutually reinforcing relationship between participants and terms, defining significant participants as those who use many significant terms and significant terms as those used by many significant participants.
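That description suggests a HITS-style power iteration over a bipartite participant-term graph. The Python sketch below, with invented names and toy data, shows what such a mutually reinforcing scoring loop can look like; it is our illustration of the idea, not the published KEE implementation.

```python
# Sketch of mutually reinforcing participant/term scoring in the spirit
# of HITS. Illustrative only; not the published KEE implementation.
import math

def kee_style_scores(usage, iters=50):
    """usage[p][t] = how often participant p used term t."""
    n_p, n_t = len(usage), len(usage[0])
    part = [1.0] * n_p
    term = [1.0] * n_t
    for _ in range(iters):
        # Significant terms are those used by significant participants...
        term = [sum(usage[p][t] * part[p] for p in range(n_p))
                for t in range(n_t)]
        # ...and significant participants are those using significant terms.
        part = [sum(usage[p][t] * term[t] for t in range(n_t))
                for p in range(n_p)]
        # Normalize so the iteration settles into relative scores.
        tn = math.sqrt(sum(x * x for x in term)) or 1.0
        pn = math.sqrt(sum(x * x for x in part)) or 1.0
        term = [x / tn for x in term]
        part = [x / pn for x in part]
    return part, term

usage = [[4, 0, 1, 0],      # rows: 3 participants
         [3, 1, 0, 0],      # columns: 4 terms (toy usage counts)
         [0, 0, 0, 5]]
participants, terms = kee_style_scores(usage)
print("participant scores:", [round(x, 2) for x in participants])
print("term scores:", [round(x, 2) for x in terms])
```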

Employing real discussion data, the team determined that the KEE algorithm provided a better understanding and depiction of participants’ ideas than the traditional TF-IDF (term frequency-inverse document frequency) method. Moreover, since key terms were associated with key persons in the discussion, the terms themselves already conveyed the potential knowledge sought. These results, in which discussion dynamics analysis and social network analysis produced significant knowledge essential to decision support, demonstrate the KEE algorithm’s effectiveness for network- and text-based communication analysis. The KEE algorithm research is associated with the Illinois Genetic Algorithms Laboratory’s DISCUS project, which targets innovation support through network-based communication, using two other chance discovery methods, KeyGraph (Ohsawa, Benson, & Yachida, 1998) and influence diffusion models (IDM) (Matsumura, Ohsawa, & Ishizuka, 2002). Next, the team plans to use the KEE algorithm for knowledge discovery in weblogs and web forums.

SEASR is indeed fortunate to have such a gifted researcher and collaborator on our team!

__________

References

Bacardit, Jaume, Michael Stout, Jonathan D. Hirst, Kumara Sastry, Xavier Llorà, and Natalio Krasnogor, “Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein Structure Prediction,” in Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, London, England, United Kingdom, ACM, 2007.

Goldberg, David E., Kumara Sastry, and Xavier Llorà, “Towards Billion Bit Optimization via Efficient Genetic Algorithms,” in Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, London, England, United Kingdom, ACM, 2007. Also published in Complexity, 12 (3), 27-29.

Kleinberg, Jon M. (1999), “Hubs, Authorities, and Communities,” ACM Computing Surveys, 31 (4), No. 5.

Llorà, Xavier, Rohith Reddy, Brian Matesic, and Rohit Bhargava, “Towards Better than Human Capability in Diagnosing Prostate Cancer Using Infrared Spectroscopic Imaging,” in Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, London, England, United Kingdom, ACM, 2007.

Matsumura, Naohiro, Yukio Ohsawa, and Mitsuru Ishizuka (2002), “Automatic Indexing for Extracting Asserted Keywords from a Document,” New Generation Computing, 21(1), 37-48.

Ohsawa, Yukio, Nels E. Benson, and Masahiko Yachida (1998), “KeyGraph: Automatic Indexing by Co-Occurrence Graph Based on Building Construction Metaphor,” ADL, 12-18.

Yasui, Noriko Imafuji, Xavier Llorà, David E. Goldberg, Yuichi Washida, and Hiroshi Tamura, “Delineating Topic and Discussant Transitions in Online Collaborative Environments,” in Proceedings of the Tenth International Conference on Enterprise Information Systems, Funchal, Madeira, Portugal, ICEIS Press, 2007.