Archive for the ‘Community Building’ Category

The University of Victoria’s Digital Humanities Summer Institute (DHSI) was held June 8-12, 2009. Loretta Auvil and Boris Capitanu taught the course entitled “SEASR in Action: Data Analytics for Humanities Scholars”. The slides and course materials for this workshop are at http://dev-tools.seasr.org/confluence/display/Outreach/DHSI-SEASR.

We had 15 students registered for the course. The course covered the following topics: an overview of the SEASR infrastructure (components, flows, applications), an introduction to text mining tools, and using and creating Zotero flows.

Loretta Auvil and Xavier Llorà of the SEASR Team participated in Bamboo Workshop 2, held October 16-18, 2008 in San Francisco, CA. Attendees participated in discussions on planning shared technology services for arts and humanities scholarship.

SEASR co-PI Loretta Auvil will participate in the Mellon-funded Project Bamboo Workshop. Along with other leaders from higher education, museums and libraries, and organizations, societies, and agencies across the U.S., she will attend the second session of The Planning Process & Understanding Arts and Humanities Scholarship workshop, to be held May 15-17, 2008 at the University of Chicago.

SEASR is twice mentioned in the Project Bamboo proposal, which sets as its goal formulating a strategic plan for enhancing the arts and humanities through the “development of shared technology services” (3). As one possible approach, the proposal recommends service-oriented architectures—such as SEASR’s—which emphasize “being able to re-use and weave together loosely-coupled, discrete, specialized technology services that come from other providers and projects rather than building and managing all on one’s own.” The proposal goes on to say that “Critical to such an approach is the implementation of a web services framework. Such a framework is not a vertical application that focuses on a single in-depth function or a self-contained software tool used directly by a user, but rather a horizontally integrating set of technologies and set of core shared capabilities that enable the creation, aggregation, and reuse of services and resources among scholars, projects, and institutions” (15-16). The passage notes SEASR’s special strength in data analysis and mining tools.

In imagining the humanities researcher of the future and her work process, the Bamboo proposal turns to SEASR once again, envisioning a synthetic Bamboo composer that uses a visual programming environment similar to the one SEASR uses today in its workbench (20).

The SEASR and NEMA (Networked Environment for Music Analysis) teams have transformed a dynamic music classification explorer developed by IMIRSEL (The International Music Information Retrieval Systems Evaluation Laboratory) into a SEASR application that can be reused in whole or in part by music researchers everywhere. Ira Fuchs, Vice President of Research in Information Technology for The Andrew W. Mellon Foundation (sponsor of SEASR and NEMA), gave the “Son of Blinkie” (SoB) explorer its first demonstration on April 16th.

INTRODUCING SON OF BLINKIE

Innovations in digital technologies have changed the ways we create, access, analyze, share, and consume information. But to realize their full potential, we need to re-evaluate digital information technologies to consider whether their methods are holdovers from the age of print and, if so, what improved means we can devise. IMIRSEL’s SoB [1, 2], a dynamic classification explorer for musical digital library users and researchers, offers such an advance in the way we access and analyze music.

In print collections and their digital descendants, information is retrieved through metadata, or descriptive labels, imposed upon it by librarians, editors, and domain experts. This metadata is used to generate tables of contents, subject indexes, and other searchable formats. Once determined, such labels and their associated epistemologies tend to become fixed and accepted as fact; they present a closed system of established knowledge rather than provide a virtual landscape that encourages exploration and enables discovery.

In developing Son of Blinkie—affectionately named after the earlier, simpler “Blinkie Thing” [3]—the researchers at IMIRSEL have sought to bring leading machine learning methods to bear on the problem of how to make better use of the now-digital nature of music collections. They have developed a means of searching music automatically, using its compositional features rather than imposed metadata as a guide. Not only does this automated method improve the speed and accuracy of information retrieval, but it promises to enrich our understanding of music and its classification.

Faced with a collection of music, we often accept that the labels imposed by past listeners are accurate and/or informative. But listeners may hold conflicting opinions about a piece, and the piece itself may defy reductive labeling. By analyzing a piece through its own compositional features, machine learning can help us understand whether a given piece is representative of a genre or mood as a whole or of certain compositional tendencies within it, tendencies that may change over time, by performer, or even by performance. What’s more, Son of Blinkie (SoB) advances earlier attempts to automate digital music collection retrieval and analysis.

Consider the traditional train-test approach to building, evaluating, and using machine-generated audio-based classifications (e.g., genre, mood, artist) for Music Digital Libraries (MDL). It is useful in some contexts, but it has two serious shortcomings. First, the classifications are monic (i.e., only one class label per piece). This monicity ignores the fact that most music comprises a mix of moods and/or genres. Second, the classifications are static (i.e., the label never changes over a piece’s play time), even though pieces evolve through several moods and/or genre mixes as they play. The SoB system offers a new and superior method of digital music exploration, engineered to overcome these train-test shortcomings and better capture the dynamic nature of music. SoB provides users with the capacity for highly configurable real-time classification, visualization, and audition.

Another important advancement made with SoB is that the application operates within SEASR’s service-oriented architecture, taking the form of a series of reusable, open-source components managed by and executed as a shareable workflow from SEASR’s community hub. Not only can users run SoB against their own data sets (with SEASR’s assistance in accepting different input formats stored on different platforms), but they can also reuse and revise components and workflows to build their own music research applications.

SON OF BLINKIE IN ACTION

SoB works by extracting a stream of features from audio tracks and applying a set of pre-trained classification models to short windows (10 sec.) of these features to generate posterior probability distributions in real time. The display of the classification probabilities is synchronized with the audio playback, empowering users to dynamically explore the effects and interactions of the many parameters involved in automatic music classification. SoB permits users to select an arbitrary number of classification models from the system’s ever-growing model library. Currently SoB’s model library comprises two classification “task” collections: mood classifiers and genre classifiers.
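To make the window mechanics concrete, here is a minimal sketch in plain Java of the idea just described. Every name in it (WindowedClassifierSketch, ClassificationModel, FeatureExtractor) is a hypothetical stand-in; the actual SoB feature extractors and model interfaces are not shown in this post.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one of SoB's pre-trained models. */
interface ClassificationModel {
    /** Returns a posterior probability for each class label. */
    Map<String, Double> classify(double[] features);
}

/** Hypothetical stand-in for SoB's audio feature extraction. */
interface FeatureExtractor {
    double[] extract(double[] samples);
}

public class WindowedClassifierSketch {

    static final int WINDOW_SECONDS = 10; // the window size named in the post

    /** Applies every selected model to each 10-second window, yielding
        one posterior distribution per (window, model) pair. */
    static void classifyTrack(double[] samples, int sampleRate,
                              FeatureExtractor extractor,
                              List<ClassificationModel> models) {
        int windowLength = WINDOW_SECONDS * sampleRate;
        for (int start = 0; start + windowLength <= samples.length;
                start += windowLength) {
            double[] window = new double[windowLength];
            System.arraycopy(samples, start, window, 0, windowLength);
            double[] features = extractor.extract(window);
            for (ClassificationModel model : models) {
                // In SoB these distributions drive a display synchronized
                // with playback; here we simply print them.
                System.out.printf("t=%ds  %s%n",
                        start / sampleRate, model.classify(features));
            }
        }
    }

    public static void main(String[] args) {
        double[] samples = new double[44100 * 25];            // 25 s of silence
        FeatureExtractor fe = w -> new double[] { w.length }; // dummy feature
        ClassificationModel mood =
                f -> Map.of("calm", 0.7, "tense", 0.3);       // dummy model
        classifyTrack(samples, 44100, fe, List.of(mood));
    }
}
```

In the real system the per-window distributions drive a display synchronized with playback; the sketch simply prints one distribution per window per model.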

[Figure: Son of Blinkie’s display of real-time mood and genre classification probabilities]

Above, we show a user simultaneously exploring the different real-time behaviors of mood classification models and genre classification models. Each model makes different predictions on this particular 5-second slice of the incoming, never-before-heard song. The user can visualize the models’ prediction probability distributions, which can help the user better appreciate the potential “mixture” of moods present. The user can also listen to the synchronized audio to better understand the strengths and weaknesses of each model.

Below is a view that shows how data flows through the Son of Blinkie system as it operates within SEASR (specifically, within the semantic web-driven dataflow execution environment portion of SEASR, which we have named Meandre). Each component represents one step in processing the data. The components run (and so process data) in the order established by the flow: from receiving the song filename and model filenames from the web application, to loading the audio and model data into memory, to extracting a variety of features from the song, to applying the models to the extracted features, to returning the predicted results to the SEASR community hub (a web application) for visualization. Every time a different song is selected, the web application executes this same flow.

[Figure: The Son of Blinkie dataflow as executed within SEASR/Meandre]
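For readers who prefer code to diagrams, the component order in the flow can be mirrored in a short schematic. The sketch below is hypothetical and hand-coded with stub types; the real flow is assembled from reusable Meandre components in the workbench rather than written this way.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Schematic of the Son of Blinkie flow's processing order, using stub
 * types. The real flow is wired declaratively from reusable Meandre
 * components; this only mirrors the order described above.
 */
public class SonOfBlinkieFlowSketch {

    static class Audio {}
    static class Model {}
    static class Features {}
    static class Predictions {}

    // Each method stands in for one component in the flow.
    Audio loadAudio(String songFile)      { return new Audio(); }
    Model loadModel(String modelFile)     { return new Model(); }
    Features extractFeatures(Audio audio) { return new Features(); }
    Predictions applyModels(List<Model> models, Features features) {
        return new Predictions();
    }

    /** Executed once per song selection, mirroring how the web
        application re-runs the same flow for each new song. */
    Predictions run(String songFile, List<String> modelFiles) {
        Audio audio = loadAudio(songFile);           // load audio into memory
        List<Model> models = new ArrayList<>();
        for (String f : modelFiles) {                // load each selected model
            models.add(loadModel(f));
        }
        Features features = extractFeatures(audio);  // extract features
        return applyModels(models, features);        // classify; results return
                                                     // to the hub for display
    }

    public static void main(String[] args) {
        new SonOfBlinkieFlowSketch()
                .run("song.mp3", List.of("mood.model", "genre.model"));
        System.out.println("flow completed");
    }
}
```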

REFERENCES

  1. Funded by The Andrew W. Mellon Foundation and the National Science Foundation (Grant No. NSF IIS-0327371). Thanks to M. C. Jones and the SEASR team for their technical assistance.
  2. IMIRSEL is directed by Dr. J. Stephen Downie, Graduate School of Library and Information Science (GSLIS), UIUC (jdownie@uiuc.edu). His Co-PIs on the Son of Blinkie system are Kris West, School of Computing Science, University of East Anglia and Xiao Hu, GSLIS, UIUC.
  3. Downie, J.S., Ehmann, A.F., and Tcheng, D. 2005. Real-time genre classification for music digital libraries. JCDL ’05, 337.
  4. NEMA Website: http://nema.lis.uiuc.edu.
  5. SEASR Website: http://www.seasr.org.

Throughout March, SEASR and I-CHASS hosted humanities and social sciences research teams selected for their diversity of approach and interest:

Global Middle Ages: “Global middle ages” is a term in Medieval Studies that designates an interest in the middle ages across the world, i.e., including non-Western societies. The research group (Susan Noakes, French and Italian, Medieval Period, U. Minnesota; Geraldine Heng, Medieval and Women’s Studies, English, U. Texas-Austin; and Ayhan Aytes, doctoral candidate in Communication at UC-San Diego and also a medievalist) thus intends to create a digital resource that establishes and enriches researchers’ understanding of how non-Western societies contributed to medieval European culture (approximately 500-1450 CE). The design for this project is centered on a mapped narrative of cultural influences coming out of Africa (e.g., the former provinces of Rome in the north, including Egypt; later, Islamicized Africa, especially Moorish civilization; and, later still, Western Africa as a site of empires as well as of the transatlantic slave trade). It will thus ground the historical narrative for users through appeals to their temporal, visual, and spatial imaginations. As with digital timelines, such mapped narratives tend to offer waypoints at which users can “stop” to browse in-depth information provided in a variety of media forms.

Peace and Nonviolence: This project brings together researchers who have worked to promote peace and nonviolence through informed activism. They are uniformly interested in the social causes of violence. Steven Valdivia, Independent Scholar (former Executive Director, Crisis Intervention Network-LA), and Fernando Hernandez, Education, CSU-Los Angeles (emeritus), are two researchers working on LA gangs. They are especially interested in how governmental responses to poverty, minority status, and gang activity have fostered gang formation and violence. They are seeking means of counteracting gang formation that might be recommended as public policies. One theory they hope to prove is that the militarization of the response to gang activity has worsened rather than improved gang violence. The researchers from the Southern Poverty Law Center (Mark Potok and Heidi Beirich) are interested in research subjects that fit their civil rights mission, which the center pursues through its “tolerance education programs, its legal victories against white supremacists and its tracking of hate groups.” They are especially concerned with researching the formation of hate groups (e.g., white supremacist groups), particularly how they hail new members.

Digital Portfolio Project:  Virginia Kuhn (Research Assistant Professor, Associate Director of the Institute of Multimedia Literacy, USC School of Cinematic Arts) has just led the first class through a new, intensive program at IML.  Their senior year culminates in a major multimedia design project, producing a finished piece with support work for each of the 30+ students.  Because archiving technology is increasingly available and because the new program is an important focus for the school, Dr. Kuhn wants to find a stable and innovative means for archiving these projects and retrieving information from them—with her ultimate goal being to produce a persistent, state-of-the-art pedagogical resource at USC, one that could serve as a model for other programs.  According to Dr. Kuhn’s official faculty bio, the “project was recently awarded a large (3 terabyte) allowance of storage space on SDSC’s TeraGrid.”  Consulting on the project are ISU’s Cheryl Ball, a specialist in digital composition and rhetoric (English) and Editor of Kairos, and Elijah Wright, a doctoral candidate at Indiana University’s School of Library and Information Science.

We are working with these teams to apply and further develop SEASR’s capabilities, and will feature their projects in a SEASR community-building workshop later this summer.

Last week, members of the SEASR team attended the Hadoop Summit & Big Data Computing Study Group. Bernie Acs and Xavier Llorà share the following report of the meeting:

The two events, hosted by Yahoo! and sponsored by the Computing Community Consortium, shared the common theme of data-intensive computing. The first focused on the current state of Hadoop [1], an open-source Java implementation of Google’s MapReduce [2] programming model and a supporting infrastructure that includes a distributed file system (HDFS). Modeled on Google’s GFS [3], HDFS is highly distributed and fault-tolerant. The second event was a symposium featuring several talks by academic leaders and industry representatives, who presented their work and perspectives on the types of computing resources that will be needed to handle anticipated increases in data volumes.
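For readers unfamiliar with the programming model, the canonical MapReduce illustration is counting words. The sketch below follows the standard word-count example against Hadoop’s Java MapReduce API; treat it as an illustration of the model rather than code from either event, and note that API details vary across Hadoop releases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Map step: emit (word, 1) for every token in the input split. */
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce step: sum the counts emitted for each word. */
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map step runs in parallel across input splits stored in HDFS, and the framework shuffles all counts for a given word to a single reduce invocation, which sums them.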

A pervasive idea carried through both sessions was that major internet companies like Google, Yahoo!, Amazon, and Microsoft are driving innovations and resources that will likely force new paradigms for how large computational environments should or could be developed and used. Below, we note the influence these companies are having on HPC:

* Google is making available to the educational community an HPC cluster that will be managed by IBM and allocated by NSF (Christophe Bisciglia of Google was present at both sessions to serve as a speaker’s guest and to introduce this partnership).

* Yahoo! runs the largest production deployment of Hadoop and has open-sourced contributions developed at Yahoo! Research to enhance it. One is ZooKeeper [4], a highly available and reliable coordination system (see the sketch following this list). Another is the Pig [5] scripting package.

* Amazon’s Elastic Compute Cloud (EC2) [6] service provides virtual machine instances and virtualized clusters ready to run Hadoop applications.

* Microsoft Research is developing an infrastructure called “Dryad” [7], built using the .NET Framework, SQL Server 2005, and SQL Server Integration Services (SSIS) [8]. Dryad offers a highly scalable computing environment with fault tolerance, scheduling, and job management.
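As an example of the coordination primitives ZooKeeper offers (see the Yahoo! item above), here is a minimal leader-election sketch built on ephemeral sequential znodes. The connection string and the /election path are invented for the example, and the /election parent node is assumed to already exist.

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Connection string and election path are made up for the example.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // Each candidate creates an ephemeral, sequential znode; the node
        // disappears automatically if the candidate's session dies.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate holding the lowest sequence number leads.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        boolean leader = me.endsWith(candidates.get(0));
        System.out.println(me + (leader ? " is the leader" : " is a follower"));
        zk.close();
    }
}
```

Because the znodes are ephemeral, a crashed leader’s node vanishes with its session, and the next-lowest candidate can take over.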

Data-intensive computing software architecture paradigms are shifting rapidly with the concepts of cloud computing and the virtualization of computing nodes, or even of entire computing clusters. These factors are influencing the future of HPC and will potentially help define new areas of focus and gain traction with a variety of research communities suffering from data overflow, such as biology, genomics, astronomy, and other sciences.

Hadoop Summit (March 25th, 2008)

Approximately 400 attendees were present to hear about the current status of the Hadoop infrastructure project, learn about its future directions, and see showcased some real-world applications that use Hadoop. A detailed agenda of speakers can be found here. Some of the most interesting application talks were those presented by Kevin Beyer (IBM) on the JAQL JSON query language, Andy Konwinski (UC Berkeley) on the X-Trace debugger, Steve Schlosser (Intel) on a ground modeling application, Mike Haley (Autodesk) on object library processing, and Dr. Jimmy Lin (U. Maryland) on a cluster computing course for natural language processing.

Big Data Computing Study Group (March 26th, 2008)

The symposium was well attended by many distinguished participants. Speakers presented some impressive examples of large-data research areas that are swimming in oceans of ever-expanding data. Two presenters, Dr. Jill Mesirov (MIT & Harvard) and Dr. Alex Szalay (Johns Hopkins), talked about their respective science areas, how these fields are evolving with new technology, and how that is presenting new issues for managing exponential increases in data volumes. Jeannette Wing (NSF) introduced the Google/IBM/NSF partnership as a new and welcome innovation initiated by industry. Dr. Randy Bryant closed the session with the sentiment that services like Amazon’s EC2/S3 could lead academia to rethink HPC ownership. A complete speaker agenda along with abstracts can be found here, and detailed summaries of the talks can be found here.

References

[1] Hadoop infrastructure project pages.

[2] Google’s MapReduce was presented during OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. Paper and Slides by Jeffrey Dean and Sanjay Ghemawat available here.

[3] “Google and the Wisdom of Clouds” article dated December 13, 2007, printed in the December 24, 2007 issue of Business Week subtitled “Google’s Next Big Dream”; article viewable here.

[4] Distributed applications use ZooKeeper to store and mediate updates for key configuration information; it can also be used for leader election, group membership, configuration maintenance, etc. Additional information available here.

[5] Pig is a language for processing data and a set of evaluation mechanisms for local execution or for translation into a series of MapReduce operations executed by a Hadoop cluster. Additional information available here.

[6] Highlight page, service description, and pricing can be viewed here.

[7] Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting. Additional information and links are available here.

[8] SQL Server Integration Services in SQL Server 2005 provides a scripting capability that allows developers to create complex data transformations, data mining, and manipulations using graphical interfaces and/or .NET scripting. Additional information is available here.

SEASR cyberinfrastructure technology will be leveraged by the Cultural Informatics project of UIUC’s Institute for Advanced Computing Applications and Technologies (IACAT), which combines faculty-initiated research in its academic units with the advanced technology capabilities at NCSA.

Cultural Informatics, led by Michael Ross (Krannert Center for the Performing Arts), will “apply information science and technology to the creation and comprehension of human experience, to the understanding and expression of the human condition, and to the revelation and communication of human values and meaning. This may include the creation of new aesthetic works, public engagement, formal and informal education, the performing arts, museum and other exhibition venues, and design strategies that affect society.” Donna Cox (NCSA) and Guy Garnett (Music/Seedbed Initiative) serve as co-PIs.

Members of the SEASR team recently attended The Andrew W. Mellon Research in Information Technology retreat, held on the Princeton University campus (February 28-29, 2008). The retreat gave us the opportunity to strategize our approach to sustainability and outreach with other project leaders, as well as to share our progress.

Our project’s technical highlights, as given in the retreat report, are these:

SEASR’s adoption and sustainability depend on providing tools strategized to meet the digital humanities and humanities communities’ needs and crafted to operate efficiently and effectively. Over the past six months, we have assembled an outstanding development team and embarked on the journey of designing and building this transformational technology. The team has developed key infrastructure architecture with a semantic web-driven dataflow execution environment as well as a developer workbench to create the flows. We have created two important core functionalities: 1) a self-contained execution environment and 2) the ability to define extensions for executing components in languages other than Java. Extensions have already been created for Python and Common Lisp. We have also begun migrating Nora, MONK, and M2K components to SEASR, in addition to integrating some existing tools, like D2K, Weka, and UIMA.
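To give a flavor of what a dataflow component looks like conceptually, here is a hypothetical sketch. The FlowComponent interface below is invented for illustration only; the real Meandre component API, and the Python and Common Lisp extension mechanisms, differ from it.

```java
import java.util.Map;

/** Invented interface for illustration; the real Meandre component
    API is not reproduced here. */
interface FlowComponent {
    /** Called once before the flow runs. */
    void initialize();
    /** Consumes one set of named inputs and produces named outputs. */
    Map<String, Object> execute(Map<String, Object> inputs);
    /** Called once when the flow shuts down. */
    void dispose();
}

/** Example component: lowercases incoming text. */
class LowercaseComponent implements FlowComponent {
    public void initialize() {}
    public Map<String, Object> execute(Map<String, Object> inputs) {
        String text = (String) inputs.get("text");
        return Map.of("text", text.toLowerCase());
    }
    public void dispose() {}
}

public class FlowComponentSketch {
    public static void main(String[] args) {
        FlowComponent c = new LowercaseComponent();
        c.initialize();
        System.out.println(c.execute(Map.of("text", "SEASR"))); // {text=seasr}
        c.dispose();
    }
}
```

A flow, in this picture, is just a wiring of such components so that one component’s named outputs feed another’s named inputs; the execution environment handles scheduling and data transport.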

Our community-building efforts are as far along as our technical development. Again, from the retreat report:

Because SEASR is a cyberinfrastructure project, we have targeted computational humanists as our primary community, with traditional humanists as a larger, secondary community. To create a community for SEASR from these potential bases of support, we have participated in conferences to advertise the project, network, and gain feedback (see marketing/evangelism); gathered functional, data-related, user-interface, and usability requirements; met with local advisors (John Unsworth, Kevin Franklin, Vernon Burton, Stephen Downie, Donna Cox); engaged in collaborative workshop planning; maintained project partnerships; and grown our network through follow-up contacts and partnership discussions (see synergy with other projects). Not only are our project advisors active members in SEASR’s constituent communities, but our partner projects also connect us to developers and researchers at many institutions. At MONK, for example, we work closely with, among others, Martin Mueller (Northwestern U.), Catherine Plaisant (U. Maryland), Matthew Kirschenbaum (U. Maryland), Stephen Ramsay (U. Nebraska), Stan Ruecker (U. Alberta), and Stefan Sinclair (McMaster U.). Our future collaboration with NEMA will involve Stephen Downie (UIUC), Ichiro Fujinaga (McGill U.), David De Roure (U. Southampton, UK), Mark Sandler (Queen Mary, U. London, UK), Tim Crawford (Goldsmiths, U. London, UK), and David Bainbridge (U. Waikato, NZ). […]

SEASR’s 2007 marketing efforts include a website and conference participation in the US and UK aimed at identifying user needs, promoting the project, networking within the digital humanities community, and identifying and engaging research collaborators in technology and humanities scholarship. These conferences were:

* HASTAC Conference (April 19-21, 2007, Durham, NC)
* e-Science for Arts and Humanities Research: An Early Adopters’ Forum (June 1-2, 2007, Urbana, IL)
* Digital Humanities 2007 (June 4-7, 2007, Urbana, IL): SEASR BOF
* UK e-Science All Hands Meeting 2007 (September 9-13, 2007, Nottingham, England): SEASR presentation
* Third International Conference on E-Social Science (October 8-9, 2007, Ann Arbor, MI)
* Chicago Digital Humanities Colloquium (October 21-22, 2007, Chicago, IL)
* IEEE VIS 2007 (October 27-November 1, 2007, Sacramento, CA)
* Service Oriented Computing in the Humanities (December 17-18, 2007, London, England): SEASR presentation

In addition, we have actively participated in the MONK project, including weekly collaborative cell calls, a hackfest, and an All Hands meeting. In the coming year, we will continue this pattern of presenting the project, networking with members of the community, contributing to partner projects, and engaging new partners and researchers.

Complete retreat reports from participating projects, including SEASR, are given here.


Loretta Auvil and Amit Kumar participated in MONK’s latest Hackfest (February 7-10, 2008, Chicago).

In preparation for the meeting, Peter Groves produced an icon to suggest how well a particular file or feature contributes to supervised classification, a feature MONK anticipates adding to the feature display in the Search by Example toolset. At the meeting, Amit Kumar (who is tasked with developing the MONK workbench) and other MONKies connected new proxy calls through the workbench, which will include SEASR calls. Loretta Auvil began work on an unsupervised classification, through SEASR, of the TEI-A version of the witchcraft files, to advance research for Dr. Kirsten Uszkalo’s use case.

At meeting’s end, the MONK team requested that SEASR develop a clustering tool written in Google Web Toolkit, to be tested on the Nineteenth-Century Fiction and Witchcraft databases.

Stuart Dunn and Tobias Blanke discuss SEASR in their report of the UK e-Science All Hands 2007 meeting published in D-Lib Magazine (January/February 2008, Vol. 14 No. 1/2), “Next Steps for E-Science, the Textual Humanities and VREs: A Report on Text and Grid: Research Questions for the Humanities, Sciences and Industry.”

Of SEASR, the authors write, “Thinking in terms that reach beyond conventional library frameworks highlights a need to consider the process by which unstructured data becomes structured. This was the primary issue considered by Loretta Auvil from the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, who presented on the Software Environment for the Advancement of Scholarly Research (SEASR) project. This API-driven approach enables analyses run by text mining tools, such as NoraVis (http://www.noraproject.org/description.php) and Featurelens (http://www.cs.umd.edu/hcil/textvis/featurelens/) to be published to web services. This is critical: a VRE that is based on digital library infrastructure will have to include not just text, but software tools that allow users to analyse, retrieve (elements of) and search those texts in ever more sophisticated ways. This requires formal, documented and sharable workflows, and mirrors needs identified in the hard science communities, which are being met by initiatives such as the myExperiment project (http://www.myexperiment.org). A key priority of this project is to implement formal, yet sharable, workflows across different research domains. As different research domains have very different protocols for structuring and managing textual archives, the utility of being able to use tools such as Nora and Featurelens in a SEASR-type environment will become ever more important in the development of VREs for textual studies. For example, a numerical extraction system like that presented by the Open Boek project has significant utility when applied to archaeological reports, but such utility is clearly not confined to that domain. In the scientific communities, there has been interest in digital versions of lab books in VREs (http://www.vre.ox.ac.uk/ibvre/index.xml.ID=evaluation). Numeric data is likely to be critical to such exercises. Like Open Boek, the JISC-funded Integrative Biology VRE project was also concerned with the textual context of numbers: it found that digital recognition of equations was a significant problem, a clear case of crossover. Such analyses could, in theory, be delivered to the user by an architecture like that described by Auvil.”

The authors conclude, “[…] Although Web 2.0 has not revolutionized scholarly research in the way envisaged originally, researchers need to be able to annotate texts on which they are working, and to be able to store, search and structure those annotations. In a way, such a structure might resemble a (user-created) digital library within or across other digital libraries. Detailed semantic documentation of the links between the annotation and the annotated text is necessary, along with documentation of when, why and by whom the annotation was created. Furthermore, it would be highly desirable for any additional chunks from separate texts that may be relevant to the annotation (e.g., containing the same name, geographic reference, numeric data, etc.) to be identified: the workflow management architectures presented both by SEASR and GATE suggest this is possible.”