Bioinformatics Open Source Conference 2002

Table of Contents
Welcome
Abstracts

Welcome

Welcome to BOSC 2002! This is the 3rd official Bioinformatics Open Source Meeting and the first since the creation of the Open Bioinformatics Foundation. We're very excited to have full days of talks this year, with 3 keynote speakers who will talk about open source principles in the context of their scientific work. We hope the field of invited speakers will provide a broad perspective on ongoing projects that focus on needs in the community, from infrastructure and middle layers to user applications, novice user education, and broader treatments of open source principles in the academic field. A poster session will take place on the afternoon of August 1; please come by and discuss the poster presentations with the authors. Software demonstrations will take place during this period as well. In the evenings we have scheduled space for Birds of a Feather (BoF) discussions; please take advantage of this time to attend discussions on specialized topics. If you would like to schedule a BoF, see the signup chart that will be available in the mornings. If you have any questions or concerns about the conference, please let one of the conference committee members know.

We hope you enjoy yourself, learn a lot, and most importantly get to know each other and become part of the community of open source development in the life sciences.

Conference Committee

Jason Stajich (chair)	Duke University
Ewan Birney	European Bioinformatics Institute
Chris Dagdigian	BioTeam.Net
Andrew Dalke	Dalke Scientific Software
Nomi Harris	University of California, Berkeley

Abstracts

Cancer Bioinformatics Infrastructure Objects (caBIO): An open-source, object oriented API for biomedical informatics

Peter A. Covitz, Himanso Sahni, Scott Gustafson, and Kenneth Buetow; National Cancer Institute Center for Bioinformatics

August 1, 11:25-11:35

The National Cancer Institute has established a Center for Bioinformatics (NCICB) whose mission is to support the NCI's programs in basic and clinical cancer research. The NCICB is aggressively pursuing a program to develop a core infrastructure and API for biomedical information management and retrieval. The initiative employs industry-standard software engineering methodologies to develop data models, middleware, vocabularies and ontologies for biomedical research.

caBIO is the primary programming interface to the core infrastructure. caBIO objects are implemented using Java and Java Bean technology, and represent biological and laboratory entities such as genes, chromosomes, sequences, libraries, clones, pathways, and ontologies. caBIO provides uniform API access to a variety of genomic, biological, and clinical data sources including GenBank, Unigene, LocusLink, Homologene, Ensembl, Golden Path, DAS servers, CGAP, NCI Enterprise Vocabulary Services, and clinical trials protocols. Any client can retrieve HTML and XML from caBIO via HTTP. Java-based clients can further communicate with caBIO via the domain objects provided by the caBIO JAR, while server components can communicate via Java RMI. Non-Java applications can communicate via SOAP. RDF is currently used to advertise services to crawlers and agents, and a UDDI registry is planned. For its presentation layer, caBIO uses servlets and JSPs under Jakarta Tomcat. All caBIO objects can be transformed into XML, and XSL/XSLT is used to present data in documents, web pages or other interfaces.

NCICB makes the caBIO interfaces available on its public servers, and also makes the underlying software available for use at local sites. More information is available at http://ncicb.nci.nih.gov/core. The open source license covering caBIO software can be found at http://ncicb.nci.nih.gov/core/caBIO/developer_resources/core_jar/license.

Biopython and the Laboratory Scientist

Brad Chapman, University of Georgia

August 1, 11:35-12:00

Biopython is a collection of open-source tools in the Python programming language. Developed by a collection of programmers from around the world, the Biopython toolkit is designed to provide re-usable code for anyone answering biological questions using Python. Biopython has been around since 1999, and has a number of active contributors and users. In this talk, I will briefly describe the basic components provided in the Biopython toolkit. From there, I will describe how Biopython can be used in an academic laboratory environment, taking examples from my own lab. The emphasis will be on using Biopython code to automate everyday tasks faced by wet lab researchers. I will try to show that Python and Biopython can be used productively by researchers lacking formal training in computer science. Finally, I will describe integrating Biopython into larger bioinformatics projects. Again, this will draw on my own experience using Biopython and will describe how it can help make your coding life easier when approaching a large project. The aim of the entire talk is to convince you that using open source libraries like Biopython is worth the time invested in learning them.

The Open Source Authors' Contract

Steven Brenner, University of California, Berkeley

August 1, 1:50-2:15

Most universities, national laboratories, companies, and other employers have clauses in their employment contracts that prevent or restrict the creation and use of open source software. Indeed, it seems likely that much of the biological open source software is being produced illegally, in violation of institutions' terms. While benign neglect of enforcement of the institutions' regulations has led to a situation that is generally acceptable, it is not ideal.

Several individuals have sought the ability to produce open source software by seeking exemptions or variations of their institutions' intellectual property agreements. However, this is a painstaking process, and the associated legal fees can be costly. I propose that a general contract be drawn up, which has standard terms for individuals to create open source software without undue constraints.

Since this idea was first broached a year ago, there has been widespread discussion regarding regulations governing the production of open source software. This talk will provide background on the motivation for the Authors' Contract, as well as recent responses which suggest productive ways forward.

BioJava Toolkit Progress

Matthew Pocock, BioJava Consulting Limited

August 1, 2:15-2:40

BioJava is an open-source software project that aims to provide an industry-quality Java library for common bioinformatics tasks. BioJava is part of the open-bio foundation. BioJava was started in the autumn of 1998, and now has over 25 developers. In the past two years, the core development team has expanded from the original team of two to five. This has brought with it a greater range of views and expertise, as well as greater stability. In parallel with this, we are in the process of integrating unit testing to maintain the quality of the >130,000 lines of code and documentation in the core library.

BioJava has taken an active role in participating in the open-bio hackathons. Representatives have attended both legs of the hackathon (Tucson, AZ, USA and Cape Town, SA). During this time, several important interoperable technologies were designed and implemented. These include a registry file format for biological entities, an SQL schema for storage of sequences and their annotations, BioCorba-based CORBA clients and servers, bibliographic web services, web services for publishing sequence data, and flat file indexing. All of these have been implemented in BioJava, and interoperate with implementations in the other open-bio language projects, as well as with some external implementations.

Over the next year, we hope to mature the library's functionality in areas related to sequence manipulation, pipeline management, alignments, Sequence GUIs and file parsers. In parallel, we shall be integrating code-generation, more flexible transaction management and ontology representations with the current free-form annotation model and BioJava interfaces to allow the representation of more fluid data types, and more maintainable and robust implementation of standard interfaces.

GOET: the General Ontology Editing Tool

John Richter, Berkeley Drosophila Genome Project

August 1, 2:40-2:50

GOET is a Java application designed to facilitate the creation of ontology schemas and data. GOET allows a user to define DAML+OIL-like schemas and then populate those schemas with data. Data can be loaded from and saved to DAML+OIL flat files, as well as numerous other formats.

GOET is highly customizable via pluggable editor kits. Editor kits are Java jar files that define a custom user interface for GOET, tailored to a particular kind of data. Editor kits allow programmers to create the most efficient user interface for any given ontology. GOET comes with a generic editor kit that can edit any ontology, making it easy for users to experiment with new schemas.

GOET provides a strong toolkit for ontology editing, with automatic support for history tracking, undo/redo, cycle checking, and other important graph editing tools. This toolkit makes it easy for programmers to develop new, powerful editor kits.
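As an illustration of what cycle checking involves when editing an ontology graph, here is a minimal sketch in Python. It is not GOET's code (GOET is a Java application, and its internals are not described here); the function name `creates_cycle`, the `parents` mapping, and the toy terms are all assumptions made for this example. The idea is that adding a link from a term to a new parent creates a cycle exactly when the term can already be reached by walking upward from that parent.

```python
def creates_cycle(parents, child, new_parent):
    """Return True if adding the link child -> new_parent would create a cycle.

    `parents` maps each term to the set of its direct parents
    (a hypothetical in-memory representation, chosen for illustration).
    """
    # Walk upward from new_parent; if we ever reach child, the new link
    # would close a loop. A term linked to itself is also a cycle.
    stack = [new_parent]
    seen = set()
    while stack:
        term = stack.pop()
        if term == child:
            return True
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return False


# Toy ontology fragment; the terms are made up for the example.
ontology = {
    "nucleus": {"organelle"},
    "organelle": {"cell part"},
    "cell part": set(),
}

# Making "cell part" a child of "nucleus" would loop back on itself.
would_loop = creates_cycle(ontology, "cell part", "nucleus")
# Adding a direct link from "nucleus" to "cell part" is harmless.
is_safe = not creates_cycle(ontology, "nucleus", "cell part")
```

An editor could run a check of this kind before committing each new relationship, so that the ontology stays a directed acyclic graph.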

Other information:

GOET is being developed as part of the gmod project at http://sourceforge.net/projects/gmod.

Like all gmod components, GOET is distributed under the terms of the Artistic License.

GHMM & HMMEd: A comprehensive HMM toolkit

Alexander Schliep, Max-Planck-Institut for Molecular Genetics

August 1, 2:50-3:00

Hidden Markov Models (HMMs) are one of the most successful tools for analyzing biological sequences.

We have developed a graphical editor for HMMs called HMMEd, which allows users to create sophisticated models manually using a graphical user interface. Hierarchical models are supported (e.g. a three-state model representing a single codon as one 'super state'), as well as a wide range of HMM extensions and user data associated with the states of the HMM. Graphical editors for discrete emission distributions as well as mixtures of continuous pdfs are integrated.

For the exchange of HMMs we propose an XML-based format which is loosely based on GraphML, is hierarchical, and also incorporates the extensions necessary for proper graphical display.

The GNU (pending permission from the FSF) HMM library (GHMM) is a C library providing efficient implementations of a comprehensive collection of algorithms for both discrete and continuous emission HMMs. Python bindings allow interactive work with HMMs from the Python command line and, at some later stage, tight integration with HMMEd, which is also written in Python using Tkinter.
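To give a flavour of the kind of algorithm such a library provides for discrete emission HMMs, here is a minimal pure-Python sketch of the forward algorithm, which computes the probability of an observation sequence under a model. It does not use the GHMM API or its Python bindings; the function name `forward_likelihood` and the toy model parameters are assumptions made for illustration only.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(observations | model) for a discrete-emission HMM.

    initial[i]        probability of starting in state i
    transition[i][j]  probability of moving from state i to state j
    emission[i][k]    probability that state i emits symbol k
    """
    if not observations:
        return 1.0  # the empty sequence has probability 1
    n_states = len(initial)
    # alpha[i]: probability of the symbols seen so far, ending in state i
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states))
            * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model over the alphabet {0, 1}; all numbers are made up.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]

likelihood = forward_likelihood([0, 1], initial, transition, emission)
```

For long sequences the products underflow, so a practical implementation would rescale `alpha` at each step or work in log space; the sketch omits this for clarity.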

HMMEd (pronounced Hammered) and the GHMM are licensed under the LGPL.

Usability

Andrew Dalke, Dalke Scientific Software

August 1, 3:00-3:10

Open source software is often said to be "unusable." On the surface this doesn't make sense, because many of the projects are widely used to do real work. But usability isn't a binary value; it refers to ease of use. Two packages can be equally featureful, yet one can be much more usable than the other.

A lot of research has gone into understanding how to make more usable software. This knowledge is starting to make its way into mainstream software projects, but is still relatively unused in bioinformatics. I'll discuss several reasons why this might be so, the major one being that few even know this topic exists.

In my talk I'll cover some of the standard techniques of usability design, including testing, persona development, use cases, and paper prototyping. These are simple, inexpensive techniques that can be applied to almost any project to make it more usable and enjoyable. To keep my presentation grounded, I'll include examples from my experiences in applying them to real projects.

Michael Eisen Keynote

Michael Eisen, University of California, Berkeley

August 1, 4:00-5:00

Creating an Electronic Public Library of Scientific Knowledge.

Winston Hide Keynote

Winston Hide, South African National Bioinformatics Institute

August 2, 9:00-10:00

Dr. Hide will be presenting on the impact of Open Source in the Real World.

OmniGene

Brian Gilman, Whitehead Institute

August 2, 10:00-10:25

The OmniGene project has produced modules to build web services for data analysis, integration, and visualization. OmniGene accomplishes this goal through the utilization of Java Enterprise and web service technologies. The core API consists of modules to:

Perform queries across disparate databases,
Transform queries into commonly used XML formats,
Parse the output of these queries into an object graph,
Visualize and share knowledge in a client-server or true peer-to-peer network,
Dynamically discover another web service,
Easily plug in analysis applications.

One major goal of the OmniGene development team is to abstract away XML parsing, Enterprise Java Bean, and transaction code from the bioinformatician so that they may concentrate on their data and data analysis.

The OmniGene system will utilize the output and API from the BioMOBY project to perform dynamic discovery of services. The BioMOBY and OmniGene developers are now working together to integrate their two projects. OmniGene is an open source, open standards initiative and is distributed under the BSD license.