Bioinformatics Open Source Conference (BOSC)

Title: BioMart - a federated query architecture

BioMart is a simple, query-oriented data integration system based on distributed data warehousing ideas. It offers a flexible, fast and practical data-mining framework for computer-savvy bioinformaticians as well as life scientists without any programming experience. Originally developed as EnsMart for Ensembl, it has now been successfully applied to a variety of biological databases, which can be accessed via the web and standalone interfaces.

The BioMart suite consists of a relational database schema specification, an XML-based configuration system, administration tools for configuring and deploying BioMart databases, and data access software written in Perl and Java. A universal, query-optimised database schema, coupled with domain-agnostic software, is responsible for the key features of the BioMart system: generic applicability, large query network-scalability and RDBMS-platform portability. Thus, the system can be readily deployed to provide a unified set of query interfaces to data sources residing anywhere on the available network. In addition, simultaneous querying of multiple data sources spread over any number of servers is supported via query-chaining.
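Query-chaining can be pictured as feeding the results of one dataset query into the filter of the next. The following is a minimal in-memory sketch in Python, not BioMart's actual API: the two datasets, their attribute names, and the `query` and `chain` helpers are all hypothetical and stand in for marts hosted on separate servers.

```python
from typing import Dict, List, Set

Row = Dict[str, str]

# Two hypothetical "marts", standing in for datasets on separate servers.
GENE_MART: List[Row] = [
    {"gene_id": "G1", "chromosome": "1"},
    {"gene_id": "G2", "chromosome": "2"},
    {"gene_id": "G3", "chromosome": "1"},
]
PROTEIN_MART: List[Row] = [
    {"gene_id": "G1", "protein_id": "P10"},
    {"gene_id": "G2", "protein_id": "P20"},
    {"gene_id": "G3", "protein_id": "P30"},
]


def query(dataset: List[Row], filters: Dict[str, Set[str]],
          attributes: List[str]) -> List[Row]:
    """Return the requested attributes of rows that satisfy every filter."""
    results = []
    for row in dataset:
        if all(row.get(name) in allowed for name, allowed in filters.items()):
            results.append({a: row[a] for a in attributes})
    return results


def chain(first: List[Row], first_filters: Dict[str, Set[str]], link: str,
          second: List[Row], attributes: List[str]) -> List[Row]:
    """Query `first`, then use its `link` values as a filter on `second`."""
    linked = {r[link] for r in query(first, first_filters, [link])}
    return query(second, {link: linked}, attributes)


# Proteins for all genes on chromosome 1, resolved across both datasets.
proteins = chain(GENE_MART, {"chromosome": {"1"}}, "gene_id",
                 PROTEIN_MART, ["protein_id"])
print(proteins)
```

In a real deployment each `query` call would be dispatched to whichever server hosts the dataset; the chaining logic itself only needs the shared linking attribute.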

BioMart is an open source project and all software is licensed under the LGPL.

Author: Chris Mungall

Title: BioMake: Functional Logical Task Management for Bioinformatics

A recurring pattern in bioinformatics architectures is the build pattern, or pipeline. This can be defined as a computational specification or template defining a collection of interdependent tasks. Examples include biological sequence analysis pipelines and data transformation pipelines (import and export of flatfiles, XML and reports to and from relational databases).

Approaches range from the lightweight and generic to heavy-duty frameworks honed specifically for bioinformatics compute pipelines. An example of the former is the UNIX Makefile, a configuration of tasks in which some files must be updated automatically whenever the files they depend on change; it is primarily used for program compilation. Examples of the latter include object-oriented systems such as BioPipe, which are tightly integrated with the BioPerl library.

For our in-house task management we required something similar to Makefiles in terms of level of abstraction and simplicity, yet without the limitations of Makefiles and related systems (ant, scons, build, etc). In particular we needed:

- Asynchronous task management on compute farms
- Choice of either relational database or filesystem for storing build targets
- A cleaner specification language
- Fully programmable logic within the Makefile specification

Our solution "BioMake" covers these requirements. It uses a declarative language based around the concept of skolem functions. Each task in the pipeline is specified as a function construct; for example, in a genomic compute pipeline there may be function constructs "blastx(Seq,DB)" and "genscan(Seq)". Each function construct represents a unique and persistent identifier for the output of an executable. Functions can be nested; for example "genscan(repeatmask(gi2177872))" represents the results of running Genscan on a particular RepeatMasked sequence. Dependent tasks are also specified as functions, and variable unification is used as an alternative to Makefile-style pattern matching. Actions can be parameterized using functions and variables. Functions are evaluated to locators of the target data; for example, a filesystem path, or primary key value in a database.
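The idea of nested function constructs acting as persistent identifiers that evaluate to locators can be illustrated with a small Python sketch. This is not BioMake itself (which is implemented in Prolog and uses variable unification, not shown here); the `Term` class, the `locator` naming scheme and the `build` directory are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Term:
    """A function construct such as genscan(repeatmask(gi2177872))."""
    name: str
    args: Tuple[Union["Term", str], ...]

    def __str__(self) -> str:
        inner = ",".join(str(a) for a in self.args)
        return f"{self.name}({inner})"


Target = Union[Term, str]


def locator(target: Target, root: str = "build") -> str:
    """Evaluate a target to a (hypothetical) filesystem path."""
    if isinstance(target, str):
        # A plain atom is treated as a source sequence file.
        return f"{root}/{target}.fasta"
    # A nested term gets a flat, reproducible file name.
    flat = str(target).replace("(", "-").replace(")", "").replace(",", "_")
    return f"{root}/{flat}.out"


def build_order(target: Target) -> List[Target]:
    """List targets so that every input precedes the task that consumes it."""
    order: List[Target] = []
    inputs = target.args if isinstance(target, Term) else ()
    for dep in inputs:
        for step in build_order(dep):
            if step not in order:
                order.append(step)
    order.append(target)
    return order


target = Term("genscan", (Term("repeatmask", ("gi2177872",)),))
for step in build_order(target):
    print(step, "->", locator(step))
```

Because each term is an immutable value, the same construct always maps to the same locator, which is what lets it serve as an identifier for the output of a task.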

The task management engine is implemented in Prolog, and pipeline specifications can use the Prolog code to provide full programmability. Prolog is a declarative logic language and is particularly suited to Makefile-style logic. However, the pipeline programmer does not need to know Prolog in order to construct or understand useful protocols.

The intention is to allow simple and concise specification of complex pipelines. BioMake requires no object-oriented programming, and is not tied to any particular language. We provide example customizable compute pipelines which utilise standard bioinformatics analysis programs such as BLAST, and infrastructure programs such as the Apollo Bop parser, XSLT transforms and scripts using BioPerl.

More information on the system underlying BioMake can be found here

Presentation Slides

Author: Toshiaki Katayama

Title: BioRuby + KEGG API + KEGG DAS = wiring knowledge for genome and pathway

We have developed BioRuby, a bioinformatics library for the Ruby language, which enables users to write analysis pipelines easily. Here we show the recent developments and how to integrate BioRuby with the KEGG web services (API and DAS) to automate your genome and pathway analysis procedures. Note that KEGG API is a SOAP/WSDL-based web service providing gene and pathway information. KEGG DAS is also a web service, providing genomic sequences and gene annotations via the DAS protocol. Both services are also developed by us, and KEGG (Kyoto Encyclopedia of Genes and Genomes) is freely accessible at http://www.genome.ad.jp/kegg/

Project pages: BioRuby, KEGG API, KEGG DAS

License: LGPL

On behalf of BioRuby project, Toshiaki Katayama

Presentation Slides

Author: Levinson, Gene (NIH/NCI)

Title: caBIOperl: A new Perl API to the NCI's biomedical domain object middleware

A reality of the bioinformatics community, and one of its strengths, is its diversity, including the range of programming languages that are utilized. However, this poses an accessibility problem for federated web-based resources, unless the APIs and databases can be readily accessed by diverse software development languages. The U.S. National Cancer Institute Center for Bioinformatics (NCICB) addresses this issue by providing a diversified set of open-source application programming interfaces to its caCORE system. These interfaces, part of the object-oriented middleware component known as caBIO, allow developers to write caCORE-powered applications using their choice of a native Java API, a SOAP-XML API, or even a simple HTTP-XML interface.

Each of these APIs delivers the same data and conforms to the same domain object model.

Since caBIO was first released, Perl programmers have found it rather inconvenient to access the caCORE system because (1) they have to package their search criteria in SOAP or HTTP format and send the request to the caCORE server via the respective protocol; and (2) they have to parse the returned XML to extract the information they need. This has proven burdensome. For this reason we undertook the development of a new Perl API, recently released and named caBIOperl.

caBIOperl is completely object-oriented. It provides an abstraction layer over SOAP and XML, so that Perl programmers work with caBIO objects, much as a Java programmer does with the native caBIO Java API.

caBIOperl wraps the lower-level SOAP and DOM packages, and thus shields the developer from needing to understand SOAP or parse the XML. The first public release came out in April, 2004, and provides query access to 32 caBIO objects, including ClinicalTrialProtocol, Pathway, and Gene.
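The wrapping described above, in which the developer receives domain objects and never touches the XML, can be sketched as follows. This is an illustrative Python analogue, not caBIOperl or the caBIO wire format: the XML payload, the `Gene` fields, and the `parse_genes` and `search_genes` helpers are hypothetical, and the network transport is replaced by a stub that returns canned data.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical response; the real caBIO XML format is not shown in the abstract.
SAMPLE_RESPONSE = """
<results>
  <Gene id="1"><symbol>TP53</symbol><taxon>Homo sapiens</taxon></Gene>
  <Gene id="2"><symbol>BRCA1</symbol><taxon>Homo sapiens</taxon></Gene>
</results>
"""


@dataclass
class Gene:
    id: str
    symbol: str
    taxon: str


def parse_genes(xml_text: str) -> List[Gene]:
    """Turn the XML payload into plain Gene objects."""
    root = ET.fromstring(xml_text.strip())
    return [
        Gene(
            id=node.get("id", ""),
            symbol=node.findtext("symbol", default=""),
            taxon=node.findtext("taxon", default=""),
        )
        for node in root.iter("Gene")
    ]


def stub_transport(criteria: Dict[str, str]) -> str:
    """Stand-in for the SOAP or HTTP round trip; always returns canned XML."""
    return SAMPLE_RESPONSE


def search_genes(symbol: str,
                 transport: Callable[[Dict[str, str]], str] = stub_transport
                 ) -> List[Gene]:
    """Hide request packaging and XML parsing behind one object-returning call."""
    genes = parse_genes(transport({"symbol": symbol}))
    return [g for g in genes if g.symbol == symbol]


for gene in search_genes("TP53"):
    print(gene.id, gene.symbol, gene.taxon)
```

The caller only sees `search_genes` and `Gene`; swapping the stub for a real transport would not change the calling code.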

caBIOperl thus provides native Perl access that allows developers to customize queries according to the specialized needs of their local investigative teams. caBIOperl modules can be downloaded from the caBIO section of the NCICB download site.

Presentation Slides

Author: Gessler, D., G. Schiltz, & L. Stein.

Title: Semantic MOBY as a World Wide Web architecture for bioinformatic interoperability

MOBY is an open source project for achieving interoperability in bioinformatics. Research and development has proceeded along a dual-development track that consists of MOBY Services (with an emphasis on SOAP technologies in a web services model) and Semantic MOBY (with an emphasis on RDF/OWL-DL in a semantic web model). Semantic MOBY is designed specifically to operate in a nebulous and ever-changing world. In Semantic MOBY we identified three problems that are hindering widely deployable, scalable interoperability, namely: i) the fatal mutability of traditional interfaces (if a provider changes its interface, client code depending on that interface fails en masse); ii) the rigidity and fragility of static classification schemes (changing the properties of a class near the root of an inheritance hierarchy simultaneously affects the entire sub-tree); and iii) the confounding of structure and content (content is entangled with the presentation layer and/or implicit behaviors of the presentation software).

Addressing these problems essentially recasts the problem of interoperability from being one of simply specifying a syntax and messaging layer for syntactically connecting clients and providers via information in a registry look-up, to being one of providing clients and providers a way to semantically describe their data and identify data relevant to them. Our measure of success is to build an architecture that delivers: i) a common syntax; ii) shared semantics and a mechanism for semantic negotiation; and iii) a discovery mechanism. This talk presents the Semantic MOBY architecture and API and shows how this is accomplished.

Website: www.biomoby.org

Open Source License: Perl Artistic License

Presentation Slides

Author: Thomas Down

Title: BioJava

BioJava is a pure Java framework which is useful for developing a wide range of bioinformatics software, from small research scripts to complex interactive applications. It includes powerful object models for handling sequence and other kinds of biological data, and tools for integrating and querying this information. It also provides a solid foundation for developing novel analysis methods. General-purpose implementations of techniques such as Hidden Markov Models and support vector machines are included in the package.

BioJava was first released over four years ago. It is now an established project and is widely used and supported around the world. Significant improvements in the past year include the addition of a data model for 3D structure information, better database support, and improvements that make BioJava more powerful in a distributed computing environment.

I will be talking about the status of the BioJava project and the kind of problems for which it has proven useful, discussing its future directions, and considering the issues involved in maintaining a large software library.

URL: http://www.biojava.org/

Licence: LGPL

Author: Peter van Heusden

Title: Applying software validation techniques to Bioperl

With computer software playing an increasingly pervasive role in society, the risks associated with software failures have begun receiving more attention. Infamous examples of such software failures include the loss of the Mars Climate Orbiter (a victim of a metric vs. imperial unit conversion error) and the fatal overdoses administered by the Therac-25 medical accelerator (caused by an integer overflow). Even when not catastrophic, software failure can be extremely costly: the US Commerce Department's National Institute of Standards and Technology (NIST) estimated in 2002 that poor-quality software costs US businesses nearly $60 billion per year.

Concern about the costs and other risks of software failure has led to increasing interest in 'software validation'. The US FDA defines software validation as "confirmation by examination and provision of objective evidence that software specifications conform to user needs and intended uses, and that the particular requirements implemented through software can be consistently fulfilled." In the commercial world, this process of examination and evidence gathering tends to be specified by formal procedures (e.g., TQM and ISO 9001) applied in the context of formal software development methodologies.

In the open source world, collaborative development makes formal procedures hard to apply. Instead, open source projects rely on "many eyes mak[ing] all bugs shallow" (Eric S. Raymond). Unfortunately, however, in a large project like Bioperl, not all components are used equally frequently, and thus not every component is examined equally thoroughly or often.

In order to remedy these shortcomings of the open source development process, a systematic approach is needed. The existing code, tests and documentation must be examined from the point of view of validation, allowing us to bridge the gap between cooperative development (open source), and the more formal, contractual space of commercial development.

We have established a validation process and applied it to Bioperl. The resulting validation framework has been developed in such a way that it can be applied readily to other open source projects (e.g. Biojava). The validation process, including documentation, Bioperl code changes and novel test code developed will be described, as well as the overall quality, reliability and usability improvements that result. We aim to demonstrate how validation of Bioperl significantly increases its value for all stakeholders.

LICENSING: The Bioperl project addressed in the talk is licensed under the Perl Artistic License, an accepted open source license according to the Open Source Initiative. The work performed by Electric Genetics, as described in the talk, results in two outcomes:
1) ongoing contributions to the Bioperl suite, including improved error handling, bug fixes and code additions. These all fall under the Perl Artistic License and will form significant contributions to the open source project.
2) commercial documentation and validation suite, offered to clients as a commercial product. The documentation will be provided to paying clients on a commercial basis and, thus, will not be immediately placed in the Bioperl repository. The validation suite will be retained by Electric Genetics and validation services offered to clients. If a client wishes to purchase the validation suite, it will be licensed using a commercial license.

The business and licensing model we describe is similar to that of e.g. Novell, who offer both commercial products (e.g. the Linux admin product Red Carpet) as well as ongoing contributions to open source projects.

PROJECT URL: http://www.egenetics.com/opensource.html

Presentation Slides

Author: Michel Dumontier

Title: The NCBI C++ Software Development

The NCBI is the host and developer of some of the world's largest bioinformatics projects. As such, it has developed an extensive, powerful, documented and freely available bioinformatics programming platform that contains a rich and robust set of functionalities designed to handle the intrinsic complexities of biology. The NCBI C++ toolkit provides portable application framework classes for argument processing, diagnostics, exceptions, connection streams, stream wrappers and threads. The C++ code generator tool transforms ASN.1 data specifications into a ready-to-use, error-free set of C++ classes and functions, liberating the programmer from writing class variable methods while providing garbage collection and object serialization to ASN.1/XML. An object manager facilitates heterogeneous access to biological sequence data for annotation and display. Moreover, the toolkit offers excellent support for database-independent projects and complex CGI applications. This talk will provide a high-level overview of the features and tools available in the NCBI C++ toolkit that enable computational investigations in biology by third-party developers.

URL: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/

Presentation Slides

Author: Martin Senger

Title: Life Sciences Identifiers. Finally?

Life Sciences Identifiers (LSIDs) are persistent, location-independent, resource identifiers for uniquely naming biologically significant resources including but not limited to individual genes or proteins, or data objects that encode information about them.

Their specification includes not only their syntax but also defines a set of middleware-independent interfaces for resolving the identifiers and allowing access to their associated metadata (such as annotations).

The LSID Assigning service is responsible for creation of LSIDs for given data entities.
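As an illustration of the identifier syntax mentioned above, the sketch below splits an LSID into its parts, assuming the URN form `urn:lsid:authority:namespace:object[:revision]` with an optional revision. The `LSID` tuple and `parse_lsid` function are hypothetical helper names, the example identifier is made up, and the check is deliberately loose: it does not validate the characters allowed in each component, nor does it resolve the identifier.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"wrong number of components in {text!r}")
    # The 'urn:lsid' prefix is matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing urn:lsid prefix in {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in {text!r}")
    # A trailing empty revision is treated the same as no revision.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# Made-up identifier, used only to exercise the parser.
example = parse_lsid("urn:lsid:example.org:genes:abc123:2")
print(example.authority, example.namespace, example.object_id, example.revision)
```

Keeping the parsed identifier as an immutable tuple mirrors the intent that an LSID names one fixed resource; resolution and metadata access would be separate services layered on top.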

URL: http://www.omg.org/cgi-bin/doc?lifesci/03-12-02
http://www-124.ibm.com/developerworks/oss/lsid/

Presentation Slides

Author: Henning Hermjakob

Title: The PSI MI standard - open analysis of protein interaction data

The HUPO PSI protein interaction work group has jointly developed an XML standard for the representation of protein interaction data, the PSI MI format. PSI MI data is now available from major interaction data providers, including DIP, MINT, and IntAct. Based on the PSI MI standard, database and analysis tools from different providers can be joined to efficiently analyse and manipulate protein interaction data. We will present IntAct, an open source protein interaction database and analysis tool which provides extensive PSI MI support. The web interface provides both textual and graphical representations of protein interactions, and allows interaction networks to be explored in the context of the GO annotations of the interacting proteins. IntAct is Java-based, with Jakarta OJB object-relational mapping to Postgres or Oracle. PSI MI upload and download are possible, as well as dynamic access to interaction networks by a web service or search URL. The direct URL access allows PSI MI data to be loaded and further analysed in the open source tools ProViz and Cytoscape. These, in turn, provide a choice of fast network visualisation algorithms, integration with expression data, path finding and clustering in interaction networks.

Project URLs:
http://psidev.sf.net
http://intact.sf.net
http://www.cytoscape.org

Presentation Slides

Author: Lincoln Stein

Title: GMOD: The Generic Model Organism Database Project

The Generic Model Organism Database (GMOD) Project is an open source project to develop a complete set of software for creating and administering a model organism database. Components of this project include genome visualization and editing tools, literature curation tools, a robust database schema, biological ontology tools, and a set of standard operating procedures. This project is funded by the NIH and the USDA Agricultural Research Service, with participation from members of several database projects, including WormBase, FlyBase, Mouse Genome Informatics, Gramene, the Rat Genome Database, TAIR, EcoCyc, and the Saccharomyces Genome Database.

Released modules include Chado, a flexible modular relational schema for genome information; Apollo, a genome feature editor and curator's tool; GBrowse, a flexible web-based genome browser; Textpresso, a paper indexing and search tool; the PubSearch/PubFetch literature curation tools; and Caryoscope, a gene expression visualization tool. Over the next year we will be releasing more components, ultimately creating a model organism database construction set.
This talk will survey the released and pending GMOD tools, and describe how they can be used for a variety of large and small projects. The project URL is http://www.gmod.org

GMOD is released under a variety of Open Source licenses, primarily the Perl Artistic License and GNU GPL.

Author: Ewan Birney

Title: Ensembl - a portable Genome toolkit

Ensembl is a genome information system designed for handling large genomes, in particular human, mouse and other vertebrates. Its major code bases can be broken down into three sections: a core relational schema and API, a computational pipeline system and a user-friendly web site. The Ensembl system has been designed principally to enable biologists to use vertebrate genomes, but the source code of Ensembl is open source and there has been increasing modularisation and clean-up of the system. This means that Ensembl software has become increasingly useful as a toolkit in itself for other genomes: we currently know of at least 8 genomes that have been loaded and displayed using the Ensembl software outside of the main Ensembl group.

I will present the aspects of Ensembl which are most open to reuse, in particular how to load a new genome into Ensembl from existing flat-file annotation and run it, and give a sense of how to extend Ensembl, either using the configurable DAS protocol or via schema additions. I will also briefly outline the main concepts behind the pipeline.

License: BSD-style.

Presentation Slides

Author: Steve Fischer

Title: GUS - A Functional Genomics Infrastructure System

The Genomics Unified Schema (GUS) is a functional genomics infrastructure system in use at about 20 projects across approximately a dozen institutions. GUS was developed at the Computational Biology and Informatics Lab (CBIL) as the infrastructure for PlasmoDB, EPConDB and AllGenes. Over the last year we have packaged GUS for distribution and moved its development to open source, which has resulted in an active user and development community.

GUS includes a relational schema with more than 400 tables and views covering approximately 50 functional genomics concepts. The schema is organized into five name spaces. DoTS covers the central dogma (genes, RNAs, proteins); sequence and features; and reagents, including clones, mapping and gene traps. RAD covers microarray experiments in a MIAME-compliant representation. TESS covers transcription region regulation. SRes covers controlled vocabularies, including about a dozen standards-based vocabularies and ontologies. Finally, Core covers non-biological concepts used to track users and data.

Upcoming schema expansion includes additional technologies (2-D gel and mass spectrometry, in situ hybridizations) that will make use of common experimental design and sample tables currently residing in the RAD schema. We plan to work with emerging standards efforts for these domains paralleling our involvement in the MGED effort for microarray experiment information.

GUS also provides an application framework that includes a Perl and Java object-relational layer; a Data Load API; many "plugins" to load standard data sources; a Pipeline API to specify analysis protocols; and a Web Development Kit (WDK). The WDK assists in the development of data-mining oriented websites such as PlasmoDB. It provides a servlet framework; a declarative format to specify queries, results and records; page layout; many sample queries; and query result caching. The next generation WDK is under development in collaboration with the GeneDB project at the Pathogen Sequencing Unit of the Sanger Institute, and uses a Struts and JSP based model-view-controller design.

GUS runs under Linux, Tomcat and Oracle. PostgreSQL compatibility is near completion. The source is freely available.

Homepage: www.gusdb.org

Author: James Gilbert

Title: The Otter Annotation System

The VEGA database presents high quality manual annotation of finished vertebrate genomes. Until recently the finished clones that constitute the tiling path of the chromosome were annotated individually. Tags in the data objects that represented parts of RNA transcripts that span several clones were used to describe how they should be fused. Fusing occurred during a conversion process that created an Ensembl database containing the complete gene structures.

The otter project was developed in order to present the annotator with a view of a contiguous region of a chromosome made from several clones, and to avoid the conversion step by storing the annotation directly in an Ensembl database.

The gene annotation data is passed between the annotation client and Ensembl database server in an XML format. The XML contains the clone assembly information along with the gene structure data. It is hoped that the XML format will be adopted as an exchange format by other centers who wish to display their annotation in VEGA.

The otter schema is an extension of the Ensembl database SQL schema. Additional tables store textual information about transcripts, genes and clones added by the annotator, implement a clone-level locking mechanism, and keep track of the authors of particular annotations. These are accompanied by corresponding additions to the Ensembl Perl API. A lightweight HTTP server written in Perl, otter_srv, exchanges XML with the client and saves the annotator's changes to the MySQL otter database in a single transaction.
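The combination of clone-level locking and saving changes in a single transaction can be sketched as below. This is not the otter schema or otter_srv: it uses Python's built-in `sqlite3` module with an in-memory database in place of MySQL, and the table names, column names, clone name and helper functions are all hypothetical.

```python
import sqlite3
from typing import List


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE clone_lock (clone_name TEXT PRIMARY KEY, author TEXT NOT NULL);
        CREATE TABLE gene_remark (clone_name TEXT, remark TEXT, author TEXT);
    """)
    return db


def lock_clone(db: sqlite3.Connection, clone: str, author: str) -> bool:
    """Take the lock; the primary key ensures one holder per clone."""
    try:
        with db:
            db.execute("INSERT INTO clone_lock VALUES (?, ?)", (clone, author))
        return True
    except sqlite3.IntegrityError:
        return False


def save_annotation(db: sqlite3.Connection, clone: str, author: str,
                    remarks: List[str]) -> None:
    """Write all remarks and release the lock, or change nothing at all."""
    holder = db.execute(
        "SELECT author FROM clone_lock WHERE clone_name = ?", (clone,)
    ).fetchone()
    if holder is None or holder[0] != author:
        raise PermissionError(f"{author} does not hold the lock on {clone}")
    with db:  # commits on success, rolls back if an exception escapes
        for remark in remarks:
            if not remark:
                raise ValueError("empty remark")
            db.execute("INSERT INTO gene_remark VALUES (?, ?, ?)",
                       (clone, remark, author))
        db.execute("DELETE FROM clone_lock WHERE clone_name = ?", (clone,))


db = open_db()
print(lock_clone(db, "AC000001", "alice"))  # first annotator gets the lock
print(lock_clone(db, "AC000001", "bob"))    # second annotator is refused
save_annotation(db, "AC000001", "alice", ["putative kinase"])
```

Releasing the lock inside the same transaction as the writes means a failed save leaves both the annotation and the lock exactly as they were.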

The annotators' graphical interface, otterlace, now incorporates a number of improvements, such as the display of gapped alignments of sequence database hits to the genomic sequence.

The core otter software is available, under the same licence as Ensembl, by anonymous CVS (package ensemblotter) from cvs.sanger.ac.uk, where it will be joined by the otterlace client software. It is anticipated that a packaged distribution will also be created. The code is already in use by some of our collaborators outside the Sanger Institute.