Title: BioMart - a federated query architecture
BioMart is a simple, query-oriented
data integration system based on distributed data warehousing ideas.
It offers a flexible, fast and practical data-mining framework for
computer-savvy bioinformaticians as well as life scientists without any
programming experience. Originally developed as EnsMart for Ensembl, it
has now been successfully applied to a variety of biological databases,
which can be accessed via the web and standalone interfaces.
The BioMart suite consists of a relational database schema specification,
an XML-based configuration system, administration tools for configuring
and deploying BioMart databases, and data access software written in Perl
and Java. A universal, query-optimised database schema and
domain-agnostic software are responsible for the key features of the
BioMart system: generic applicability, large query network-scalability and
RDBMS-platform portability. Thus, the system can be readily deployed
to provide a unified set of query interfaces to datasources residing
anywhere on the available network. In addition, simultaneous querying of
multiple data sources spread over any number of servers is supported
via query-chaining.
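The query-chaining idea can be illustrated with a minimal Python sketch. It is not BioMart code: the two in-memory "datasets", their field names, and the `query` and `chain` helpers are hypothetical stand-ins for marts that would live on separate servers. The sketch only shows the pattern in which the values returned by one query become the filter for the next.

```python
# Hypothetical illustration of query chaining across two data sources.
# The records and field names below are invented for this sketch.

GENES = [  # stands in for a gene dataset on one server
    {"gene_id": "G1", "chromosome": "1", "protein_id": "P1"},
    {"gene_id": "G2", "chromosome": "2", "protein_id": "P2"},
    {"gene_id": "G3", "chromosome": "1", "protein_id": "P3"},
]

PROTEINS = [  # stands in for a protein dataset on another server
    {"protein_id": "P1", "family": "kinase"},
    {"protein_id": "P2", "family": "protease"},
    {"protein_id": "P3", "family": "kinase"},
]


def query(dataset, filters, attributes):
    """Return the requested attributes of every record passing all filters.

    Each filter maps a field name to the set of values it accepts.
    """
    rows = []
    for record in dataset:
        if all(record[field] in allowed for field, allowed in filters.items()):
            rows.append({name: record[name] for name in attributes})
    return rows


def chain(first, second, link_field, first_filters, second_attributes):
    """Query `first`, then use its `link_field` values to filter `second`."""
    links = {row[link_field] for row in query(first, first_filters, [link_field])}
    return query(second, {link_field: links}, second_attributes)


# Proteins encoded by genes on chromosome 1.
result = chain(GENES, PROTEINS, "protein_id",
               {"chromosome": {"1"}}, ["protein_id", "family"])
```

In a real deployment each call to `query` would go to a different server; only the linking values travel between them.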
BioMart is an open source project and all software is licensed under the LGPL.
Title: BioMake: Functional Logical Task Management for Bioinformatics
A recurring pattern in bioinformatics architectures is the build
pattern, or pipeline. This can be defined as a computational
specification or template defining a collection of interdependent
tasks. Examples include biological sequence analysis pipelines and
data transformation pipelines (import and export of flatfiles, XML and
reports to and from relational databases).
Approaches range from the lightweight and generic to heavy duty
frameworks honed specifically for bioinformatics compute pipelines. An
example of the former is UNIX Makefiles, which is a configuration of
tasks where some files must be updated automatically from other files
whenever the other files change, and is primarily used for program
compilation. Examples of the latter include object-oriented systems
such as BioPipe, which are tightly integrated with the BioPerl
library.
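The Makefile rule described above (update a file whenever the files it is derived from change) reduces to a timestamp comparison. A minimal Python sketch, in which the function name `needs_rebuild` is an assumption made for illustration:

```python
import os


def needs_rebuild(target, sources):
    """Return True if `target` is missing or older than any of `sources`."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(source) > target_time for source in sources)
```

A build tool applies this check recursively, bringing each source up to date before testing the target that depends on it.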
For our in-house task management we required something similar to
Makefiles in terms of level of abstraction and simplicity, yet without the
limitations of Makefiles and related systems (ant, scons, build, etc). In
particular we needed:
- Asynchronous task management on compute farms
- Choice of either relational database or filesystem for storing build
targets
- A cleaner specification language
- Fully programmable logic within the Makefile specification
Our solution "BioMake" covers these requirements. It uses a declarative
language based around the concept of skolem functions. Each task in
the pipeline is specified as a function construct; for example, in a
genomic compute pipeline there may be function constructs "blastx(Seq,DB)"
and "genscan(Seq)". Each function construct represents a unique and
persistent identifier for the output of an executable. Functions can be
nested; for example "genscan(repeatmask(gi2177872))" represents the
results of running Genscan on a particular RepeatMasked sequence.
Dependent tasks are also specified as functions, and variable unification
is used as an alternative to Makefile-style pattern matching. Actions can
be parameterized using functions and variables. Functions are evaluated to
locators of the target data; for example, a filesystem path, or primary
key value in a database.
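The function constructs can be pictured with a short Python sketch. BioMake itself is implemented in Prolog, so none of this is its code: `parse_term`, `locator` and `unify` are hypothetical names, the directory layout produced by `locator` is an invented convention, and `unify` performs only one-way matching of a pattern against a fully specified term, which is a restricted form of unification. Names beginning with an uppercase letter are treated as variables, following the Prolog convention.

```python
def parse_term(text):
    """Parse 'f(a,g(b))' into ('f', ['a', ('g', ['b'])]); atoms stay strings."""
    term, rest = _parse(text.replace(" ", ""))
    if rest:
        raise ValueError(f"unexpected trailing text: {rest!r}")
    return term


def _parse(text):
    # Read the function or atom name up to the next delimiter.
    end = 0
    while end < len(text) and text[end] not in "(),":
        end += 1
    name, rest = text[:end], text[end:]
    if not name:
        raise ValueError("expected a name")
    if not rest.startswith("("):
        return name, rest
    args, rest = [], rest[1:]
    while True:
        arg, rest = _parse(rest)
        args.append(arg)
        if rest.startswith(","):
            rest = rest[1:]
        elif rest.startswith(")"):
            return (name, args), rest[1:]
        else:
            raise ValueError("expected ',' or ')'")


def locator(term):
    """Map a term to a path, one directory level per nested function (invented layout)."""
    if isinstance(term, str):
        return term
    name, args = term
    return name + "/" + "_".join(locator(arg) for arg in args)


def unify(pattern, term, bindings=None):
    """Match `pattern` against `term`; return variable bindings, or None on failure."""
    bindings = dict(bindings or {})
    if isinstance(pattern, str):
        if pattern[0].isupper():  # a variable
            if pattern in bindings and bindings[pattern] != term:
                return None
            bindings[pattern] = term
            return bindings
        return bindings if pattern == term else None
    if (isinstance(term, str) or pattern[0] != term[0]
            or len(pattern[1]) != len(term[1])):
        return None
    for sub_pattern, sub_term in zip(pattern[1], term[1]):
        bindings = unify(sub_pattern, sub_term, bindings)
        if bindings is None:
            return None
    return bindings


target = parse_term("genscan(repeatmask(gi2177872))")
path = locator(target)
match = unify(parse_term("genscan(Seq)"), target)
```

Here `match` binds `Seq` to the nested `repeatmask` term, which is how a rule written for `genscan(Seq)` could discover the task it depends on.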
The task management engine is implemented in Prolog, and pipeline
specifications can use the Prolog code to provide full
programmability. Prolog is a declarative logic language and is
particularly suited to Makefile-style logic. However, the pipeline
programmer does not need to know Prolog in order to
construct or understand useful protocols.
The intention is to allow simple and concise specification of complex
pipelines. BioMake requires no object-oriented programming, and is not
tied to any particular language. We provide example customizable
compute pipelines which utilise standard bioinformatics analysis
programs such as BLAST, and infrastructure programs such as the
Apollo Bop parser, XSLT transforms and scripts using BioPerl.
Title: BioRuby + KEGG API + KEGG DAS = wiring knowledge for genome and pathway
We have developed BioRuby, a bioinformatics library for the Ruby
language, which enables users to write analysis pipelines easily. Here we show recent developments and how to
integrate BioRuby with the KEGG web services (API and DAS) to automate genome and pathway analysis procedures.
Note: KEGG API is a SOAP/WSDL-based web service providing gene and pathway information. KEGG DAS is a web
service providing genomic sequences and gene annotations via the DAS protocol. Both services are also developed
by us, and KEGG (Kyoto Encyclopedia of Genes and Genomes) is freely accessible at http://www.genome.ad.jp/kegg/
Project pages: BioRuby, KEGG API, KEGG DAS
License: LGPL
On behalf of the BioRuby project,
Toshiaki Katayama
Author: Levinson, Gene (NIH/NCI)
Title: caBIOperl: A new Perl API to the NCI's biomedical domain object middleware
A reality of the bioinformatics community, and one of its strengths, is its
diversity, including the range of programming languages that are utilized. However, this poses an accessibility problem
for federated web-based resources, unless the APIs and databases can be readily accessed by diverse software development
languages. The U.S. National Cancer Institute Center for Bioinformatics (NCICB) addresses this issue by providing a
diversified set of open-source application programming interfaces to its caCORE system. These interfaces, part of the
object-oriented middleware component known as caBIO, allow developers to write caCORE-powered applications using their
choice of a native Java API, a SOAP-XML API, or even a simple HTTP-XML interface.
Each of these APIs delivers the same data and conforms to the same domain object model.
Since caBIO was first released, Perl programmers have found it rather inconvenient to access the caCORE system because
(1) they have to package their search criteria in SOAP or HTTP format and send the request to the caCORE server via
the respective protocol; and (2) they have to parse the returned XML to extract the information they need. This has
proven burdensome. For this reason we undertook the development of a new Perl API, recently released and named caBIOperl.
caBIOperl is completely object-oriented. It provides an abstraction layer over SOAP and XML, so that Perl programmers
work with caBIO objects, much as a Java programmer does with the native caBIO Java API.
caBIOperl wraps the lower-level SOAP and DOM packages, and thus shields the developer from needing to understand SOAP
or parse the XML. The first public release came out in April, 2004, and provides query access to 32 caBIO objects,
including ClinicalTrialProtocol, Pathway, and Gene.
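The abstraction-layer idea can be sketched in a few lines. The example is written in Python purely for illustration and does not use caBIOperl or the caBIO services: the XML layout in `SAMPLE_RESPONSE`, the `Gene` fields and the `genes_from_xml` helper are all invented. It shows only the principle that the caller receives objects and never handles the XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Invented response format; the real caBIO XML is not reproduced here.
SAMPLE_RESPONSE = """<results>
  <Gene id="1"><symbol>TP53</symbol><title>tumor protein p53</title></Gene>
  <Gene id="2"><symbol>BRCA1</symbol><title>breast cancer 1</title></Gene>
</results>"""


@dataclass
class Gene:
    id: str
    symbol: str
    title: str


def genes_from_xml(xml_text):
    """Turn the XML payload into Gene objects so callers never see XML."""
    root = ET.fromstring(xml_text)
    return [
        Gene(
            id=node.get("id"),
            symbol=node.findtext("symbol"),
            title=node.findtext("title"),
        )
        for node in root.iter("Gene")
    ]


genes = genes_from_xml(SAMPLE_RESPONSE)
symbols = [gene.symbol for gene in genes]
```

A fuller wrapper would also build the request and send it over SOAP or HTTP; that part is omitted here.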
caBIOperl thus provides native Perl access that allows developers to customize queries according to the specialized needs
of their local investigative teams. caBIOperl modules can be downloaded from the caBIO section of the NCICB download site.
Author: Gessler, D., G. Schiltz, & L. Stein.
Title: Semantic MOBY as a World Wide Web architecture for bioinformatic interoperability
MOBY is an open source project for achieving interoperability in bioinformatics.
Research and development has proceeded along a dual-development track that consists of MOBY Services (with an emphasis on
SOAP technologies in a web services model) and Semantic MOBY (with an emphasis on RDF/OWL-DL in a semantic web model).
Semantic MOBY is designed specifically to operate in a nebulous and ever-changing world. In Semantic MOBY we identified
three problems that are hindering widely deployable, scalable interoperability, namely: i) the fatal mutability of
traditional interfaces (if a provider changes its interface, client code depending on that interface fails en masse);
ii) the rigidity and fragility of static classification schemes (changing the properties of a class near the root
of an inheritance hierarchy simultaneously affects the entire sub-tree); and iii) the confounding of structure and content
(content is entangled with the presentation layer and/or implicit behaviors of the presentation software).
Addressing these problems essentially recasts the problem of interoperability from being one of simply specifying a
syntax and messaging layer for syntactically connecting clients and providers via information in a registry look-up, to
being one of providing clients and providers a way to semantically describe their data and identify data relevant to
them. Our measure of success is to build an architecture that delivers: i) a common syntax; ii) shared semantics and a
mechanism for semantic negotiation; and iii) a discovery mechanism. This talk presents the Semantic MOBY architecture and
API and shows how this is accomplished.
Website: www.biomoby.org
Open Source License: Perl Artistic License
Title: BioJava
BioJava is a pure Java framework which is useful for developing a wide
range of bioinformatics software, from small research scripts to
complex interactive applications. It includes powerful object models
for handling sequence and other kinds of biological data, and tools for
integrating and querying this information. It also provides a solid
foundation for developing novel analysis methods. General-purpose
implementations of techniques such as Hidden Markov Models and support
vector machines are included in the package.
BioJava was first released over four years ago. It is now an
established project and is widely used and supported around the world.
Significant improvements in the past year include the addition of a
data model for 3D structure information, better database support, and
improvements that make BioJava more powerful in a distributed computing
environment.
I will be talking about the status of the BioJava project and the kind
of problems for which it has proven useful, discussing its future
directions, and considering the issues involved in maintaining a large
software library.
URL: http://www.biojava.org/
Licence: LGPL
Title: Applying software validation techniques to Bioperl
With computer software playing an increasingly pervasive role in
society, the risks associated with software failures have begun
receiving more attention. Infamous examples of such software failures
include the loss of the Mars Climate Orbiter (a victim of a metric vs.
imperial unit conversion error) and the fatal overdoses administered by
the Therac-25 medical accelerator (caused by software faults including a race condition and a counter overflow). Even
when not catastrophic, software failure can be extremely costly: the US
Commerce Department's National Institute of Standards and Technology
(NIST) estimated in 2002 that poor-quality software costs US businesses
nearly $60 billion per year.
Concern about the costs and other risks of software failure has led to
increasing interest in 'software validation'. The US FDA defines
software validation as "confirmation by examination and provision of
objective evidence that software specifications conform to user needs
and intended uses, and that the particular requirements implemented
through software can be consistently fulfilled." In the commercial
world, this process of examination and evidence gathering tends to be
specified by formal procedures (e.g., TQM and ISO 9001) applied in the
context of formal software development methodologies.
In the open source world, collaborative development makes formal
procedures hard to apply. Instead, open source projects rely on "many
eyes mak[ing] all bugs shallow" (Eric S. Raymond). Unfortunately,
however, in a large project like Bioperl, not all components are used
equally frequently, and thus not every component is examined equally
thoroughly or often.
In order to remedy these shortcomings of the open source development
process, a systematic approach is needed. The existing code, tests and
documentation must be examined from the point of view of validation,
allowing us to bridge the gap between cooperative development (open
source), and the more formal, contractual space of commercial
development.
We have established a validation process and applied it to Bioperl. The
resulting validation framework has been developed in such a way that it
can be applied readily to other open source projects (e.g. Biojava).
The validation process, including documentation, Bioperl code changes
and novel test code developed will be described, as well as the overall
quality, reliability and usability improvements that result. We aim to
demonstrate how validation of Bioperl significantly increases its value
for all stakeholders.
LICENSING:
The Bioperl project addressed in the talk is licensed under the Perl
Artistic License, an accepted open source license according to the Open
Source Initiative. The work performed by Electric Genetics, as
described in the talk, results in two outcomes:
1) ongoing contributions to the Bioperl suite, including improved
error handling, bug fixes and code additions. These all fall under the
Perl Artistic License and will form significant contributions to the
open source project.
2) commercial documentation and validation suite, offered to clients
as a commercial product. The documentation will be provided to paying
clients on a commercial basis and, thus, will not be immediately placed
in the Bioperl repository. The validation suite will be retained by
Electric Genetics and validation services offered to clients. If a
client wishes to purchase the validation suite, it will be licensed
using a commercial license.
The business and licensing model we describe is similar to that of e.g.
Novell, who offer both commercial products (e.g. the Linux admin product
Red Carpet) as well as ongoing contributions to open source projects.
PROJECT URL:
http://www.egenetics.com/opensource.html
Title: EMBOSS: The European Molecular Biology Open Software Suite
EMBOSS started as an open source sequence analysis package and now
extends into protein structure, phylogenetics and other areas. A key
feature is the ease of integrating EMBOSS into other interfaces (web, GUI, SOAP, workflows, etc.).
URL: http://www.emboss.org/
Licence: GPL (and LGPL for the libraries and for associated packages)
Title: The NCBI C++ Software Development
The NCBI is the host and developer of the world's largest bioinformatics projects.
As such, it has developed an extensive, powerful, documented and freely available bioinformatics programming platform
that contains a rich and robust set of functionalities designed to handle the intrinsic complexities of biology.
The NCBI C++ toolkit provides portable application framework classes for argument processing, diagnostics, exceptions,
connection streams, stream wrappers and threads. The C++ code generator tool transforms ASN.1 data specifications
into a ready-to-use, error-free set of C++ classes and functions to liberate the programmer from writing class variable
methods while providing garbage collection and object serialization to ASN.1/XML. An object manager facilitates
heterogeneous access to biological sequence data for annotation and display. Moreover, the toolkit offers excellent
support for database independent projects and complex CGI applications. This talk will provide a high-level overview
of the features and tools available in the NCBI C++ toolkit that enable computational investigations in biology
by third-party developers.
URL: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/
Title: Life Sciences Identifiers. Finally?
Life Sciences Identifiers (LSIDs) are persistent,
location-independent, resource identifiers for uniquely naming
biologically significant resources including but not limited to
individual genes or proteins, or data objects that encode information
about them.
Their specification defines not only their syntax but also a
set of middleware-independent interfaces for resolving the
identifiers and for accessing their associated metadata (such as
annotations).
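As an illustration of the syntax, the following Python sketch splits an identifier into its parts. It assumes the colon-separated URN layout `urn:lsid:authority:namespace:object` with an optional trailing revision; the linked specification remains the authoritative definition. The `LSID` tuple, the `parse_lsid` helper and the example identifier are hypothetical.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    prefix = [part.lower() for part in parts[:2]]
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Made-up identifier on a reserved example domain.
example = parse_lsid("urn:lsid:example.org:genes:12345:2")
```

Parsing is only the syntactic half; resolving an identifier to its data or metadata goes through the interfaces that the specification defines.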
The LSID Assigning service is responsible for creation of LSIDs for
given data entities.
URL:
http://www.omg.org/cgi-bin/doc?lifesci/03-12-02
http://www-124.ibm.com/developerworks/oss/lsid/
Title: The PSI MI standard - open analysis of protein interaction data
The HUPO PSI protein interaction work group has jointly developed an XML
standard for the representation of protein interaction data, the PSI MI
format. PSI MI data is now available from major interaction data
providers, including DIP, MINT, and IntAct. Based on the PSI MI
standard, database and analysis tools from different providers can be
joined to efficiently analyse and manipulate protein interaction data.
We will present IntAct, an open source protein interaction database
and analysis tool which provides extensive PSI MI support. The web
interface provides both textual and graphical representations of protein
interactions, and allows exploring interaction networks in the context
of the GO annotations of the interacting proteins. IntAct is Java-based,
with Jakarta OJB object-relational mapping to Postgres or Oracle. PSI MI
upload and download are possible as well as dynamic access to
interaction networks by a web service or search URL. Direct URL
access allows PSI MI data to be retrieved and analysed further in the
open source tools ProViz and Cytoscape. These, in turn, provide a choice
of fast network visualisation algorithms, integration with expression
data, path finding and clustering in interaction networks.
Project URLs:
http://psidev.sf.net
http://intact.sf.net
http://www.cytoscape.org
Title: GMOD: The Generic Model Organism Database Project
The Generic Model Organism Database (GMOD) Project is an open source
project to develop a complete set of software for creating and
administering a model organism database. Components of this project
include genome visualization and editing tools, literature curation
tools, a robust database schema, biological ontology tools, and a set
of standard operating procedures. This project is funded by the NIH
and the USDA Agricultural Research Service, with participation from
members of several database projects, including WormBase, FlyBase,
Mouse Genome Informatics, Gramene, the Rat Genome Database, TAIR,
EcoCyc, and the Saccharomyces Genome Database.
Released modules include Chado, a flexible modular relational schema
for genome information; Apollo, a genome feature editor and curator's
tool; GBrowse, a flexible web-based genome browser; Textpresso, a
paper indexing and search tool; the PubSearch/PubFetch literature
curation tools; and Caryoscope, a gene expression visualization
tool. Over the next year we will be releasing more components,
ultimately creating a model organism database construction set.
This talk will survey the released and pending GMOD tools, and
describe how they can be used for a variety of large and small
projects. The project URL is http://www.gmod.org
GMOD is released under a variety of Open Source licenses, primarily
the Perl Artistic License and GNU GPL.
Title: Ensembl - a portable Genome toolkit
Ensembl is a genome information system designed for handling large
genomes, in particular human, mouse and other vertebrates. Its major code
bases can be broken down into three sections: a core relational schema and
API, a computational pipeline system and a user-friendly web site. The
Ensembl system has been designed principally to enable biologists to use
vertebrate genomes, but the source code of Ensembl is open source and
there has been increasing modularisation and clean-up of the system. This
means that Ensembl software has become increasingly useful as a toolkit
in itself for other genomes: we currently know of at least 8 genomes that
have been loaded and displayed using the Ensembl software outside of the
main Ensembl group.
I will present the aspects of Ensembl which are most open to reuse, in
particular how to load a new genome into Ensembl from existing
flat file annotation and run it, and give a sense of how to extend Ensembl, either using the
configurable DAS protocol or via schema additions. I will also briefly
outline the main concepts behind the pipeline.
License: BSD-style.
Title: GUS - A Functional Genomics Infrastructure System
The Genomics Unified Schema (GUS) is a functional genomics
infrastructure system in use at about 20 projects across approximately a
dozen institutions. GUS was developed at the Computational Biology and
Informatics Lab (CBIL) as the infrastructure for PlasmoDB, EPConDB and AllGenes. Over the last year we
have packaged GUS for distribution and moved its development to open
source, which has resulted in an active user and development community.
GUS includes a relational schema with more than 400 tables and views
covering approximately 50 functional genomics concepts. The schema is
organized into five name spaces. DoTS covers the central dogma (genes,
RNAs, proteins); sequence and features; reagents, including clones,
mapping and gene traps. RAD covers microarray experiments in a
MIAME-compliant representation. TESS covers transcription region
regulation. SRes covers controlled vocabularies, including about a dozen
standards-based vocabularies and ontologies. Finally, Core covers
non-biological concepts used to track users and data.
Upcoming schema expansion includes additional technologies (2-D gel and
mass spectrometry, in situ hybridizations) that will make use of common
experimental design and sample tables currently residing in the RAD
schema. We plan to work with emerging standards efforts for these
domains paralleling our involvement in the MGED effort for microarray
experiment information.
GUS also provides an application framework that includes a Perl and Java
object-relational layer; a Data Load API; many "plugins" to load
standard data sources; a Pipeline API to specify analysis protocols; and
a Web Development Kit (WDK). The WDK assists in the development of
data-mining oriented websites such as PlasmoDB. It provides a servlet
framework, a declarative format to specify queries, results and records,
page layout, many sample queries and query result caching. The next
generation WDK is under development in collaboration with the GeneDB project at the Pathogen
Sequencing Unit of the Sanger Center, and uses a Struts and JSP based
model-view-controller design.
GUS runs under Linux, Tomcat and Oracle. PostgreSQL compatibility is
near completion. The source is freely available.
Homepage: www.gusdb.org
Title: The Otter Annotation System
The VEGA database presents high quality manual annotation of finished vertebrate
genomes. Until recently the finished clones that constitute the tiling path of the chromosome were annotated individually. Tags in the data objects that represented parts of RNA transcripts that span several clones were used to describe how
they should be fused. Fusing occurred during a conversion process that created an Ensembl database containing the
complete gene structures.
The otter project was developed in order to present the annotator with a view of a contiguous region of a chromosome made
from several clones, and to avoid the conversion step by storing the annotation directly in an Ensembl database.
The gene annotation data is passed between the annotation client and Ensembl database server in an XML format. The XML
contains the clone assembly information along with the gene structure data. It is hoped that the XML format will be
adopted as an exchange format by other centers who wish to display their annotation in VEGA.
The otter schema is an extension of the Ensembl database SQL schema. Additional tables store textual information about
transcripts, genes and clones added by the annotator, implement a clone level locking mechanism, and keep track of the
authors of particular annotations. These are accompanied by corresponding additions to the Ensembl Perl API. A
lightweight HTTP server written in Perl, otter_srv, exchanges XML with the client and saves the annotator's changes
to the MySQL otter database in a single transaction.
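Clone-level locking can be pictured with a small in-memory Python sketch. It is not the otter implementation, which stores its locks in database tables: the `CloneLockTable` class and its methods are hypothetical, and the all-or-nothing behaviour for a region spanning several clones is an assumption made for this example.

```python
class CloneLockTable:
    """In-memory stand-in for a clone-level lock table (hypothetical)."""

    def __init__(self):
        self._owner = {}  # clone name -> author holding the lock

    def lock(self, clones, author):
        """Lock every clone for `author`, or none if any is held by someone else."""
        blocked = [c for c in clones if self._owner.get(c, author) != author]
        if blocked:
            raise RuntimeError(f"clones already locked: {blocked}")
        for clone in clones:
            self._owner[clone] = author

    def unlock(self, clones, author):
        """Release the locks that `author` holds on `clones`."""
        for clone in clones:
            if self._owner.get(clone) == author:
                del self._owner[clone]
```

Checking every clone before taking any lock means that a second annotator who requests an overlapping region is refused outright and is not left holding part of it.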
The annotators' graphical interface, otterlace, now incorporates a number of improvements, such as the display
of gapped alignments of sequence database hits to the genomic sequence.
The core otter software is available, under the same licence as Ensembl, by anonymous CVS (package ensemblotter) from cvs.sanger.ac.uk, where it will be joined by the otterlace client
software. It is anticipated that a packaged distribution will also be created. The code is already in use by
some of our collaborators outside the Sanger Institute.