BOSC 2003 Bioinformatics Open Source Conference

BOSC 2003 Talk Abstracts

27 June, 9:20 - 9:50

BioJava Turns 1.3

Mark Schreiber, Thomas Down, David Huen, Keith James, Matthew Pocock

At the time of writing, the BioJava project, now in its fifth year, is preparing its 1.3 release. This release represents a considerable step forward in the stability, usability and documentation of the core APIs. The project has close to 1000 Java classes contributed by 38 authors. BioJava is distributed under the LGPL license. The improvement in usability has been brought about by the widespread introduction of convenience methods to perform common tasks. A cookbook-style project called BioJava in Anger is available on the web (bioconf.otago.ac.nz/biojava) to provide recipes for common tasks, and continued improvement of the javadocs has helped to reduce the learning curve.

BioJava handles many common bioinformatics tasks such as file parsing, sequence manipulation, sequence statistics and analysis, dynamic programming, and HMMs. BioJava also supports many "Enterprise" level bioinformatics tasks with support for the OBDA specifications, such as BioSQL, DAS, remote data sources and Ensembl bindings. There is also robust serialization and distributed programming support, and recent testing indicates that BioJava 1.3 is compliant with J2EE technologies, which should enable the rapid development of highly scalable BioJava-based bioinformatics solutions.

Homepage: http://www.biojava.org/
27 June, 9:50 - 10:20

BioRuby project and the KEGG API

Toshiaki Katayama, Naohisa Goto, Mitsuteru C. Nakao, Shuichi Kawashima, Yoko Sato, Minoru Kanehisa
Bioinformatics center, Kyoto University, Japan

BioRuby is a class library for bioinformatics written in Ruby, an object-oriented scripting language. The Ruby language, born in Japan, is now gaining popularity around the world after ten years of development. We believe Ruby is one of the easiest languages for object-oriented programming because of its clean syntax, which lets beginners start using Ruby very quickly without first mastering arcane incantations. Furthermore, Ruby is also powerful enough for various bioinformatics tasks thanks to its flexible data structures, its ability to manipulate strings with Perl-like regular expressions, and its extensibility through external C libraries. Thus, BioRuby should suit both bioinformatics beginners in the wet lab and hard-core developers alike.

BioRuby has been developed over the past two years and is compliant with the OBDA (Open Bio Database Access) specifications for accessing sequence databases, in cooperation with the other established Open Bio* projects, including BioJava, BioPerl and BioPython. Beyond the OBDA, BioRuby can handle biological sequences, parse and index over 20 flat-file database formats including the KEGG databases, execute local and/or remote blast/fasta/hmmer searches, and access DAS-based genome databases for sequence annotations and the PubMed database for reference information.

Recently, we have added a new feature called the KEGG API, which provides a valuable means of accessing the KEGG system at the GenomeNet in Japan. The KEGG API is a SOAP-based web service for searching and computing biochemical pathways in cellular processes and for analyzing the universe of genes in the completely sequenced genomes. The BioRuby interface to the KEGG API enables users to easily write customized procedures for automated analyses of the pathways and/or the gene universe.

Homepage: http://www.genome.ad.jp/kegg/soap/
27 June, 10:40 - 11:10

BioPerl in 2003: a user's perspective

Neil Saunders
School of Biotechnology and Biomolecular Sciences, The University of New South Wales

The BioPerl project, officially organised in 1995, has developed into a mature collection of Perl modules for building solutions to bioinformatics problems. BioPerl now contains a large number of modules that enable researchers to perform many fundamental tasks in bioinformatics analysis. These include: accessing sequence data from remote or local databases, creating or converting sequence files in various formats, parsing the output from software packages (such as BLAST), manipulating biological data (e.g. sequence alignments, phylogenetic trees, structure files), running external programs and even querying bibliographic databases. The BioPerl community is an open, active group and welcomes contributions from developers and users.

The first part of this talk is an overview of the BioPerl project and includes a brief look at recent developments. In the second part, a number of real examples will be discussed to illustrate how BioPerl is being used in a microbial genomics research group.

Homepage:
http://www.bioperl.org/
27 June, 11:10 - 11:40

Persistent Bioperl

Hilmar Lapp
Genomics Institute of the Novartis Research Foundation

Bioperl-db adds transparent database persistence to the Bioperl object model. The package gives the client the ability to turn a given Bioperl object into a so-called persistent object, which speaks the same APIs as the original Bioperl object but, in addition, is able to handle persistence operations such as insert, update, and delete.

This enables the programmer to manipulate objects after they have been retrieved from or inserted into the database, and then update with a single function call the respective rows in the database to reflect those changes. As a practical example, one can retrieve a sequence object by accession from the underlying database, then programmatically or interactively manipulate the annotation for that sequence object by adding, changing, or removing features, database cross-references, and tag/value pairs. Subsequently one can update the database to reflect the changed annotation by asking the sequence object to update itself, which will cascade down to its annotation.
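The wrapping idea described above is straightforward to sketch in any object-oriented language. The Python toy below is purely illustrative (all class names are hypothetical; Bioperl-db itself is Perl and its real API differs): the persistent object forwards the original object's API unchanged and turns accumulated annotation changes into a single update call.

```python
class Seq:
    """A plain in-memory sequence object (stand-in for a Bioperl Seq)."""
    def __init__(self, accession):
        self.accession = accession
        self.features = []

class FakeAdaptor:
    """Records the persistence calls a real database adaptor would run."""
    def __init__(self):
        self.log = []
    def update(self, obj):
        self.log.append(("update", obj.accession))

class PersistentSeq:
    """Wraps a sequence object, keeping its API while adding persistence."""
    def __init__(self, seq, adaptor):
        self._seq = seq          # the original, unmodified object
        self._adaptor = adaptor  # knows how to talk to the schema
        self._dirty = False

    def __getattr__(self, name):
        # Anything not defined here is delegated to the wrapped object,
        # so the persistent object "speaks the same API" as the original.
        return getattr(self._seq, name)

    def add_feature(self, feature):
        self._seq.features.append(feature)
        self._dirty = True       # remember that the DB row is now stale

    def store(self):
        # One call cascades the changed annotation down to the database.
        if self._dirty:
            self._adaptor.update(self._seq)
            self._dirty = False

adaptor = FakeAdaptor()
pseq = PersistentSeq(Seq("P12345"), adaptor)
pseq.add_feature("signal peptide")
pseq.store()
```

The delegation step is what makes the persistence transparent: existing code that expects the plain object keeps working on the wrapped one.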

The actual persistence operations are presently implemented with bindings to a Biosql schema served by either MySQL, PostgreSQL, or Oracle. One aim of the design was to minimize the effort to write bindings to a different schema. It is a future goal to provide bindings to schemas that differ from Biosql in their relational model, like Chado or Ensembl.

In summary, Bioperl-db together with Biosql adds easy-to-use, transparent, and freely available persistence to one of the most important, if not the most important, Perl packages in the life sciences, complete with schema and database loading scripts. In my talk I will present an overview of the functionality of the package, demonstrate some simple use cases, and conclude with the current status and future directions. I will also touch on the utility of Biosql as an underlying open source schema that facilitates interoperability between the Bio* projects.

Homepage:
http://www.bioperl.org/
27 June, 11:40 - 12:10

Using Biopython for Laboratory Analysis Pipelines

Brad Chapman
The Plant Genome Mapping Laboratory, University of Georgia

The Biopython project is a distributed, collaborative effort to develop Python libraries that address the needs of researchers doing bioinformatics work. Python is an interpreted, object-oriented programming language that we feel is well suited for both beginning and advanced computational researchers. Biopython has been around since 1999 and has a number of active contributors and users who continue its regular development.

One major problem in bioinformatics work is developing analysis pipelines that combine data from a number of different sources. Answering advanced scientific questions requires information from many disparate sources such as web pages, flat text files and relational databases. Additionally, these sources of information will often be found in different, incompatible formats. The challenge for many researchers and software developers is to organize this information so that it can be readily queried and examined. This problem is made even more difficult by the varied and rapidly changing interests of the scientists who want to ask questions with the data.

Rather than trying to build specific applications to address these data manipulation problems, Biopython has focused on developing library functionality to manipulate various data sources. This frees a researcher from having to deal with the low-level details of parsing and data acquisition, helping to abstract the process of data conversion. Additionally, since the lower-level data manipulation code is shared amongst multiple researchers, data format changes or problems with the code are more readily identified and fixed.

This talk will focus on using the Biopython libraries to develop analysis pipelines for scientific research. In addition to demonstrating the uses of Biopython, it will highlight some areas where Biopython offers unique solutions to data manipulation problems. We will identify some of the common challenges the libraries have to deal with, such as attempts to standardize output from multiple programs that perform a similar function, and describe our attempts to deal with these difficulties. This will provide a foundation for understanding both the Biopython libraries and the development process underlying them.
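As a flavour of the shared low-level functionality such pipelines rest on, here is a generic FASTA-parsing sketch in pure Python (illustrative only, not the actual Biopython interface): because the parsing lives in one shared routine, a format change is fixed once for every pipeline that uses it.

```python
def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines.

    The identifier is the first word of the header line; sequence
    lines belonging to one record are concatenated.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

records = dict(parse_fasta([">seq1 test record", "ACGT", "TTGA",
                            ">seq2", "GGCC"]))
```

A pipeline built on such a routine only ever sees (identifier, sequence) pairs, regardless of where the raw lines came from.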

Homepage:
http://www.biopython.org/
27 June, 13:30 - 14:00

MoLabIS - A lab's backbone for storing, managing, and evaluating molecular genetics data

Eildert Groeneveld, Ralf Fischer and Špela Malovrh
Institute for Animal Science, Federal Agricultural Research Center, Mariensee, Höltystr. 10, D-31535 Neustadt, Germany

With the increased use of molecular genotyping, sample and data management has become a major issue in molecular genetics labs. This includes the description of projects, the management of tissue and DNA samples, and the data produced by sequencers. Joint analysis of data from disparate sources implies format conversions which often have to be done manually. MoLabIS is an attempt to address these issues in a generic way by storing primary data from disparate sources in one standard format.

MoLabIS is based on the APIIS (adaptable platform-independent information system) framework, which was developed for the rapid implementation of animal recording systems in agriculture. It uses exclusively Open Source software and is centered around an SQL-92 database, usually PostgreSQL. Client-server applications are created with Perl and Perl/Tk, and all applications are also available in a browser version. While project data can be added via a GUI, molecular genetic data are loaded without human intervention from predefined directories where the sequencers drop them off. All data manipulations are done using SQL. MoLabIS and APIIS will be released under the GPL.


27 June, 14:00 - 14:15

An Open Source Small Laboratory Information System (SLIMS)

Anton Bergheim
Department of Computer Science, University of the Witwatersrand, South Africa. Tel: +27 11 717 6178

There exists within the scientific community of the developing world a real need for an open source small laboratory information system. As the economic realities of science in the developing world do not permit the purchase of sophisticated commercial systems, most smaller laboratories are relegated to lab books and, at best, primitive databases. The lack of financial resources also means that these labs tend to do less "big science", preferring instead to perform a larger number of smaller, more specialized experiments.

Although a large amount of work has been done to address this issue, most of it seems to cater to larger laboratories performing high-throughput tasks (LIMaS for microarrays, a system from the Weizmann Institute of Science for DNA sequencing, and SNPSnapper for SNP genotypes, amongst others). While these systems are applicable, they seem better suited to laboratories that emphasize high-throughput processes with less variation in the techniques performed.

The need for a small laboratory LIMS system is one that has been recognized previously (the Open LIMS Project, the BioJava LIMS system and Gnosis being prime examples). Rather than reinvent the wheel, SLIMS will be designed to take advantage of the experiences and code (when available) of all these previous projects.

The challenge is to design a LIMS that is both powerful and flexible while working on sparse resources with a minimal amount of training. Any LIMS used in this environment has to be designed so that it allows the user to add many different types of experimental procedures rapidly and easily. To do this, a workflow system has to be devised, or an existing one used. In this sense, the SLIMS project most closely resembles the BioJava LIMS system (SLIMS plans to utilize much of the BioJava LIMS system as well as extending it).

Preliminary studies have begun assessing user requirements. These requirements dictate a system which allows for ease of use, ease of installation, maximum data security, and extensive data, project and people tracking. The system will have to integrate automatically with existing machinery (a gel documentation system or an automated sequencer, for example). Beyond storing data and generating workflows, the SLIMS system has to allow for extensive data querying. Typical user queries are of the sort "show me all information about sample x" or "show me all experiments done by person y". Some data types mentioned include sequencing gel pictures, DNA sequences, agarose gel pictures and attached annotation, and autoradiographs (SSCP analysis, Southern, Western and Northern blots and DNA library probing), as well as project and people information.

One possible architecture that seems to cater for most user requirements uses a Java-based GUI front end with a relational database behind it. The system will have to be designed so that it can work on a stand-alone PC as well as on a computer network. The first version of SLIMS will effectively be a data store-and-show application controlled through workflows. Later versions will add functionality allowing for automated integration with other laboratory machinery, as well as more sophisticated data and project tracking and tight integration with bioinformatic analysis tools.

As any LIMS system is a large undertaking, it is the hope of this author that a collaboration can be formed to develop this emerging system such that it can become an invaluable tool for smaller labs throughout the developing world.


27 June, 14:15 - 14:45

FlyMine

Andrew Varley, FlyMine project, Cambridge University, UK

The FlyMine project is an open-source project to build an integrated database of genomic, expression and protein data for Drosophila and Anopheles. We aim to provide a powerful and flexible query system, with the data available for arbitrary queries via a web interface and a programming API.

The database itself is an object database built on top of PostgreSQL using the Apache OJB object/relational mapping tool, modified heavily in order to allow proper object-based queries, either using OQL or the FlyMine Query API (Java). At the underlying SQL level, the data in the tables are redundantly stored in a collection of "Precomputed tables" -- tables that are materialised views of one or more master tables. All incoming queries are automatically analysed to see if any combinations of these precomputed tables can be used to shorten the response time. This approach results in a substantial speed increase for many queries. This SQL re-writing module can be used independently of the FlyMine project to improve access to read-only SQL databases.
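The precomputed-table idea can be illustrated with a toy sketch (Python, all names hypothetical; FlyMine's actual rewriter works on SQL parse trees): given the set of master tables a query joins, substitute the largest materialised view that covers a subset of them and join only the leftover tables.

```python
def rewrite(query_tables, precomputed):
    """Pick the largest precomputed view that is a subset of the query.

    `query_tables` is the set of master tables a query joins;
    `precomputed` maps a view name to the set of tables it materialises.
    Returns (tables still to be joined, chosen view or None).
    """
    best, best_name = set(), None
    for name, tables in precomputed.items():
        # Only views whose tables all appear in the query are usable.
        if tables <= query_tables and len(tables) > len(best):
            best, best_name = tables, name
    return query_tables - best, best_name

views = {"gene_x_protein": {"gene", "protein"},
         "gene_x_protein_x_expr": {"gene", "protein", "expression"}}
remaining, view = rewrite({"gene", "protein", "expression", "pathway"},
                          views)
```

Replacing a three-table join with one precomputed view, as here, is where the response-time gain comes from; the real system must also check that the view's columns and filters match.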

Remote bioinformatics users will be able to access the data using the same query API over SOAP/HTTPS to the main FlyMine servers. The data model is specified as a UML diagram, which is used to automatically generate all model-specific parts of the system: therefore the FlyMine project can easily be applied to other domains. We will also provide a graphical object query tool to make it easy for non-programmers to formulate complex arbitrary queries against the data model.

The source code will be released under an LGPL licence around the time of BOSC and will be available on a public CVS server.

Homepage:
http://www.flymine.org/
27 June, 14:45 - 15:15

Python and the Systems Biology extension module for inferring gene regulatory networks from time-course gene expression data

Michiel de Hoon, Seiya Imoto, Satoru Miyano
Laboratory of DNA Information Analysis, Human Genome Center Institute of Medical Science, University of Tokyo 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

Scripting languages such as Perl and Python are commonly used in bioinformatics for database access, file parsing, and sequence manipulation. Python together with Numerical Python is also very suitable for analyzing numerical data, such as gene expression data produced in cDNA microarray experiments.

Inferring gene regulatory networks from gene expression data is an important topic in bioinformatics. Recently, dynamic Bayesian network models have been used to infer gene regulatory relations from time-course gene expression data. Software tools for dynamic Bayesian network calculations are in most cases proprietary.

We have developed the Systems Biology extension module for Python, consisting of fast-running C routines to fit noisy dynamical system models (a generalization of dynamic Bayesian networks) to time-course gene expression data. The routines allow for missing data values and can handle different time intervals between measurements; several statistical criteria are available to determine the number of transcription factors for each gene. For visualization, we made use of the Pygist scientific plotting package, which was recently ported to Windows and Mac OS X.
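As a rough illustration of the kind of fit involved, the sketch below uses NumPy to estimate an interaction matrix A in the simplest linear model x[t+1] = A x[t] by least squares (illustrative only: the actual module uses C routines and, unlike this sketch, handles noise, missing values and irregular sampling intervals).

```python
import numpy as np

def fit_linear_system(X):
    """Estimate A in x[t+1] = A @ x[t] + noise by least squares.

    X holds one row per gene and one column per time point; this is
    the simplest instance of fitting a dynamical system model to
    time-course expression data.
    """
    X_now, X_next = X[:, :-1], X[:, 1:]
    # Solve X_next ~= A @ X_now, i.e. X_now.T @ A.T ~= X_next.T
    A_T, _, _, _ = np.linalg.lstsq(X_now.T, X_next.T, rcond=None)
    return A_T.T

# Synthetic check: recover a known interaction matrix from noiseless data.
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
X = np.empty((2, 20))
X[:, 0] = [1.0, 1.0]
for t in range(19):
    X[:, t + 1] = A_true @ X[:, t]
A_est = fit_linear_system(X)
```

A nonzero off-diagonal entry of the estimated A suggests a regulatory influence of one gene on another, which is the quantity of biological interest.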

Using this extension module, we were able to generate a highly significant validation of gene regulation by transcription factors in Bacillus subtilis from time-course gene expression data. We also predicted which sigma factors regulate the transcription of the sigY and sigV genes in Bacillus subtilis, whose regulation is currently not well understood.

The Systems Biology extension module makes use of the GNU Scientific Library, and was itself released under the GNU General Public License. It is available at http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/python.

Homepage: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/python/systems.html
28 June, 9:00 - 9:15

The Eukaryotic Linear Motif Resource

Rune Linding, EMBL

ELM is a resource for predicting functional sites (described by linear motifs) in eukaryotic proteins. Putative functional sites are identified by conventional methods, such as patterns (regular expressions) or HMMs. To improve the predictive power, context-based rules and logical filters will be developed and applied to reduce the number of false positives.

The current version of the ELM server provides basic functionality including filtering by cell compartment and globular domain clash (using the SMART/Pfam databases). The current set of motifs is not exhaustive. The ELM resource will be regularly enhanced through 2003.

Homepage:
http://ELM.eu.org/
28 June, 9:15 - 9:30

Exploring protein sequences for globularity and disorder

Rune Linding, EMBL

A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder.

Homepage: http://globplot.embl.de/
28 June, 9:30 - 9:45

An online database of C-type Lectin Like Domain-containing sequences

Alex Zelensky, Jill Gready
Computational Proteomics and Therapy Design Group, Division of Molecular Bioscience, John Curtin School of Medical Research, Australian National University

While analyzing a family of C-type Lectin-like Domain (CTLD)-containing proteins (CTLD-cps), we realized that the huge volume of existing sequence and literature information (>1500 GenPept entries, thousands of Medline-indexed publications) requires more robust data management tools than the office software commonly used by biologists. To suit our needs we have developed a relational database, and a web interface to it, which allow storage, integration, classification and expert annotation of various kinds of biological information related to the protein family we are interested in. The product can be broadly classified as a biological content-management system, and focuses on providing high-quality biological information and a collaborative environment for its annotation.

A web interface was developed to manage MySQL-based sequence and annotation databases. Bioperl objects are used to handle rich sequence information, which is stored in a BioSQL-schema-based DB, while homology relationships, custom annotations and classifications, as well as web user information, are stored in a second database. DB accessor and controller Perl modules were developed for objects that are not available through BioPerl. The structure of the annotation database is phylogeny-focused and contains three principal tiers: product (a translated gene product, or part of one, whose sequence was deposited in a protein sequence database), gene locus, and GOG (group of orthologous genes from different species). The sequence collection was developed from the ground up, starting with a selection of GenPept entries containing CTLDs. After clustering redundant entries, homologous protein sequence DB entries were classified as paralogues, orthologues or alternative splicing products based on sequence similarity, genomic references and literature data. This simple bookkeeping phase allowed us to update and extend the existing CTLD-cp classifications (Drickamer 1996; Drickamer 2002) and provided a basis for further analyses, e.g. for studying the alternative splicing events that are common in the CTLD family. Next, we started comparing the resulting catalogue of CTLD-cps to sequenced genomes (human and Fugu at the moment), which allowed us to see how well the family has been studied and to find new members of established CTLD-cp groups as well as previously unknown classes of CTLD-cps.

After the initial content creation and QA checks of both content and interface are finished, the database will be made public and the CTLD community will be invited to use and extend it. Also, though at the moment the database and software are tailored to a particular set of proteins, they can be developed into a more universal biological content-management system, which could be used by other laboratories for managing and sharing their expertise in other protein families. Source code and the schema will be readily provided to anyone interested.

Drickamer K, Fadden AJ. Genomic analysis of C-type lectins. Biochem Soc Symp. 2002;(69):59-72. PMID: 12655774
Drickamer K. Evolution of Ca(2+)-dependent animal lectins. Prog Nucleic Acid Res Mol Biol. 1993;45:207-32. PMID: 8341801


28 June, 9:45 - 10:15

Transition to Web Services in bioinformatics (driven by use cases)

Martin Senger, EBI

Web Services is a technology applicable to computationally distributed problems, which include access to large databases. The technology is capable of dealing with heterogeneous environments, but it is not too different from its predecessors, such as CORBA. The article presents the main differences between these middleware approaches (firewall-friendliness, user sessions, market forces, object views) and shows two concrete use cases of Web Services deployed at EMBL-EBI.

Web Services

Web Services are ubiquitously connected with XML. The universality of XML makes almost anything built on it an attractive way to communicate information between programs, and that is where Web Services are focused: it is a technology for distributed computing, connecting the many resources available on the Internet and making them work together.

The role of XML is clearly dominant. The other parts, SOAP and HTTP, are not mandatory, but they are used in most cases. SOAP specifies how to encode various data types in XML documents and how to exchange such documents in a decentralised, distributed environment. HTTP then carries the encoded messages between Internet sites, and its presence accounts for the firewall-friendliness of Web Services.
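To make the encoding step concrete, the following Python sketch builds a minimal SOAP 1.1 request envelope by hand (the method name, namespace and parameters are made up; real clients use a SOAP toolkit and send the result over HTTP rather than assembling strings).

```python
from xml.sax.saxutils import escape

def soap_envelope(method, namespace, params):
    """Build a minimal SOAP 1.1 request envelope for `method`.

    `params` maps argument names to values; a real toolkit would
    also handle typing, faults and the HTTP transport.
    """
    args = "".join("<{0}>{1}</{0}>".format(k, escape(str(v)))
                   for k, v in params.items())
    body = '<m:{0} xmlns:m="{1}">{2}</m:{0}>'.format(method, namespace, args)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<soap:Envelope'
            ' xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
            '<soap:Body>' + body + '</soap:Body></soap:Envelope>')

request = soap_envelope("getEntry", "http://example.org/demo",
                        {"id": "eco:b0002"})
```

Because the payload is plain XML carried over an HTTP POST, it passes through firewalls that would block other middleware protocols, which is the point made above.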

An important, if not the most important, feature of Web Services is the language for defining and describing their interfaces: WSDL. WSDL makes it possible to separate the description of the abstract functionality offered by a service from concrete details such as the service's location and access protocol.

Web Services versus CORBA

In previous years, many investments, in money as well as human resources and skills, were made in CORBA, COM, RMI, and other technologies for distributed computing. There is therefore a legitimate question of what could be gained by using a new middleware in current and new bioinformatics projects. An executive-summary answer would be that CORBA has proved itself an efficient, robust and highly interoperable solution ideally suited to intranets, while Web Services, by design, are very suitable for use over the Internet. Last but not least, the two technologies can co-exist in a layered architecture, so one can use the strengths of both.

A more detailed answer is that both CORBA and Web Services are software-component technologies designed to be used by programmers in the first place, and both provide connectivity in a distributed architecture. The two technologies provide very similar features, and they do so with roughly the same effort from the programmer's point of view. The more visible differences are the firewall-friendliness, user sessions, market forces, and object views mentioned above.

The use cases implemented at EMBL-EBI

The EMBL-EBI participates in a number of projects contributing to the development of an e-Science and grid infrastructure for biology, many of them based or partly based on Web Services. The use cases presented below show the symbiosis between CORBA and Web Services (the "Soaplab" use case) and between a CORBA-adopted specification and Web Services (the "BQS" use case). Both use cases are also used in myGrid, a multi-organisational project, funded by EPSRC, that is developing infrastructural middleware for an "e-Biologist's" workbench.

OpenBQS

OpenBQS provides a freely available implementation of the Bibliographic Query Service specification that was standardised and approved by the Object Management Group. The implementation includes a Web Service providing access to the MEDLINE database with more than 11 million bibliographic citations.

This use case and its open-source implementation were reported at BOSC 2002.

http://industry.ebi.ac.uk/openBQS

Soaplab

Soaplab is a set of Web Services providing programmatic access to applications on remote computers. Because such applications, especially in a scientific environment, usually analyze data, Soaplab is often referred to as an Analysis Web Service.

Soaplab is both a specification for an Analysis Service (based on an OMG-approved specification for sequence analysis) and its implementation. The EMBL-EBI has a Soaplab service running on top of several tens of EMBOSS analyses.

Soaplab does not access individual analysis programs directly; instead, it uses a general-purpose package, AppLab, that hides all the details of finding, starting, controlling, and using application programs. AppLab uses CORBA, but this is hidden from Soaplab users as an implementation detail. Soaplab thus documents how several distributed techniques can be successfully combined in a layered architecture.

http://industry.ebi.ac.uk/soaplab

Homepage: http://industry.ebi.ac.uk/soaplab/
28 June, 10:35 - 10:50

The Phylogenetic Analysis Library project

Matthew Goode (University of Auckland) / Korbinian Strimmer (University of Munich) / Alexei Drummond (University of Oxford)

The Phylogenetic Analysis Library (PAL) project is a collaborative effort dedicated to providing a high-quality Java library for use in molecular evolution and phylogenetics. It provides a growing object-oriented resource for phylogenetic tree inference and analysis, including maximum-likelihood methods. Support is included for coalescent and alignment simulation, alignment manipulation (data-type translation, bootstrapping), and statistical analysis. Future development will add, amongst other things, sequence alignment and tree searching.

PAL is released under the LGPL licence, and is available from http://www.cebl.auckland.ac.nz/pal-project/.

Reference: Drummond, A., and K. Strimmer. 2001. PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17: 662-663.

Homepage: http://www.cebl.auckland.ac.nz/pal-project
28 June, 10:50 - 11:20

GMOD

To be written

Homepage:
http://www.gmod.org/
28 June, 11:20 - 11:35

A new algorithm for nonmetric multidimensional scaling method

Y-h. Taguchi, Dept. Phys. Chuo University, Japan
Y. Oono, Dept. Phys., UIUC, USA

We have developed a new algorithm for the nonmetric multidimensional scaling method (nMDS). It is at present implemented in Fortran 77. The following features of our algorithm may be worth emphasizing.

  1. Conceptually transparent: in contrast to conventional nMDS, which requires a rather artificial "disparity" to compute the stress to be minimized, our algorithm directly minimizes the difference between the rank order of the dissimilarities and that of the distances in the embedding space.
  2. Computationally efficient: our algorithm avoids time-consuming isotonic regression, so it is much faster than the conventional ones. The computational time is of order N^2 log N, where N is the number of objects under study, because the number of iterations is usually less than 100. 3,000 objects can be handled easily on low-speed PCs (e.g., a 1 GHz Celeron PC with 526 MB of memory).
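The rank-order objective of point 1 can be made concrete with a toy sketch (Python here for brevity; the actual implementation is Fortran 77 and minimizes a smooth variant of this quantity): count the pairs of entries on which the ordering of the input dissimilarities and the ordering of the embedding-space distances disagree.

```python
def rank_discordance(dissimilarities, distances):
    """Count discordant pairs between two equally indexed lists.

    The nMDS idea: a perfect embedding orders the pairwise distances
    exactly as the input dissimilarities are ordered, so the count is 0;
    the optimizer drives this mismatch down.
    """
    n = len(dissimilarities)
    bad = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is discordant when the two orderings disagree.
            if ((dissimilarities[i] - dissimilarities[j])
                    * (distances[i] - distances[j])) < 0:
                bad += 1
    return bad

perfect = rank_discordance([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
inverted = rank_discordance([1.0, 2.0, 3.0], [0.3, 0.2, 0.1])
```

This direct formulation is what removes the need for the disparity/isotonic-regression step of conventional nMDS.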

As an application of our algorithm to bioinformatics, we present the analysis of microarray data of Caenorhabditis elegans gene expressions. The relationship among genes is visualized in a 3D space.

Homepage:
http://www.granular.com/MDS/
28 June, 11:35 - 11:50

Object model and C++ modules for handling multi-platform microarray data

Andrey Ptitsyn, Pennington Biomedical Research Center, Baton Rouge, LA, USA

Microarray expression analysis is one of the most rapidly developing areas of computational biology. However, open source software for microarray analysis is still one of the least represented areas among the Open Bioinformatics Foundation projects. One of the major challenges for the computational biologist is the multiplicity of competing microarray technologies and the high cost of microarray equipment. Only a handful of the world's biggest research centers can afford to support all major microarray platforms. The current state of microarray technology is vividly reminiscent of the computer industry of the '60s and '70s, with rapid development, revolutionary ideas, a great variety of hardware and operating systems, fierce competition, and little effort by rival developers to provide platform independence for the end user. Object-oriented programming is one of the developments of that era that helped to build platform-independent software. In our opinion, the same good old design style can mitigate the incompatibility problems of microarray analysis software. At PBRC we have developed an object model that can handle most of the data acquired in microarray experiments and re-use the code for data analysis algorithms in multiple applications for diverse microarray platforms. The first implementation of this model is in C++ for a number of reasons: C++ is computationally efficient, highly portable, widely available, highly standardized, and has a complete set of object-oriented features. Once developed, C++ modules can be re-implemented in Perl, Java or Python relatively easily, while the opposite conversion can be more problematic. One of the goals of this presentation is to establish cooperation with the OBF project developers and to get some help porting this object model to other languages.

Our microarray object model has two layers. The first layer provides the interface to a particular microarray platform. Currently we have implemented modules for spotted arrays, Affymetrix GeneChip and Clontech plastic arrays. The second layer provides abstract data classes and implements various conditioning, scaling, normalization and clustering algorithms. A few applications have been developed with the help of these C++ modules, representing some of the extremely incompatible ends of microarray analysis: a) a command-line PC program for local and global linear and LOWESS normalization of spotted arrays; b) a program for LOWESS normalization of Atlas Rat Plastic Arrays with local background correction; and c) a parallel multi-processor application for expression profile clustering.


28 June, 13:30 - 13:35

BIOgopher: Integrating spreadsheets with large bioinformatics databases

Joshua Phillips
SAIC Advanced Information Technology Center, Annapolis, MD
NCI Center for Bioinformatics, Gaithersburg, MD

BIOgopher is an ad hoc querying and reporting tool that enables researchers to annotate spreadsheets with the NCICB's Cancer Bioinformatics Infrastructure Objects (caBIO) data. In particular, BIOgopher presents a web-based, graphical user interface with which a user can build complex queries that incorporate data from any number of user-supplied Excel spreadsheets. The results of such queries are then delivered to the user as an Excel spreadsheet in which the user's data and the caBIO data are merged.

For example, suppose a researcher has a spreadsheet that consists of multiple columns and that one of those columns contains GenBank accession numbers. Further, suppose that the researcher would like to include in that spreadsheet information about all cellular pathways associated with each of those accession numbers. BIOgopher enables the researcher to accomplish this task without any knowledge of SQL or programming. All the user need do is indicate which column contains accession numbers and specify what pathway information should be included. BIOgopher then creates a new spreadsheet that contains the selected pathway information for each gene in addition to the researcher's original data.

BIOgopher is open source, and it makes heavy use of other open source components; foremost among these is caBIO, which provides a unified, powerful API to a wide range of public data sources. It also leverages Struts and POI from the Apache Software Foundation's Jakarta Project. Furthermore, BIOgopher itself was designed for reuse: developers can reuse its components to build applications with similar functionality.

The NCICB makes BIOgopher and the caBIO interfaces available on its public servers. BIOgopher and caBIO source code is available under the caBIO open source license which can be found at http://ncicb.nci.nih.gov/core/caBIO/technical_resources/core_jar/license. Further information is available at http://ncicb.nci.nih.gov/core.

Homepage: http://cabio-prot.nci.nih.gov/BIOgopher/
28 June, 13:35 - 13:40

A Rapid Method for the Intelligent Design of Enterprise Java Beans

Chad Matsalla

The use of the Java programming language to create software components designed and deployed according to the Enterprise Java Bean specification is a widely accepted way to create reliable, modular, and distributable objects. As one might expect, these software components can be complex to design and deploy in a way that delivers the platform independence promised by the Enterprise Java Bean specification.

Many designers rely on development tools to guide them through the procedures necessary to create logic classes, support classes, and deployment descriptors. The use of these tools comes at a price: the designer may not actually understand the reasons for doing certain procedures and the tools often irreversibly bind the deployment of software designed with the tools to a given deployment container.

A completely open source alternative is available that provides flexible support for the creation of the support classes and deployment descriptors necessary to successfully deploy EJBs into the container of the developer's choice. The use of Ant, XDoclet, and JBoss is presented as a rapid way to develop Enterprise Java Beans, web services, Java Servlets, and Java Server Pages.

Agriculture & Agri-Food Canada is in the process of implementing a web services framework based on objects created using this system. This will be a trans-Canadian resource for sharing data and services used in Brassica, wheat, soybean, and corn genomics.


28 June, 13:40 - 13:45

BASE (BioArray Software Environment)

Carl Troein
Department of Theoretical Physics, Lund University

BASE is a free microarray database system with a clean and intuitive web interface. Designed by microarray biologists, it offers a natural work-flow for microarray experiments, all the way from microarray production to the analysis of results. The system is meant to be set up as a local repository for a microarray laboratory's data, including raw images if so desired.

BASE has been designed to work with a variety of microarray platforms, and the array LIMS part can be entered at different levels or not at all, depending on what level of information is available or relevant. This, together with flexibility in what raw data is stored in the database and support for an arbitrary number of channels in the analysis part, makes it possible to use BASE not only with 2-channel cDNA arrays, but also with Affymetrix and other platforms.

The analysis part of BASE revolves largely around a simple plugin interface in which data is passed as tab-separated files. This makes it a simple task to get existing analysis tools to work with BASE. A recent user-contributed plugin demonstrates how the statistics language R can be used from BASE.

The main part of BASE is written in PHP. It works against a MySQL database, and recent versions have added PostgreSQL support. We have chosen to release BASE under the GNU General Public License, and since the release of version 1.0 last year, a small community has grown around the project. In the latest version, 1.2, contributions from outside the original team have started to appear. In parallel with the development of versions 1.x, a small team has started planning and designing BASE 2.0, which will be a re-implementation in Java, drawing on the know-how gained in creating BASE 1.

Homepage:
http://base.thep.lu.se/
28 June, 13:45 - 13:50

Biological Variation Markup Language (BVML)

Greg Tyrelle
School of Biotechnology and Biomolecular Sciences, Univ. of New South Wales

BVML is an XML implementation of a data model for personalised medicine. The tool kit makes use of Biopython and other Python libraries.

Homepage:
http://bioinformatics.kinglab.unsw.edu.au/
28 June, 13:50 - 13:55

Nodalpoint, a bioinformatics weblog

Greg Tyrelle
School of Biotechnology and Biomolecular Sciences, Univ. of New South Wales

A weblog that integrates community-moderated PubMed citations, very similar to a recent proposal in Nature. The site is implemented using open source software and the EUtils libraries from BioPHP.

Homepage: http://www.nodalpoint.org/
28 June, 13:55 - 14:00

BioPHP

Greg Tyrelle
School of Biotechnology and Biomolecular Sciences, Univ. of New South Wales

bioinformatics + PHP knowledge = BioPHP

Homepage:
http://www.bioinformatics.org/biophp
Back to BOSC2003 home page