BioJava: Open Source Components for Bioinformatics

                Thomas Down and Matthew Pocock

        The Sanger Centre, Hinxton, Cambridge, CB10 1SA, UK
                       {td2,mrp}@sanger.ac.uk


The BioJava project is an open source effort to build a library of
useful routines and components for developers of bioinformatics
applications.  The project is now almost two years old and has
developed rapidly since the 1.0 release in August 2000.  This year, we
passed the milestone of 10 active contributors, and are currently
working on a new major release.  BioJava is licensed under the terms
of the LGPL, and is used extensively in both academic and commercial
environments.

The current BioJava APIs are centred around manipulating, integrating,
analysing, and visualising biological sequence data.  Components in
the Sequence framework include a full range of flat-file parsers, a
client for the Distributed Annotation System, and a library for Hidden
Markov Model sequence analysis.  A separate component offers seamless
BioJava access to Ensembl sequence databases.  Over the coming year,
we hope to supplement the sequence framework with code for handling
structural and expression data.
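
To make the idea of a flat-file parser concrete, the following is a
minimal sketch in plain Java of reading FASTA-formatted records and
running a simple analysis over them.  It is not BioJava's parser API:
the class FastaSketch and its methods parseFasta and gcContent are
hypothetical names chosen for this illustration, and the sketch
assumes that the identifier is the first whitespace-delimited token of
each header line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain-Java FASTA reader, NOT BioJava's parser API.
// All names in this class are hypothetical.
final class FastaSketch {

    // Parses FASTA text into an insertion-ordered map of identifier -> residues.
    static Map<String, String> parseFasta(String text) {
        Map<String, String> records = new LinkedHashMap<>();
        String id = null;
        StringBuilder residues = new StringBuilder();
        for (String raw : text.split("\r?\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // A header line closes the previous record, if there was one.
                if (id != null) {
                    records.put(id, residues.toString());
                }
                // Assumption: the identifier is the first token after '>'.
                id = line.substring(1).trim().split("\\s+")[0];
                residues.setLength(0);
            } else if (id != null) {
                // Sequence lines of one record are concatenated.
                residues.append(line);
            }
        }
        if (id != null) {
            records.put(id, residues.toString());
        }
        return records;
    }

    // Returns the fraction of residues that are G or C (0.0 for an empty sequence).
    static double gcContent(String dna) {
        if (dna.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (char c : dna.toUpperCase().toCharArray()) {
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / dna.length();
    }

    public static void main(String[] args) {
        String flatFile = ">seq1 example record\nATGC\nGGCC\n>seq2\nTTTT\n";
        for (Map.Entry<String, String> e : parseFasta(flatFile).entrySet()) {
            System.out.println(e.getKey() + " length=" + e.getValue().length()
                    + " gc=" + gcContent(e.getValue()));
        }
    }
}
```

A library parser would normally read from a stream and build richer
sequence objects carrying an alphabet and annotations; the map of
strings used here only keeps the sketch short.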

BioJava has particularly strong links with XML technologies.  Parsers
are included for the GAME and XFF schemas.  We also use a powerful
XML-based framework for handling output from database-searching
programs such as BLAST and FASTA.
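
One way to consume search results once they are available as XML is
event-driven SAX parsing, sketched below with the standard Java XML
APIs.  The element name "hit" and the attributes "id" and "score" are
hypothetical and chosen only for this illustration; they are not the
schema that BioJava uses, and the class SearchHitSketch is likewise an
assumption of the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: collects hits from a HYPOTHETICAL XML search report.
final class SearchHitSketch {

    // Returns one "id:score" string per <hit> element, in document order.
    static List<String> hits(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // React only to the (hypothetical) <hit> elements.
                if ("hit".equals(qName)) {
                    found.add(atts.getValue("id") + ":" + atts.getValue("score"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("could not parse search report", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String report = "<report program=\"example\">"
                + "<hit id=\"seqA\" score=\"120\"/>"
                + "<hit id=\"seqB\" score=\"87\"/>"
                + "</report>";
        for (String hit : hits(report)) {
            System.out.println(hit);
        }
    }
}
```

Because the handler sees one element at a time, a large report can be
processed without holding the whole document in memory.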