BioPerl List Summary – January 2003
As I seem to be volunteering for more and more BioPerl documentation
jobs recently, I thought I’d pool my resources and recycle some of my
tuits to write a list summary. Expect these to be sporadic and
incomplete; my goal is to highlight important questions, changes,
fixes, and proposals, not recapitulate all list traffic.
Introduction
As I seem to be volunteering for more and more BioPerl documentation
jobs recently, I thought I’d pool my resources and recycle some of my
tuits to write a list summary. Expect these to be sporadic and
incomplete; my goal is to highlight important questions, changes,
fixes, and proposals, not recapitulate all list traffic. I’ll try to
include appropriate links to specific messages, or at least to the
parent message. It’ll probably take me awhile to get good at this, so
please bear with me (and do send any suggestions).
To play a bit of catch up, I’m now going to loosely summarize the
entire month of January (leaving a few topics untouched that are
better addressed in February). February’s summary will be ready soon,
after which you’ll see more easily digestable weekly (or perhaps
bi-weekly) summaries. I’ll also be posting the HTML-ized summaries on
my O’Reilly weblog with active hyperlinks.
One item from December 31 of 2002 bears mentioning: Ewan Birney
released stable version 1.2, with significant new functionality, and
important updates to code that makes use of NCBI web services;
upgrading is highly recommended, although some of the January list
activity reflects small trials and tribulations with this release.
http://makeashorterlink.com/?S17521DA3
Questions
- Searching the mailing list archives
This seemed like an appropriate topic to put at the top of my list.
The Bioperl-l mailing list isn’t exactly as high-traffic as
perl5-porters or the linux kernel mailing list, but it is a mixture
of both deeply technical development issues and novice user
questions. While the BioPerl tutorial and documentation are the
first places one should look for answers, the second place must be
the archives of the mailing list. Brain Osborne pointed out that
“the Search box is hidden below the Thanks link at www.bioperl.org”.
It wasn’t mentioned, but the “htdig” link Hilmar Lapp pointed out
(which is also below the search box) does not actually index the
bioperl mailing list, but seems to search all other OBF-affiliated
lists (biojava, biopython, etc) …
http://users.bioperl.org/htdig/
Michal Kurowski pointed out that “the quickest way of accessing old
postings seems to be a group archive from the mailman pages” and that
“you can even download the whole thing and use it as a local mailbox”,
which happens to be very useful if you want to write list summaries.
Mailman archives are at:
http://bioperl.org/pipermail/bioperl-l/
- Bioperl 1.2 builds under cygwin
John Nash reports that he was able to build the 1.2 distribution
under cygwin once MakeMaker issues were overcome (in his case by
upgrading to perl 5.8.0). Other tips are provided:
http://makeashorterlink.com/?S23631DA3
http://makeashorterlink.com/?M27643DA3
- Getting/untarring the 1.2 distribution
Some people had trouble either FTPing the 1.2 distribution, or with
successfully untarring the tarball. These problems seemed to have
resolved by themselves, and may have been related to router issues at
the server. For the record, bioperl-1.2 can be found at:
http://www.bioperl.org/ftp/DIST/bioperl-1.2.tar.gz
- man pages with bioperl-1.2
People may have noticed that the “make” process for bioperl-1.2 does
not generate nor install man pages. Ewan Birney explains, “In 1.2 we
had to drop the manifyfication stage of the makefile because it was
triggering a line-too-long error on some OSs due to shell
constraints”. If you wish to get them back, comment (or delete) out
the MY::manifypods sub in Makefile.PL
http://makeashorterlink.com/?F10761DA3
- Converting ABI trace to Phred format
When asked why an ABI trace file read via SeqIO::abi didn’t generate a
Bio::Seq::SeqWithQuality (a sequence with associated quality values),
Aaron Mackey replied, “I’m not sure why abi.pm in the bioperl
distribution doesn’t set it’s sequence factory to SeqWithQuality”; I’m
still not sure why. See the fix at:
http://makeashorterlink.com/?H2C954DA3
- biocorba status
When asked about the status of the biocorba project, Jason Stajich
replied, “We have working bindings in java,perl,python and bridges to
the respective Bio* toolkits from these bindings for servers and
clients based on a slightly modified BSANE IDL spec from OMG”. He
qualified that statement with “none of the original developers are
using it in any of their work so development and final rounds of
testing have not really happened”
http://makeashorterlink.com/?G57A42DA3
- DNA Smith-Waterman
Yee Man has reimplemented the classic Smith-Waterman algorithm, with
algorithmic improvements as suggested by Gotoh (affine gaps) and Myers
& Miller (linear space), and wondered whether it would be a good
addition to the BioPerl C-coded extension library (which currently
contains a protein-only Smith-Waterman implementation by Ewan Birney,
pSW.pm). Some discussion about classic (and novel) dynamic
programming algorithms ensued, which eventually boiled down to a
desire to have the generic (but extremely fast) Smith-Waterman code
(written by Webb Miller) used by Bill Pearson’s SSEARCH implementation
made more widely available as a linkable C library (which BioPerl
could then subsume). Interested parties should contact me.
Relatedly, to answer one of our FAQ’s yet again, if you currently want
to do Smith-Waterman on DNA sequences, you should use BioPerl’s
bindings to the EMBOSS suite of sequence utilities.
http://makeashorterlink.com/?Y20B21DA3
http://makeashorterlink.com/?M12B23DA3
- using AUTOLOAD for get/set accessors
The BioPerl code is full of explicitly coded accessor methods; often
we are asked why we don’t use more code-efficient methods of
autogenerating these identical functions (via AUTOLOAD or
Class::MakeMethod). The discussion is long-ranging, but it boils down
to wanting every accessor to have the same functionality with respect
to undef values and return value behavior, as dictated by our accessor
“boilerplate” (which we kindly ask everyone to use). Yes, we know we
can achieve that via sophisticated Class::MakeMethod usage, but we
have bigger fish to fry at the moment. There’s another, subtler
issue about interfaces and implementation method introspection, but
I’ll leave that to a later discussion.
http://makeashorterlink.com/?Z22C32DA3
- Bio:Seq no longer a RangeI (bug in Bio::Graphics::Panel)
Much to the consternation of Lincoln Stein (and his legions of
Bio::Graphics users), BioPerl 1.2 introduced a change to Bio::Seq in
that it no longer complies with the Bio::RangeI interface; see Heikki
Lehvaslaiho’s “This has to be cruft!” message from November:
http://makeashorterlink.com/?Q53C41DA3
Unfortunately, Bio::Graphics::Panel relied on Bio::Seq having a
“start” method, so lots of existing code broke. A number of fixes
were recommended, including a) using a Bio::Seq::SeqFactory to
generate Bio::LocatableSeq’s (which do implement RangeI methods), b)
patching your Bio::Graphics::Panel and c) upgrading BioPerl 1.2 to
the live CVS development version. A BioPerl 1.2.1 is forthcoming for
this, and other reasons.
http://makeashorterlink.com/?R17C61DA3
http://makeashorterlink.com/?S3AC21DA3
- complement(join(e1, e2)) vs. join(complement(e1),
complement(e2))
Periodically, people ask “Is it possible to have bioperl output
features in Genbank format of the form
“complement(join(1..50,60..100))” rather than
“join(complement(1..50),complement(60..100))?” This time it
degenerated a little into a discussion about whether these two
representations were semantically equivalent (short answer: yes). The
answer to the original question is that BioPerl parses either
representation into the same structure, which can only be “dumped” in
one representation (presently, the latter).
http://makeashorterlink.com/?H2CC16DA3
- GenBank
bond()
FT operator
Recent GenBank files have begun to exhibit a new feature location
operator, “bond”, to identify dicysteine bonds in proteins and mRNA
splice sites in RefSeq sequences. BioPerl has no concept of this
location operator (which is really more of a feature, and would
be better represented as a /bond feature table entry), and so
currently dies when parsing a record containing it. A brute force
fix is provided, but a better answer is yet to appear:
http://makeashorterlink.com/?L24D12DA3
Changes/Additions
- SearchIO now has megablast parser
Jason Stajich writes, “The oft requested megablast parser has now been
implemented in SearchIO”. This should be available in the upcoming
bioperl 1.2.1 bug-fix release, as well as CVS.
http://makeashorterlink.com/?O29D62DA3
- bl2seq parser needs to know report type to get strand right
No matter how hard he tried, Dave Arenillas couldn’t retrieve HSP
strand information from a bl2seq (BLAST two sequences against each
other) report. After “a little bit of detective work”, Jason Stajich
found that the Bio::Tools::BPbl2seq report object needs to be told
the program type (e.g. “blastn”) since it’s not smart enough to guess
it by context alone. The patch to BPbl2seq.pm is available via CVS.
http://makeashorterlink.com/?M2ED52DA3
- bioperl.rpm in biolinux.org distribution
Marc Logghe reports that “A couple of friends of mine have started up
www.biolinux.org [ … and] are offering a number of rpm packages for
free download like e.g. emboss, sim4, phylip, ncbiblast, …”. After
some discussion about what a bugbear packaging BioPerl can be (most
dependencies are not critical for the entire package, only certain
subparts that may or may not be useful for a given person), Hunter
Matthews chimed in that he’d likely be able to make a bioperl 1.2 rpm
(he had previously made a 1.0 rpm); Hunter adds “7.3 would be the most
likely target platform”. Marc Logghe later reported that “the bioperl
rpm’s for RedHat and Suse are on line at http://biolinux.org/bioperl.html”
http://makeashorterlink.com/?S16E22DA3
- example scripts reorganization for installation as “production” code
Spurred on by an earlier conversation regarding the perl scripts
scattered between examples/ and scripts/, Brian Osborne has taken up
the challenge to reorganize and reshape these so that those functional
scripts with adequate POD and a .PLS suffix all live in scripts/ and
get installed for “production” use. Scripts should remain in
examples/ if they are simply “proof-of-concept” code, or just poorly
documented. Everyone liked this, and from the CVS activity, it looks
like the work is progressing.
http://makeashorterlink.com/?R29E32DA3
- Bio::Seq::SequenceTrace
Chad Matsalla has added a Bio::Seq::SequenceTrace object, to “mimic
the information available in a scf ‘Sequence Chromatogram File”‘. It
slices and dices. In the process, Chad ended up rewriting his
Bio::SeqIO::scf code, “because the old module was somewhat …
clunky”.
http://makeashorterlink.com/?R3AE25DA3
- MLAGAN/LAGAN support
Stephen Montgomery has supplied both MLAGAN and LAGAN wrappers and
parsers (the Lagan Tookit is a set of alignment programs for
comparative genomics).
http://makeashorterlink.com/?N11F24DA3
Fixes
- SeqIO/scf.pm bug
Tony Cox “finally got around to checking in a fix for the SeqIO/scf
module when it has to deal with 8-bit encoded trace data”. It’s not
yet clear where this fix stands with Chad Matsalla’s rewrite of
Bio/SeqIO/scf.pm
- Bio::Tools::Run::WrapperBase.pm missing from 1.2
Because of some code migration between bioperl subprojects,
Bio/Tools/Run/WrapperBase.pm went missing in the 1.2 release, causing
a wide variety of failures. The 1.2.1 release will address this, or
you can retrieve the missing file and install it manually from here:
http://makeashorterlink.com/?B20015EA3
- bug fixes in Blast HSP tiling code
After finding that for certain BLAST reports the “blast” and
“psiblast” SearchIO parsers gave mildly differing values for
“frac_identical” and “frac_conserved”, Jason Stajich did some
auditing of the HSP tiling code and found a few inconsistencies which
have since been fixed. End result: frac_identical and frac_conserved
should be better behaved and actually correct.
http://makeashorterlink.com/?V26053EA3
Proposals
- Project ideas for the aspiring biohacker
Periodically, we’re asked “I’d like to get involved, do you have any
project ideas a newbie could work on?”. Jason Stajich shot out a few
choice ideas including “blastz” SearchIO parsing, which was briefly
discussed. Get involved!
http://makeashorterlink.com/?D28022EA3
- Bio::Perl namespace export groups
The Bio::Perl module is a top-level, “novice” interface to a few small
tidbits of BioPerl functionality. Many first-time users appreciate
the simplicity of the Bio::Perl interface, and so we’d like to extend
it’s reach into other “meaty” areas of BioPerl functionality. Here we
talk about how we might achieve this using custom export tags (a la
CGI.pm and others). Another area where someone could make a dramatic
impact without writing any new modules!
http://makeashorterlink.com/?D4A025EA3
Conclusion
Well, that’s it for this installment. Stay tuned for February (a
much busier month!).