Georg Fuellen
Mon, 17 Feb 1997 12:29:15 +0000 (GMT)


Thanks for the extensive feedback !

> [Note, these were copy-and-pasted from a wordprocessor, as I wrote the
> note at home.  Some charaters, like apostrophes, didn't copy right.  I
> hope no information is lost from this.]
> Wow!  I cant believe how far this code has come since the drafts I read on
> airplanes in December.  Both the Bio::Seq and Bio::Aln modules were really
> a joy to read though.  The code was typically clear, well-documented, and
> effectively implemented.  I cant thank you guys enough for doing such a
> great job.
> I spent most of my time on Bio::Seq, and will address that first.  As I
> said, the code looks really great.  I found  small number of bugs and have
> a couple of (hopefully) small design issues.  Once they are resolved, I
> think that the code can be packaged and sent to CPAN as a public beta.
> Most of the comments are written on paper and I dont have time to
> transcribe them.  Who should I mail them to (and should I send a photocopy
> to anyone else)?  Please give me a postal address.

For Bio::Seq, pls mail your comments to Chris (I don't know his snail address).
For Bio::Aln, use Georg Fuellen, Univ. Bielefeld, Techfak - AG PI,
Postfach 100131, D-33501 Bielefeld, Germany
Of course, it may turn out to be useful if you mail all your comments
about both modules to both of us.

> The main nits were:
>   [ \t] seem to be sometimes used as the opposite of \S when \s would be a

IMHO, the first line of Fasta has to match /^>[ \t]*(\S*)[ \t]*(.*)$/
and if we allow \s instead of [ \t], the first line of the actual sequence
could end up in the $head, no?! At least, I doubt that the regexp 
I developed for multiline Fasta would still work, i.e.
$ent =~ /^>[ \t]*\S*[ \t]*.*?\n(?:\n|.)*?(?=\n>|\Z)/mg
Test it with

> better option.
>   Some of the package globals use hashes rather than the arrays that they
> should

Here you need to be more specific (email or snailmail)

>   There needs to be a validly-check bit somewhere in the code; is this
> object valid or could it be bezerk?

This is not needed if there's never a need to allow the construction
of invalid objects, or permit accessors without an underscore to make an 
object invalid.  But probably there is such a need ?! So how about 
$seq->{'valid'} ? (Or, Steve Ch's stuff may be handy ?!)

>   The alphabet checking idea and implementation both seem to have problems

Yup. Your advice needed :)

>   No mechanism of undefining values was provided

You mean, accessors which turn an internal hash value to 'undef' ?

>   Several functions return strings rather than Bio::Seq objects

Which function should return Bio::Seq, except copy() ?

>   The layout functions should be autoloaded.  This can wait for later
> releases, though.
>   The numbering is limited to numerics which increment uniformly
>   The parsing code seems to be a bit schizophrenic about strings versus
> filenames

Yeah, see below.

> All of these issues can be dealt with pretty neatly, I think, except the
> last one.  The actual code for doing the parsing was quite nice.  However,
> there was the weird and poorly documented weird behavior regarding passing
> around both $ent and $filename.  This seemed like duplication and a good
> way for code to break in the future.  Moreover, right now the code is
> quite inefficient in the way it reads in the entry (doing intrinsic
> parsing) and then later tries to do the "real" parsing.  In addition to
> causing a potentially problematic time delay, it can also use enormous
> amount of memory if large files (e.g., a genome of 4.5Mb) were to be read
> in.
> I realized that this was the legacy of the way I had originally designed
> the code and that you had simply tried to work-around my approach in a
> general way.  Moreover, this half of one, half the other 
> I think that now is the time to reconsider this approach and, perhaps, to
> redesign (without, hopefully, too much re-coding) the whole file input.
> On reconsideration, I think that it would be best for the object to take
> an entirely file-based view of parsing data formats.  That is, all parsing
> will rely on the data coming from a file.  If the data isnt in a file, it
> will be written to a temporary file.  The only exceptions (at least
> initially) would be raw and fasta, which would be parsed by entirely
> different functions, separate from the ordinary parsing functions.
> I think that while redesigning the code to approach things this way, we
> should also address the issue of parsing (and writing) multiple sequences.
> This requires some sort of Bio::Seq::Set object which simply contains a
> set of sequences, probably keeping them in order (so it isnt really a set,
> but a list).  Both Bio::Seq::Set and Bio::Seq could draw on the same

Let's use ``Bio::Seq::List''! ``Set'' implies that each element _must be 
unique_. {1,1} is exactly the same as {1} in terms of set notation. 
Otherwise, it's a bag, or multiset ({1,1} != {1} in terms of bags.). 

> parsing routines, perhaps in Bio::Seq::Parse.   (This also opens a can of
> worms, because some files with indices are expensive to open and close, so
> it is worth trying to keep them open all the time if they will be
> repeatedly accessed.  I have code to deal with such files, but that could
> always be folded in later without affecting the interface to Bio::Seq.
> In order to expedite matters, I suggest that Bio::Seq::Set be given an
> entirely minimal interface initially, probably just the ability 1) to
> produce a list of all included Bio::Seqs, NOT necessarily in order and 2)
> to return a Bio::Seq with a requested id.
> What do people think of this idea?

Personally, I agree. If you really hate ``Bio::Seq::List''/``Bio::Seq::Bag''
I'd even get along with ``Bio::Seq::Set'' :-) 

> -=-=-
> Bio::Seq::Aln was also very enjoyable to read, though I was able to spend
> less time going through it in detail.  It seems simple yet powerful and
> should be a very useful tool.   It does, however, focus very strongly on
> the aspects of the alignment object which Georg required for his nucleic
> acid alignments and essentially omits entirely those features which were
> crucial for typical protein alignments.
> Some of the missing features are:
>   It has no support for non-numeric and non-contiguous numbering

Right. I won't have time to add this, so we'd need someone to take
care of numbering issues in both Bio::Seq and Bio::Aln.

>   There is no linkage between the original sequence and the alignments,
> especially for numeration.  (For example, there is no way to retrieve the
> column corresponding to the 5th residue of protein foo_bar)

Right, though %names and _select_flat can be used for this, or something like 
what Bio::Rel was intended to do (heh, not that again ;-)

>   It is inconvenient and inefficent at performing typical protein
> alignment operations, such as extracting a column as a string

Hm? What about a simple wrapper around _select_flat, or a degenerated
version of it ? 

>   No mechanism for dealing with structural alignments
> Moreover, it seems that the internal design of the code makes
> implementation of these features extremely difficult.  Lest there be any

I'm really not sure. Do you mean that storing the alignment as an
array of array references is already problematic ?
If no, what is problematic ?
I think array of array references is the most efficient if you need
rapid access to the data (using strings or Bio::Seq's would slow this
down considerably). Unless, of course, you move towards PerlDL / C plug-ins.

> doubt, I should make it clear that I dont blame Georg for this; I think
> that it would have been hard, if not impossible, to reconcile the
> different requirements.  That he only implemented the features he really
> requires is also entirely reasonable and understandable; that he
> implemented so much in the first place is totally commendable.
> However, I have to ask if a module whose basic design effectively
> precludes frequently-required operation should be permanently enshrined as
> the default alignment object.  I dont think so, but I also think that
> theres an easy solution.  Instead of calling the module Bio::Seq::Aln, why
> dont we call it Bio::Seq::GFAln (to credit Georg) or Bio::Seq::NucAln (to
> reflect its main use for nucleic acid alignments, even though it can be
> used for proteins).  Then someone else might write a module more focusing
> on protein needs called Bio::Seq::ProtAln.  Eventually, it may be possible
> to merge the objects to form a Bio::Seq::Aln.  Alternatively, one of them
> might grow dominant and its name could be changed to Bio::Seq::Aln (while
> old code would go on unaffected by the name change).
                       ^^^^^^^^^^^^I doubt that;)

Bio::NucAln would be ok, but _first_ let's check whether Bio::Aln
is really bad for Protein data. I doubt this very much !

> What do people think?  In terms of code-change, it should just mean
> changing the Makefile.PL and the package line in the module.
> As noted, I didnt spend as much time combing through the code of
> Bio::Seq::Aln.  All of my noted are on paper that Im mailing to Georg.
Looking forward !

> -=-=-
> A general note for both pieces of code is that we should assume that
> people will use the bioperl routines just like theyd use anything else
> from CPAN.  That is, they will download it and install it in the default
> way by just saying perl Makefile.PL; make all; make test; make install.
> Users should never be expected to edit the files.  Additional
> documentation about use lib is ok for people who might not have access to
> the system site_perl directory, but the INSTALLDIR notes are likely to be
> confusing and unhelpful. 
> Further, since people should not ever edit the modules, code which relies
> on external items (e.g., PGPLOT; ReadSeq) should be in a separate module
> file.  My feeling is that there should be a separate ReadSeq module which
> does nothing but provide a wrapper around the ReadSeq routines.

ok; I can give this make stuff another try, although I find it's an
unpleasant aspect of Perl, contrary to the original intent of just
having .pm files, and everything works ok and is simple...
also reflects that resentment, and the offerings -re-
``Building Related Perl Modules'' and
``Testing a module from a Perl program'' made me feel uneasy.
But as the page's author says, ``You get used to it. Really.'' Oh well.
Maybe you've got better recommandations than the ones on that page ?

> Steve
> P.S.  The Bio::Seq module on Georgs site seems to be corrupted; I couldnt

Which one ??
Actually, both are NOT on my site; Chris is offering 2 tar files, and for one 
of them, Netscape may remove the .gz without unzipping the file...
Or ? Chris, pls check !

> untar, uncompress, or gunzip it.  The one from Chriss site worked fine.
> P.P.S.  PLEASE, as soon as possible, send me the postal addresses to which
> my written comments should be sent.

See above..
best wishes,