[Open-bio-l] a common repository for test datasets/use cases for all Bio* projects

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Thu Dec 4 13:06:08 EST 2008

On 12/4/08, Jason Stajich <jason at bioperl.org> wrote:
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

Hi Jason,
thank you very much for the answer.
I was going to post this mail again to the other OpenBio lists to see
if there were more people interested, but I decided to wait a bit
since I am not extremely confident with these concepts myself yet and
wanted to study them a bit more.

To be honest, I was going to discuss testing with some
professional programmers I met some months ago, who seemed very
confident with the concepts of testing, not to say that they are

For example, it is not clear to me how a use case could be written to be
most useful for all the different Bio::* projects, and other

>  We actually started a project like this many moons ago, but no one
> contributed examples...
>  http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

I see, thank you very much for the link.
On the biopython list they told me that a big issue is the license
under which the data is released. I have no problem
contributing examples under the GPL or without a license, but I understand
other people might.
Somebody told me that there were some interesting discussions on
scipy.org, but I couldn't find them.

>  We can start a common SVN repository for this if you like or a github on
> OBF if that is more likely to garner contributions.

Well, to be honest I prefer git :). But any revision control system
is fine for me; moreover, these are examples and not code, so the choice
is less important.

What I think could be useful for this project is a ticket
tracking system, or rather a feature request system, to keep
track of all the things needed.

I once used a system called Assembla:
- http://www.assembla.com/spaces/biotest/tickets
which seems very nice to use, but it is not open source, and maybe
Bugzilla would suffice.

>  In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.

Well, basically I have three years ahead of me in which I will work
in the same field (I am a first-year PhD student in a population
genetics laboratory) and in theory I will have to write a lot of test
cases and controls anyway, which I don't mind contributing.

However, as I was saying before, I am not very experienced in writing
use cases, and it will take me a while (say, a few months) to learn
how to write them well.

>  Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

I agree; for example, bioperl's wiki has many useful descriptions of
file formats that the other Bio* projects lack.

>  -jason
>  On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:
> >
> > Hi!
> > My name is Giovanni, I come from biopython's mailing list.
> >
> > I would like to make you a proposal.
> > Every module/program written in bioinformatics needs to be tested
> > before it can be used to produce results that can be published.
> >
> > For example, let's say I want to write another fasta file parser, like
> > SeqIO.FastaIO in biopython: I would have to test the script
> > against some real fasta files, just to make sure that it doesn't parse
> > them incorrectly or lose data.
> > Or, let's say I want to write a script to calculate Fst statistics
> > over some population genetics data: I will have to compare the results
> > of my script against other programs, check that it gives me the right
> > result for a set for which I already know the Fst value, and maybe
> > devise some other kinds of checks to be sure my script doesn't do weird
> > things, like losing input data on the way.
> >
> > So, the point is: what if we create a common repository for all this
> > kind of test data, to be shared among all the Bio*
> > projects?
> > Wouldn't it be good if all the Bio* fasta parsers were able to parse the
> > same files and give the same results, demonstrating that all of them
> > work fine or are wrong at the same time?
> >
> > I am doing this because Tiago and I, on the biopython mailing list,
> > would like to develop a module to calculate Fst statistics over SNP data,
> > and there is no point in collecting some good test datasets and not
> > sharing them with other similar projects in other programming languages.
> >
> > The same goes for much of the documentation, like use cases: if we
> > collect a good base of use cases related to bioinformatics, it would
> > be easier to coordinate the efforts of all the Bio* projects and
> > compare the different approaches used to solve the same issue by the
> > different communities.
> >
> > At the moment, I have created a simple git repository on github:
> > - http://github.com/dalloliogm/bio-test-datasets-repository
> > but it is still empty, and maybe github is not the ideal hosting for
> > such a project, since the free account has a 100MB space limit.
> >
> >
> > --
> > -----------------------------------------------------------
> >
> > My Blog on Bioinformatics (italian): http://bioinfoblog.it
> > _______________________________________________
> > Open-Bio-l mailing list
> > Open-Bio-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/open-bio-l
> >
>  Jason Stajich
>  jason at bioperl.org
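
As a rough illustration of the shared-test-data idea in the quoted message, here is a minimal sketch of how one fasta test case could be checked by any parser. Everything in it is an assumption made for the example: the sample records, the expected values, and the names parse_fasta, summarize and check_parser are hypothetical and are not taken from biopython or from any existing repository. The idea is only that the expected results are stored as plain, language-neutral values (id and length), so that each Bio* project could compare its own parser against the same file.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from a fasta-formatted text handle.

    Hypothetical stand-in for whatever parser a Bio* project wants to test.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up shared test case: the input text plus the expected results,
# written as plain values that any language could load and compare.
SHARED_FASTA = ">seq1 example record\nACGT\nACGT\n\n>seq2\nTTGA\n"
EXPECTED = [
    {"id": "seq1", "length": 8},
    {"id": "seq2", "length": 4},
]


def summarize(records):
    """Reduce parsed records to the language-neutral fields being compared."""
    return [{"id": h.split()[0], "length": len(s)} for h, s in records]


def check_parser(parser):
    """Return True if the parser reproduces the expected summary exactly."""
    return summarize(parser(io.StringIO(SHARED_FASTA))) == EXPECTED


if __name__ == "__main__":
    print("parser agrees with shared expectations:", check_parser(parse_fasta))
```

A parser that silently drops a record or a sequence line would fail this check, which is the kind of data loss the proposal is worried about.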


My blog on bioinformatics (now in English): http://bioinfoblog.it
