[Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
Giovanni Marco Dall'Olio
dalloliogm at gmail.com
Thu Dec 4 13:06:08 EST 2008
On 12/4/08, Jason Stajich <jason at bioperl.org> wrote:
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.
Thank you very much for the answer.
I was going to post this mail again to the other OpenBio lists to see
if there were more people interested, but I decided to wait a bit
since I am not extremely confident with these concepts myself yet and
wanted to study them a bit more.
To be honest, I was going to discuss testing with some
professional programmers I met some months ago, who seemed very
confident with the concepts of testing, not to say that they are
For example, it is not clear to me how a use case could be written to be
most useful for all the different Bio::* projects, and other
> We actually started a project like this many moons ago, but no one
> contributed examples...
I see, thank you very much for the link.
On the biopython list they told me that a big issue is the license
under which the data is released. I have no objection to
contributing examples under the GPL or without a license, but I
understand that other people might.
Somebody told me that there were some interesting discussions on
scipy.org, but I couldn't find them.
> We can start a common SVN repository for this if you like or a github on
> OBF if that is more likely to garner contributions.
Well, to be honest I prefer git :). But any revision control system
is fine with me; moreover, these are examples and not code, so the
choice matters less.
What I think could be useful for this project is a ticket
tracking system, or rather a feature request system, to keep
track of all the things needed.
I once used a system called Assembla:
It seems very nice to use, but it is not open source, and maybe
Bugzilla would suffice.
> In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.
Well, basically I have three years ahead of me in which I will work
in the same field (I am a first-year PhD student in a population
genetics laboratory), and in theory I will have to write a lot of test
cases and controls anyway, which I don't mind contributing.
However, as I was saying before, I am not very experienced in writing
use cases, and it will take me a while (let's say some months) to learn
how to write them well.
> Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.
I agree; for example, bioperl's wiki has many useful descriptions of
file formats that the other Bio* projects lack.
> On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:
> > Hi!
> > My name is Giovanni, I come from biopython's mailing list.
> > I would like to make you a proposal.
> > Every module/program written in bioinformatics needs to be tested
> > before it can be used to produce results that can be published.
> > For example, let's say I want to write another fasta file parser, like
> > SeqIO.FastaIO in biopython: I would have to test the script
> > against some real fasta files, just to make sure that it doesn't parse
> > them in a wrong way, or that it doesn't lose data.
> > Or, let's say I want to write a script to calculate Fst statistics
> > over some population genetics data: I will have to compare the results
> > of my script against other programs, check that it gives me the right
> > result for a dataset for which I already know the Fst value, and maybe
> > devise some other kinds of checks to be sure my script doesn't do weird
> > things, like losing input data on the way.
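To make the Fst check above a bit more concrete, here is a minimal sketch of
a known-answer test. It assumes the simple textbook definition
Fst = (H_T - H_S) / H_T for a single biallelic locus with equally weighted
populations, which is not necessarily the estimator a real module would use.
The names fst_biallelic, KNOWN_CASES and check_fst are made up for
illustration only.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs):
    """Fst = (H_T - H_S) / H_T from per-population allele frequencies.

    Assumption: one biallelic locus, populations weighted equally.
    """
    if not freqs:
        raise ValueError("at least one population is required")
    # H_S: mean expected heterozygosity within populations
    h_s = sum(expected_heterozygosity(p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled population
    p_bar = sum(freqs) / len(freqs)
    h_t = expected_heterozygosity(p_bar)
    if h_t == 0.0:
        raise ValueError("Fst is undefined when the locus is fixed overall")
    return (h_t - h_s) / h_t


# Hypothetical shared test cases: (allele frequencies, known Fst value).
KNOWN_CASES = [
    ([0.5, 0.5], 0.0),   # identical populations: no differentiation
    ([1.0, 0.0], 1.0),   # fixed for different alleles: complete differentiation
    ([0.2, 0.8], 0.36),  # H_S = 0.32, H_T = 0.5
]


def check_fst(fst_function, cases, tolerance=1e-9):
    """Return True if fst_function reproduces every known value."""
    return all(abs(fst_function(freqs) - expected) <= tolerance
               for freqs, expected in cases)
```

The point is that KNOWN_CASES is only data: any Bio* project could run its
own implementation against the same list.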
> > So, the point is: what if we created a common repository for all this
> > kind of test data, to be shared among all the Bio*
> > projects?
> > Wouldn't it be good if all the Bio* fasta parsers were able to parse the
> > same files and gave the same results, demonstrating that all of them
> > work fine or are wrong at the same time?
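As a rough sketch of what such a shared test case could look like: a small
fasta file stored together with language-neutral expectations (record id,
sequence length, checksum), and a check that any parser can be run against.
Everything here is hypothetical: parse_fasta is only a stand-in for a real
parser such as the biopython or bioperl one, and in a real repository the
expected values would be stored as plain data, not computed in the script.

```python
import hashlib
import io

# Hypothetical shared test file.
FASTA_TEXT = """>seq1 first test sequence
ACGTACGT
ACGT
>seq2 second test sequence
TTGACA
"""

# Hypothetical expectations; computed here only to keep the sketch short.
EXPECTED = [
    {"id": "seq1", "length": 12,
     "md5": hashlib.md5(b"ACGTACGTACGT").hexdigest()},
    {"id": "seq2", "length": 6,
     "md5": hashlib.md5(b"TTGACA").hexdigest()},
]


def parse_fasta(handle):
    """Stand-in parser: yield (id, sequence) pairs from a fasta handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The id is the first word after '>'
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def summarize(records):
    """Reduce parsed records to values any language can compare."""
    return [{"id": seq_id, "length": len(seq),
             "md5": hashlib.md5(seq.encode("ascii")).hexdigest()}
            for seq_id, seq in records]


def check_parser(parser, fasta_text, expected):
    """Return True if the parser reproduces the expected summary."""
    return summarize(parser(io.StringIO(fasta_text))) == expected
```

A parser that silently drops a record or part of a sequence would fail
check_parser, which is exactly the kind of data loss I am worried about.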
> > I am proposing this because Tiago and I, on the biopython mailing list,
> > would like to develop a module to calculate Fst statistics over SNP data,
> > and there is no point in collecting some good test datasets and not
> > sharing them with similar projects in other programming languages.
> > The same goes for much of the documentation, like use cases: if we
> > collect a good base of use cases related to bioinformatics, it would
> > be easier to coordinate the efforts of all the Bio* projects and
> > compare the different approaches used to solve the same issue by the
> > different communities.
> > At the moment, I have created a simple git repository on github:
> > -
> > but it is still empty, and maybe github is not the ideal hosting for
> > such a project, since the free account has a 100MB space limit.
> > --
> > My Blog on Bioinformatics (italian): http://bioinfoblog.it
> Jason Stajich
> jason at bioperl.org
My blog on bioinformatics (now in English): http://bioinfoblog.it