[Open-bio-l] a common repository for test datasets/use cases for all Bio* projects

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Wed Dec 10 06:31:13 EST 2008

On 12/4/08, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Giovanni wrote:
>  > For example, it is not clear to me how a use case could be written
>  > to be as useful as possible for all the different Bio::* projects,
>  > among other things.
> In terms of use cases, I would imagine things like the following:
>  (1) Take a provided set of CDS nucleotide sequences in FASTA format,
>  translate them using NCBI codon table 11 (bacteria), and output the
>  results as a FASTA file of protein sequences.
>  (2) Take a provided set of protein sequences, and do pairwise
>  alignments between them all using the EMBOSS tool needle.
>  (3) Take a provided FASTA file of proteins, and run ClustalW on it
>  using the default settings.  Take the multiple sequence alignment in
>  ClustalW format, and convert it into Stockholm format.  Then build a
>  neighbour joining tree using quick-tree program (which cannot read in
>  ClustalW files directly).  Finally, load the tree file and produce a
>  cladogram where the taxon/leaf XXX is highlighted in red.
>  (4) Take a provided author name and keyword, and query the NCBI Entrez
>  web interface to get a list of matching papers.  Download these
>  references (maybe as MedLine format, maybe as XML) and parse the
>  result into a CSV file for input into your reference manager (e.g.
>  EndNote - or generate a bibtex file for use with LaTeX).
>  (5) Take a provided species name, and use NCBI Entrez to download
>  all matching EST sequences to a FASTA format file.
>  (6) Take a provided FASTA file of proteins and use standalone NCBI
>  BLASTP to search them against the NR database using an expectation
>  threshold of 10^-6 and at most ten alignments per query.  Parse the
>  results, and generate a new FASTA file of the protein sequences where
>  the description line includes the protein identifiers of closely
>  related entries found with BLAST.  [A more sensible approach to
>  automatic annotation would be nice, but more complicated]

OK, these are good examples.

I would also add a title to each one (e.g. "1: Translating a CDS
sequence"), just for convenience.

We could also add some examples of the expected outputs. Example: if
you apply the procedure in case 1 to the file COX1_cds.fasta, you
should obtain exactly the file COX1_protein.fasta.
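
To make the idea of an expected output concrete, here is a minimal
Python sketch of such a check for case 1. It is only an illustration
under some assumptions: it uses the standard library instead of any
Bio* toolkit, all function names (parse_fasta, translate_cds,
translate_fasta, matches_expected) are made up for this example, and it
compares parsed records rather than raw bytes, so differences in line
wrapping are tolerated. NCBI table 11 assigns the same amino acids as
the standard table and differs only in the allowed start codons, which
this sketch ignores.

```python
from itertools import product

# Compact NCBI-style codon table, codons enumerated with bases in TCAG order.
# Table 11 (bacterial) has the same amino acid assignments as the standard
# table; only the set of permitted start codons differs. This sketch ignores
# start codons, so an alternative start such as GTG is translated literally.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def translate_cds(nucleotides):
    """Translate a CDS up to the first stop codon; unknown codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def translate_fasta(text, width=60):
    """Translate every record in FASTA text and return protein FASTA text."""
    lines = []
    for header, seq in parse_fasta(text):
        protein = translate_cds(seq)
        lines.append(">" + header)
        lines.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(lines) + "\n"


def matches_expected(cds_text, expected_protein_text):
    """Check a translation against the expected output, ignoring line wrapping."""
    return parse_fasta(translate_fasta(cds_text)) == parse_fasta(expected_protein_text)
```

With real data, one would read COX1_cds.fasta and COX1_protein.fasta
and pass their contents to matches_expected(). Each Bio* project could
run the same comparison against the output of its own implementation.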

Moreover, a possible approach is to write a script that executes the
same actions described in the use cases.
I have seen people do this to test web applications (Zope). They wrote
scripts using Perl's LWP or Python web browser libraries to perform
all the actions that a user can do in a use case.
However, this is too much work for now; better to leave it for later.

>  Ideally these would all need a short motivational section explaining
>  why you might want to do this particular task.  There is probably a
>  balance between trivial and too complex.
>  These could be compiled on a shared OBF wiki, together with any input
>  files required.  It would be up to the individual projects to write
>  their own sample code to do this task - perhaps hosted on the Bio*
>  project specific wiki pages, but linked to from the use case.

Can we put it somewhere here:
- http://www.open-bio.org/wiki/Main_Page

>  Potentially this would be a huge project, but it would also be a nice
>  resource [provided it was maintained and kept up to date as the
>  toolkits evolve].  Perhaps this is too ambitious?

Maybe it is :).
It will be very difficult to keep up to date with everything, given the
speed at which new technologies come out nowadays.
However, I think that there could be many people interested in
contributing to it. And for many researchers, it could be easier to
contribute a description of what they want to do with their data
than to contribute code.

>  > On the biopython list they told me that a big issue is the license
>  > under which the data is released. I have no problem contributing
>  > examples under the GPL or without a license, but I understand that
>  > other people might.
>  > Somebody told me that there were some interesting discussions on
>  > scipy.org, but I couldn't find them.
> Licensing and copyright are valid concerns.  Also the different Bio*
>  projects use different licenses - and I suspect none of them are
>  compatible with the GPL.  Any licence would have to allow all the Bio*
>  projects to copy the example files into their code with no strings
>  attached - ideally just "public domain" or MIT/BSD style.

I agree with any open license; MIT/BSD should be OK.

>  I would like to see a general collection of real world samples of each
>  file format (these could be pointed to by any shared file format
>  documentation).  Between all the Bio* projects we probably have a good
>  collection already - but the provenance of each file would have to be
>  looked at as well as the licence. In addition, artificial hand edited
>  files could be useful which include valid but unusual content to test
>  the Bio* project's parsers.  I don't think this actually needs to be
>  in a repository, but that would be nice for tracking ownership.

OK. In any case, wikis usually have a versioning system, so there is
not much difference.

>  I think it would be up to the individual projects to pull in any files
>  of interest for use in their own test suites (essentially copying
>  example files into their own repositories).
>  > What I think could be useful for this project is a ticket
>  > tracking system, or rather a feature request system, to keep
>  > track of all the things needed.
> The OBF already runs a bugzilla installation used by most of the Bio*
>  projects, which would probably be OK for this sort of thing.

>  Peter


My blog on bioinformatics (now in English): http://bioinfoblog.it
