Google Summer of Code
The O|B|F is applying for the first time for the Google Summer of Code (GSoC) program as an umbrella organization for all O|B|F-affiliated projects.
On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, other people or channels to contact for more information or to bounce ideas off of, etc.
News
- 08 Mar 2009: The project ideas page (the page you are looking at) is ready for adding project ideas. --Lapp
Contact
Our organization administrators are Hilmar Lapp (hlapp@gmx.net) and Mauricio Herrera Cuadra (mauricio@open-bio.org).
If you are a student interested in applying for a Google Summer of Code project with our organization, please send any questions you have, projects you would like to propose, etc., to the developer mailing list of the pertinent O|B|F project.
How do you know which project is pertinent and the address of its developer mailing list? The projects under the O|B|F umbrella are listed below, with home page and developer mailing lists. Each project idea lists the O|B|F project it is a part of; look it up in the list below and you have the information you need. If you want to propose your own project idea and the project to which you would contribute isn't obvious, send email to open-bio-l@open-bio.org.
Some of us also hang out regularly on IRC; see the list of O|B|F projects below for information on which projects have a channel and the name of the channel. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)
For applying, please make sure you read our documentation on information that students should know and guidelines we expect you to follow before you apply. We don't have a format template for application that you need to adhere to, but we do ask that you include specific kinds of information. What those are is documented under "When you apply."
Ideas
Note: if there is more than one mentor for a project, the primary mentor is in bold font. Biographical and other information on the mentors is linked to in the Mentors section.
Students: The below are only our project ideas, albeit well thought-out ones. You are welcome to propose your own project if none of those below catches your interest, or if your idea is more exciting to you, provided it is still a contribution to one of the O|B|F member projects (see list below). Just be aware that we can't guarantee finding an appropriate mentor, but if we like your proposal we will try. Regardless of what you decide to do, make sure you read and follow the guidelines for students below.
Write a NEXUS parser in C&
This is a template for how the student project ideas could be presented. Feel free to copy & paste & edit, and feel free to adjust the format.
- Rationale
- C& is an amp'ed-up programming language that has not been invented yet but in a few years will dominate the programming world. The best way to prevent broken non-compliant NEXUS parsers written in C& from appearing is to write a good one now.
- Approach
- Re-implementations of NEXUS parsers inevitably tend to be broken or non-compliant. Hence, the best approach is to write a translator that translates a reference implementation to C&.
- Challenges
- C& has not been invented yet, so a lot of assumptions will have to be made.
- Involved toolkits or projects
- The BioC& toolkit has much of the needed framework.
- Degree of difficulty and needed skills
- Hard. The hardest part is probably inventing C&. Writing the parser itself should be medium, unless C& was ill-designed for writing parsers. Knowledge of the BioC& toolkit will obviously help, as well as knowing the NEXUS format.
- Mentors
- Mike&, founder of BioC&
Write a JEE5 webservice interface to BioSQL
- Rationale
BioSQL is an intelligently designed database schema for storing sequence data and associated metadata. It does, however, lack any kind of user API. A sensible way to design an API for a BioSQL-backed database would be to expose the API as webservices. This would allow the API to be language and database agnostic (unlike an API based on database procedures). It would also allow data in BioSQL to be very loosely coupled into bioinformatics workflows. Once an API is in place, one could even adopt modified SQL schemas underneath, as long as the data access API still conforms to some specification.
- Approach
Since the arrival of Java EE5 (and EJB3), developing Enterprise Java Beans that interoperate with databases and webservices has become exceptionally easy. In addition, Java Session Beans can be readily exported as webservices with the addition of simple annotations; often no specific configuration is required. Free and open Java app servers (such as GlassFish) that provide almost all of the management middleware for object relational mapping (ORM) and webservice deployment (and a whole host of other things) are available and relatively simple to use. Finally, the free and open IDE NetBeans has excellent integration with GlassFish and Java EE5 (plus I am most experienced with this IDE, so I can provide more help with its use). For these reasons I would suggest that Java EE5 is the most sensible approach to implementing this project.
During a development meeting in Tokyo in 2008, a preliminary EJB mapping to BioSQL was generated. What remains to be done is the development of a simple, well-documented and well-tested API specification and implementation that bioinformatics developers can use to perform CRUD (Create, Read, Update, Delete) functions on the database, as well as useful search and retrieval operations.
In summary, the project will define and document an API and its expected behaviour, and then implement the webservice interface. A set of unit tests will also be developed, along with a proof-of-concept app that demonstrates use of the API.
- Challenges
- Designing and documenting the API so that it is simple and intuitive
- Making simple queries simple and efficient and complex queries possible.
- Making CRUD operations secure (only people with the right credentials should be able to delete the data).
- Loaders for common file types.
- [Nice to have] Making a test application that will call API methods with predefined arguments. This will let people make alternative implementations of the API while testing they are still compatible with the API. For example someone could make an entire implementation in Perl/ BioPerl and still have it validate against the API.
- Involved toolkits or projects
JavaEE5, BioSQL, parts of BioJava would be useful to steal for parsing.
- Degree of difficulty and needed skills
Medium to Hard. While the use of Java EE5 is now quite easy (especially with IDEs like NetBeans), there are quite a lot of concepts involved in the project (webservices, ORM, EJBs, etc.). The hard part would be getting up to speed with those concepts. If you already know a lot of this, then the project would only be of medium difficulty. At minimum the student should be confident with Java and at least aware of some of the technologies. This is not the right project for a very new programmer.
- Mentors
- Mark Schreiber (and anyone else who wants to help)
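The "nice to have" compatibility test from the Challenges list could look roughly like the following. This is only a minimal Python sketch for illustration, not part of BioSQL or of any agreed specification: the names SequenceStore, InMemoryStore and check_conformance, the method signatures, and the token-based credential check are all hypothetical. The idea is that one fixed script calls each API method with predefined arguments, so any implementation (Java EE5, Perl/BioPerl, or other) can be validated against the same expectations.

```python
from abc import ABC, abstractmethod


class SequenceStore(ABC):
    """Hypothetical CRUD contract; not an actual BioSQL API."""

    @abstractmethod
    def create(self, accession, sequence): ...

    @abstractmethod
    def read(self, accession): ...

    @abstractmethod
    def update(self, accession, sequence): ...

    @abstractmethod
    def delete(self, accession, credentials): ...


class InMemoryStore(SequenceStore):
    """Toy implementation standing in for a real BioSQL-backed service."""

    def __init__(self, admin_token):
        self._records = {}
        self._admin_token = admin_token

    def create(self, accession, sequence):
        if accession in self._records:
            raise KeyError(f"{accession} already exists")
        self._records[accession] = sequence

    def read(self, accession):
        return self._records[accession]

    def update(self, accession, sequence):
        if accession not in self._records:
            raise KeyError(accession)
        self._records[accession] = sequence

    def delete(self, accession, credentials):
        # Only callers with the right credentials may delete data.
        if credentials != self._admin_token:
            raise PermissionError("delete requires admin credentials")
        del self._records[accession]


def check_conformance(store, admin_token):
    """Call each API method with predefined arguments; return failed checks."""
    failures = []
    store.create("X00001", "ACGT")
    if store.read("X00001") != "ACGT":
        failures.append("read after create")
    store.update("X00001", "ACGTT")
    if store.read("X00001") != "ACGTT":
        failures.append("read after update")
    try:
        store.delete("X00001", credentials="wrong-token")
        failures.append("delete accepted bad credentials")
    except PermissionError:
        pass
    store.delete("X00001", credentials=admin_token)
    try:
        store.read("X00001")
        failures.append("read after delete")
    except KeyError:
        pass
    return failures
```

Running check_conformance(InMemoryStore("secret"), "secret") returns an empty list when the implementation behaves as the contract expects. A real test application would target the webservice endpoints rather than an in-process object.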
Mapping the NCBI toolkit to BioPerl, BioRuby, BioConductor and BioJAVA using BioLib
- Rationale
The National Center for Biotechnology Information (NCBI) has created a large collection of utilities developed for the production and distribution of GenBank, Entrez, BLAST, and related services. To support these utilities a large set of C and C++ libraries are maintained and regularly improved by NCBI. These include, for example, sequence alignment algorithms, antigenic determinant prediction, CPG-island finder, ORF finder and string matchers. This functionality is ultimately of great interest to all scientists working in molecular biology with application in biology and biomedical research.
Unfortunately, few bioinformaticians work with C/C++. To address this, NCBI has made a binding available for Python. This is not enough, as bioinformaticians work in many different programming languages; to be fully effective, support should be made available at least for Perl, R and JAVA. These three together probably represent over 90% of bioinformaticians. The BioLib project successfully provides the 'mapping' infrastructure to map complex libraries against many computer languages using SWIG. Basically, one mapping suffices to support all popular languages.
- Approach
Special interfaces need to be developed to map the NCBI toolkit libraries against Perl initially. The (outdated) NCBI Python mapping can be used as an initial guide for mapping functionality. Once mapped against Perl, mapping against Ruby and Python is trivial. However, at this point BioLib support for R and JAVA needs to be developed. A proof-of-concept can be part of this project. Finally, SWIG mappings can be used to create automated documentation and testing of BioLib code.
- Challenges
The main challenge is to provide nice and consistent interfaces in high-level languages against the NCBI C/C++ toolkit library. This requires OOP design and unit testing of existing functionality. Also some SWIG hacking may be involved to provide decent mappings for R and JAVA, as well as SWIG auto generated documentation and testing.
- Involved toolkits or projects
BioLib, BioPerl, SWIG (and optionally BioRuby, R/Bioconductor, BioJAVA or BioPython)
- Degree of difficulty and needed skills
This is a challenging project as it crosses computer languages. It requires experience in C++ and a wish for deeper understanding of at least one high-level OOP language like Perl (did I write OOP?), Python, JAVA, R or Ruby.
- Mentors
Pjotr Prins, Chris Fields
BioSQL web interface and API on Google App Engine
- Rationale
The BioSQL project provides a robust and well supported database schema for storing sequence data and associated annotations and features. It does not have a standard web interface or web facing API, both of which would provide improved access to scientific data. Deployment of BioSQL currently requires knowledge and administration of relational databases, which can hinder its use in smaller research laboratories that do not have public servers or experienced systems administrators.
This proposal seeks to bridge this gap by providing a rapidly deployable cloud based solution utilizing the established BioSQL backend. This system will allow scientists to share results in a standard format both early on during research and at the time of publication. By deploying on stable architectures, long term data access is ensured and not dependent on maintenance of local servers. Data archival for replication and expansion of ideas is an important part of the scientific process; this [http://www.portfolio.com/views/blogs/market-movers/2009/02/18/when-academic-papers-arent-replicable?tid=true recent blog review] summarizes some of the problems associated with primary data access.
- Approach
Google App Engine provides a full development stack for rapidly building and deploying web applications. The platform provides free quotas which allow a small lab with a limited budget to make their data available, and also scales for larger projects with popular data sets.
The student project expands an initial demonstration server (XXX Need to finish/ get this online; URL here) to a full-featured web application. The server-side implementation will be programmed in Python, utilizing the Google App Engine developers toolkit supplemented with the Biopython libraries. The client web interface will be designed using HTML, CSS and JavaScript; the interface will utilize a full-featured JavaScript library, such as jQuery and jQueryUI or ExtJS. Client-to-server communication occurs using AJAX techniques with JSON for data exchange.
In addition to the web interface, the server will also provide a programming interface using a [http://en.wikipedia.org/wiki/Representational_State_Transfer REST] API. This involves coordination with other proposed projects, including the proposed JEE5 Java webservice, to design a common interface.
- Challenges
- Familiarizing student with Python, Javascript and AJAX, as well as the Google App Engine environment.
- Initial implementation of BioSQL server interface with useful features.
- Coordinating input from users on the BioSQL mailing list. The student will need to solicit desired features from users and prioritize based on implementation time and importance. See [http://lists.open-bio.org/pipermail/biosql-l/2009-January/001464.html this mailing list discussion] for an example of interest and initial ideas.
- Designing the web interface for intuitive use.
- Coordinating API development with other projects.
- Involved toolkits or projects
- Degree of difficulty and needed skills
Medium to Hard. This requires a familiarity with current web frameworks and utilizes a number of existing libraries to allow the student to jump right into the development process. This requires the interested student to be comfortable with quickly learning outside libraries. Beyond programming, the project will also involve creative thinking about interface and usability design.
- Mentors
Brad Chapman (plus...)
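To make the REST-plus-JSON idea from the Approach more concrete, here is a minimal Python sketch of how requests might be mapped to JSON responses. It uses only the standard library and is not the demonstration server. The resource path /bioentries, the record fields, the RECORDS dictionary and the handle_request function are all assumptions made for illustration; the actual interface would be designed in coordination with the other proposed projects.

```python
import json

# Toy data standing in for rows fetched from a BioSQL database (hypothetical).
RECORDS = {
    "X00001": {
        "accession": "X00001",
        "description": "example entry",
        "seq": "ACGT",
    },
}


def handle_request(method, path):
    """Return (status_code, json_body) for a request to the hypothetical API."""
    if method != "GET":
        # This sketch is read-only; other verbs are rejected.
        return 405, json.dumps({"error": "method not allowed"})
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "bioentries":
        return 404, json.dumps({"error": "not found"})
    if len(parts) == 1:
        # Collection resource: list the known accessions.
        return 200, json.dumps({"accessions": sorted(RECORDS)})
    if len(parts) == 2 and parts[1] in RECORDS:
        # Item resource: one record as a JSON object.
        return 200, json.dumps(RECORDS[parts[1]])
    return 404, json.dumps({"error": "not found"})
```

A browser client using AJAX would issue GET /bioentries/X00001 and parse the returned JSON object. On App Engine the same dispatch logic would sit behind the platform's request handlers rather than a plain function.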
Mentors
- Brad Chapman (MGH; Biopython)
- Mauricio Herrera Cuadra (Yahoo! Inc.; backup org admin)
- Chris Fields (U. Illinois, Chicago; BioPerl)
- Mark Jensen (Fortinbras; BioPerl)
- Roger Hall (U. of Arkansas; BioPerl)
- Hilmar Lapp (NESCent; org admin)
- Pjotr Prins (BioLib)
- Mark Schreiber (Novartis Institute for Tropical Diseases, Singapore; BioJava)
- Joshua Udall (BioPerl)
- Jonathan Warren (Sanger Institute, UK; Biojava)
- Scooter Willis (Scripps Florida; Biojava)
- Christian Zmasek (Burnham Institute for Medical Research; BioRuby)
What should prospective students know?
Before you apply
- If you want to apply with your own idea, determine which O|B|F project you would be contributing to, and contact us early on so we can try to find a mentor.
- The proposals we will entertain are those that extend one of our affiliated toolkits. Project proposals that would create a new stand-alone piece of code are outside of our scope.
- We are most interested in students who give us evidence that they have already or might develop a sustained interest in becoming future contributors to one (or more) of our projects.
- Ask us questions about the project idea you have in mind.
- Write a project proposal draft, include a project plan (see below), and bounce those off of us.
Have I mentioned yet that you should be in touch with us before you apply? The value of frequent and early communication in contributing to a distributed and collaboratively developed project can hardly be overemphasized. The same is true for becoming part of a community, even if only temporarily.
When you apply
When applying, (aside from the information requested by Google) please provide the following in your application material.
- Why you are interested in the project you are proposing, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
- Why are you interested in contributing to the O|B|F project that your work would be (or become) a part of? To what extent and in which ways do you anticipate staying involved with the project?
- A summary of your programming experience and skills.
- Programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement.
- A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
- A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put in another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
- Do not take this part lightly. A compelling plan takes a significant amount of work. Empirically, applications with no or a hastily composed project plan have not been competitive, and a more thorough project plan can easily make an applicant outcompete another with more advanced skills.
- A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
- We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
- We strongly recommend that you bounce your proposed project and your project plan draft off of us, using either the pertinent developers mailing list or the IRC channel(s). Through the project plan exercise you will inevitably discover that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
- Your possibly conflicting obligations or plans for the summer during the coding period.
- Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. If you look at your Summer of Code project as a part-time occupation, please don't apply to us.
- That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
- Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
- One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your Summer of Code project.
Other information
- Our 2009 application document with Google's questions and our answers
- For questions of eligibility, see the GSoC eligibility requirements for students. These requirements must be met on April 20, 2009.
- There is also a Google group for posting GSoC questions (and receiving answers; note that you will need to sign up for the group) that relate to the program itself (and are not specific to our organization).
- Students receive a stipend from Google if accepted. See the Google SoC FAQ on payments for full documentation.
Reference Facts & Links
Open-Bio projects involved
- BioPerl
- Information for new developers
- Mailing lists
- IRC: #bioperl on Freenode
Google Summer of Code 2009
- Mentoring organizations apply between March 9-13, 2009. Accepted mentoring organizations will be published March 18. See full set of timelines.
- Google expects to accept around 150 mentoring organizations, a few fewer than in 2008 (when they accepted 175). If the trend over the past years is any indication, at least three times as many organizations will apply.
- Students apply between March 23-April 3, 2009. The eligibility requirements for students are in the GSoC FAQ.
- Development occurs online; there is no requirement or expectation to travel, for either students or mentors.