Clever tricks with NCBI Entrez EInfo (& Biopython)


Constructing complicated NCBI Entrez searches can be tricky, but it turns out one of the Entrez Programming Utilities called Entrez EInfo can help.

For example, suppose you want to search for mitochondrial genomes from a given taxon - either just in the Entrez web interface, or for use in a script with EFetch.

I knew from past experience about using name[ORGN] in Entrez to search for an organism name - but how would you specify just mitochondria? I actually worked this out from the NCBI help and by exploring the Entrez website’s advanced search - but it took a while.

There is an easier way to find out the search fields available in Entrez! Just recently I came across an interesting blog post from Neil Saunders (written a couple of weeks ago) showing how Entrez EInfo provides information about the search fields in XML format, and how you can use Ruby to process this.

Biopython can do this too of course - using Bio.Entrez this took just a few lines of Python:

```
>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.einfo(db="genome"))
>>> for field in data["DbInfo"]["FieldList"]:
...     print("%(Name)s, %(FullName)s, %(Description)s" % field)
...
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, Substance Name, CAS chemical name or MEDLINE Substance Name
PROP, Properties, Classification by source qualifiers and molecule type
SQID, SeqID String, String identifier for sequence
GPRJ, Genome Project, Genome Project
SLEN, Sequence Length, Length of sequence
FKEY, Feature key, Feature annotated on sequence
RTYP, Replicon type, Replicon type
RNAM, Replicon name, Replicon name
ORGL, Organelle, Organelle
```

That gives us a list of all the fields we can currently search on in the Genome database (and you could use the same code for any of the other NCBI databases in Entrez - they probably all have different searchable fields). Very handy! The ORGN and ORGL fields are discussed below.
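To make the "any other database" point concrete, here is a minimal sketch that wraps the same EInfo call in a function. The names format_fields and list_search_fields, and the placeholder email address, are my own for illustration and not part of Biopython; only Entrez.einfo, Entrez.read and Entrez.email come from Bio.Entrez.

```python
def format_fields(einfo_record):
    """Return one 'Name, FullName, Description' string per search field.

    Expects the parsed EInfo result for a single database, i.e. a mapping
    with a "DbInfo" entry holding a "FieldList" (as in the session above).
    """
    return [
        "%(Name)s, %(FullName)s, %(Description)s" % field
        for field in einfo_record["DbInfo"]["FieldList"]
    ]


def list_search_fields(db, email="you@example.org"):
    """Ask EInfo about one Entrez database and format its search fields.

    The email default is a placeholder (an assumption); NCBI asks you to
    supply your own address. This function needs network access.
    """
    # Imported here so format_fields can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.einfo(db=db)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return format_fields(record)


if __name__ == "__main__":
    # Example: show the fields for a different database.
    for line in list_search_fields("taxonomy"):
        print(line)
```

Swapping "taxonomy" for another database name should be all that is needed to compare the available fields.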

So for my particular search, using “ORGL” to filter on organelle looks sensible, and after a bit of trial and error on the website I ended up with mitochondrion[ORGL] as a useful filter (not mitochondrial, or mitochondria).
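The trial and error could probably be scripted rather than done by hand on the website. The following is only a sketch: entrez_hit_count and compare_terms are hypothetical helper names of mine, the email address is a placeholder, and it assumes ESearch reports a "Count" for the genome database as it does for other Entrez databases.

```python
def entrez_hit_count(term, db="genome", email="you@example.org"):
    """Return how many records an Entrez search term matches.

    Needs network access. The email default is a placeholder (assumption).
    """
    # Imported here so compare_terms can be tried without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    # retmax=0 asks for the hit count only, without any record IDs.
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return int(record["Count"])


def compare_terms(words, field="ORGL", count_hits=entrez_hit_count):
    """Map each 'word[FIELD]' search term to its hit count."""
    results = {}
    for word in words:
        term = "%s[%s]" % (word, field)
        results[term] = count_hits(term)
    return results


if __name__ == "__main__":
    candidates = ["mitochondrion", "mitochondrial", "mitochondria"]
    for term, count in compare_terms(candidates).items():
        print("%s gives %i hits" % (term, count))
```

Passing the counting function in as an argument is just a convenience, so the term-building logic can be checked without hitting the NCBI servers.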

I already knew about using “ORGN” to filter on the organism, either by species name or with a suitably formatted NCBI taxon ID (which you can get by searching or browsing the Entrez taxonomy database), e.g. txid9443[ORGN] gives primates.

Putting these together, to get all the primate mitochondria in the Entrez genome database you could use:

txid9443[ORGN] AND mitochondrion[ORGL]

Note that you have to use “AND” in upper case.
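For use in a script, the combined search might look something like this sketch. The helper names build_query and search_genome_ids are made up for illustration, the email address is a placeholder, and it assumes the genome database still exposes the ORGN and ORGL fields shown by EInfo.

```python
def build_query(taxon_id, organelle):
    """Combine a taxon ID filter and an organelle filter with upper case AND."""
    return "txid%d[ORGN] AND %s[ORGL]" % (taxon_id, organelle)


def search_genome_ids(query, email="you@example.org", retmax=20):
    """Run the query against the Entrez genome database and return record IDs.

    Needs network access. The email default is a placeholder (assumption),
    and retmax=20 is an arbitrary cap on how many IDs come back.
    """
    # Imported here so build_query can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="genome", term=query, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


if __name__ == "__main__":
    query = build_query(9443, "mitochondrion")  # 9443 is the primates taxon ID
    print(query)
    print(search_genome_ids(query))
```

The returned IDs could then be handed on to EFetch to download the records themselves.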

I think we’ll have to add something along these lines to the Biopython Tutorial and Cookbook (PDF)… Update: That’s done now and will be included with our next release :)

Entrez rocks! (although their documentation could use a few more examples).

Peter