Clever tricks with NCBI Entrez EInfo (& Biopython)
Constructing complicated NCBI Entrez searches can be tricky, but it turns out one of the Entrez Programming Utilities called Entrez EInfo can help.
For example, suppose you want to search for mitochondrial genomes from a given taxon, either just in the Entrez web interface or for use in a script with EFetch.
I knew from past experience about using name[ORGN] in Entrez to search for an organism name, but how would you specify just mitochondria? I actually worked this out from the NCBI help and exploring the Entrez website’s advanced search, but it took a while.
There is an easier way to find out the search fields available in Entrez! Just recently I came across an interesting blog post from Neil Saunders (written a couple of weeks ago) showing how Entrez EInfo provides information about the search fields in XML format, and how you can use Ruby to process this.
Biopython can do this too of course – using Bio.Entrez this took just a few lines of Python:
>>> from Bio import Entrez
>>> Entrez.email = "your.name@example.com"  # placeholder; NCBI asks for a contact address
>>> data = Entrez.read(Entrez.einfo(db="genome"))
>>> for field in data["DbInfo"]["FieldList"]:
...     print("%(Name)s, %(FullName)s, %(Description)s" % field)
...
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, Substance Name, CAS chemical name or MEDLINE Substance Name
PROP, Properties, Classification by source qualifiers and molecule type
SQID, SeqID String, String identifier for sequence
GPRJ, Genome Project, Genome Project
SLEN, Sequence Length, Length of sequence
FKEY, Feature key, Feature annotated on sequence
RTYP, Replicon type, Replicon type
RNAM, Replicon name, Replicon name
ORGL, Organelle, Organelle
That gives us a list of all the fields we can currently search on in the Genome database (and you could use the same code for any of the other NCBI databases in Entrez, which probably each have their own set of searchable fields). Very handy! The two fields discussed below are ORGN and ORGL.
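With a longer field list it can be easier to filter by keyword than to read the whole thing. Here is a minimal sketch, assuming the field records behave like plain dictionaries with the Name, FullName and Description keys used above. The helper name find_fields is made up for illustration, and the sample records are copied from the listing so the snippet runs without a network connection:

```python
def find_fields(field_list, keyword):
    """Return (Name, FullName) pairs for fields that mention keyword.

    field_list is assumed to be data["DbInfo"]["FieldList"] from
    Entrez.read(Entrez.einfo(db=...)), or any list of dicts with the
    same "Name", "FullName" and "Description" keys.
    """
    keyword = keyword.lower()
    matches = []
    for field in field_list:
        # Search the full name and the description, ignoring case.
        text = "%s %s" % (field["FullName"], field["Description"])
        if keyword in text.lower():
            matches.append((str(field["Name"]), str(field["FullName"])))
    return matches


# Stand-in records copied from the listing above (not a live query).
sample = [
    {"Name": "ORGN", "FullName": "Organism",
     "Description": "Scientific and common names of organism, "
                    "and all higher levels of taxonomy"},
    {"Name": "ORGL", "FullName": "Organelle", "Description": "Organelle"},
    {"Name": "SLEN", "FullName": "Sequence Length",
     "Description": "Length of sequence"},
]

print(find_fields(sample, "organelle"))  # [('ORGL', 'Organelle')]
```

In a live session you would pass data["DbInfo"]["FieldList"] instead of sample.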
So for my particular search, using “ORGL” to filter on organelle looks sensible, and after a bit of trial and error on the website I ended up with mitochondrion[ORGL] as a useful filter (not mitochondrial, or mitochondria).
I already knew about using “ORGN” to filter on the organism, either by species name or with a suitably formatted NCBI taxon ID (which you can get by searching or browsing the Entrez taxonomy database), e.g. txid9443[ORGN] gives primates.
Putting these together, to get all the primate mitochondria in the Entrez genome database you could use:
txid9443[ORGN] AND mitochondrion[ORGL]
Note that you have to use “AND” in upper case.
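For use in a script, the same search string can be built and handed to ESearch via Bio.Entrez. This is only a sketch: the function names build_term and search_genomes are made up for illustration, the email address is a placeholder, and it assumes the genome database still accepts the ORGN and ORGL fields listed above:

```python
def build_term(taxon_id, organelle="mitochondrion"):
    """Combine a taxon filter and an organelle filter with upper-case AND."""
    return "txid%i[ORGN] AND %s[ORGL]" % (int(taxon_id), organelle)


def search_genomes(term, email, retmax=20):
    """Run the search with ESearch and return (total count, list of IDs).

    Needs Biopython and network access; the import lives here so that
    build_term stays usable without either.
    """
    from Bio import Entrez
    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="genome", term=term, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"]), list(record["IdList"])


print(build_term(9443))  # txid9443[ORGN] AND mitochondrion[ORGL]
```

Calling search_genomes(build_term(9443), "your.name@example.com") would then give the hit count and the first batch of record IDs, which could be passed on to EFetch.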
I think we’ll have to add something along these lines to the Biopython Tutorial and Cookbook (PDF)… Update: That’s done now and will be included with our next release 🙂
Entrez rocks! (although their documentation could use a few more examples).
Peter