"EDirect for PubMed: Part 5: Developing and Building Scripts" Sample Code
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
Below you will find sample code for the examples, in-class exercises and homework presented in the fifth session of the “EDirect for PubMed” Insider’s Guide class. These examples are written for use with EDirect in a Unix environment. If you need help installing and setting up EDirect, please see our “Installing EDirect” page.
For more examples, please see the sample code from the other parts of “EDirect for PubMed”:
- Part 1: Getting PubMed Data
- Part 2: Extracting Data from XML
- Part 3: Formatting Results and Unix Tools
- Part 4: xtract Conditional Arguments
The code below is lightly annotated to explain how it works, but if you are looking for more information, we suggest you review our EDirect documentation.
There are many different ways to answer the questions discussed in class. The sample code below provides some options, but by no means the only options. Feel free to modify, adapt, edit, re-use or completely discard any of the suggestions below when trying to find a solution that works best for you.
Case study
Retrieve a list of articles about breast cancer, published between March 1, 2017 and February 28, 2018, that include clinical trial information from ClinicalTrials.gov. Include the PMID, journal title abbreviation, first author’s last name and initials, and ClinicalTrials.gov NCT number(s) for each record. Save the entire output to a text file.
Solution
esearch -db pubmed -query "breast cancer AND clinicaltrials.gov[si]" \
-datetype PDAT -mindate 2017/03/01 -maxdate 2018/02/28 | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation \
-block Author -position first -sep " " -element LastName,Initials \
-block DataBank -if DataBankName -equals ClinicalTrials.gov \
-sep "|" -element AccessionNumber > clinicaltrials.txt
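Since this session is about building scripts, one possible next step is to save the solution in a script file, with the query, dates and output file name held in variables so they are easy to change. The following is a minimal sketch rather than part of the class materials: the variable names are illustrative assumptions, and the pipeline only runs if EDirect is found on the PATH.

```shell
#!/bin/sh
# Illustrative variable names (assumptions, not from the class materials).
query="breast cancer AND clinicaltrials.gov[si]"
mindate="2017/03/01"
maxdate="2018/02/28"
outfile="clinicaltrials.txt"

if command -v esearch >/dev/null 2>&1; then
  # Same pipeline as the solution above, with the literal values replaced by variables.
  esearch -db pubmed -query "$query" \
    -datetype PDAT -mindate "$mindate" -maxdate "$maxdate" | \
  efetch -format xml | \
  xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation \
    -block Author -position first -sep " " -element LastName,Initials \
    -block DataBank -if DataBankName -equals ClinicalTrials.gov \
    -sep "|" -element AccessionNumber > "$outfile"
else
  echo "EDirect (esearch) not found on PATH; nothing was retrieved" >&2
fi
```

Quoting each variable ("$query", "$mindate" and so on) keeps a query that contains spaces or brackets intact when it is passed to esearch.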
Discussion
esearch -db pubmed -query "breast cancer AND clinicaltrials.gov[si]" \
The first line of this command uses esearch to search PubMed (-db pubmed) for our search query (-query "breast cancer AND clinicaltrials.gov[si]"). The “clinicaltrials.gov[si]” portion of the query ensures that only records with ClinicalTrials.gov NCT numbers are included in our results. The “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.
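The line-continuation behavior can be tried without EDirect. The sketch below uses printf with placeholder words (not a real search) to show that a command split with “\” behaves the same as the one-line version. Note that the “\” must be the very last character on the line, with no trailing space after it.

```shell
# The same command written on one line...
one_line=$(printf '%s %s %s\n' breast cancer trials)

# ...and split across three lines with trailing backslashes.
continued=$(printf '%s %s %s\n' \
  breast cancer \
  trials)

echo "$one_line"
echo "$continued"
```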
-datetype PDAT -mindate 2017/03/01 -maxdate 2018/02/28 | \
The second line restricts the search results to articles that were published between March 1, 2017 and February 28, 2018 (-datetype PDAT -mindate 2017/03/01 -maxdate 2018/02/28). The “|” character pipes the results of our esearch into our next command.
efetch -format xml | \
The third line takes the esearch results from our first two lines and uses efetch to retrieve the full records for each of our results in XML format (-format xml), and pipes the XML output to the next line.
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation \
Beginning on the fourth line, the xtract command creates a table with four columns, in which each PubMed record in our input populates its own row (xtract -pattern PubmedArticle). The first column of each row will contain the record’s PMID. We use Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record (e.g. as a child of a <CommentsCorrections> element). The second column will contain the article’s journal title abbreviation (-element MedlineCitation/PMID ISOAbbreviation).
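To see why the Parent/Child construction matters, it can help to run xtract against a small local file. The sketch below writes a made-up, heavily trimmed record (the PMIDs and journal abbreviation are placeholders, and the file name sample_pmid.xml is hypothetical) and then runs xtract with and without the Parent/Child form. The xtract commands only run if EDirect is installed.

```shell
# Made-up record: the second PMID sits inside a CommentsCorrections element.
cat > sample_pmid.xml <<'EOF'
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation>
<PMID Version="1">11111111</PMID>
<Article>
<Journal><ISOAbbreviation>J Example Med</ISOAbbreviation></Journal>
</Article>
<CommentsCorrectionsList>
<CommentsCorrections RefType="CommentIn">
<PMID Version="1">22222222</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
EOF

if command -v xtract >/dev/null 2>&1; then
  # Bare element name: picks up every PMID element in the record.
  xtract -pattern PubmedArticle -element PMID < sample_pmid.xml
  # Parent/Child form: only the PMID that is a direct child of MedlineCitation.
  xtract -pattern PubmedArticle -element MedlineCitation/PMID < sample_pmid.xml
else
  echo "xtract not found; install EDirect to run this comparison" >&2
fi
```

Comparing the two output lines should show the extra PMID from the <CommentsCorrections> element appearing only in the first.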
-block Author -position first -sep " " -element LastName,Initials \
In the fifth line, xtract looks through each PubMed record for an <Author> element (-block Author). When it finds the first <Author> (-position first), it populates the third column in the row with the first author’s last name and initials, separated by a space (-sep " " -element LastName,Initials).
-block DataBank -if DataBankName -equals ClinicalTrials.gov \
In the sixth line, xtract looks through each PubMed record for <DataBank> elements which have a child <DataBankName> element that equals “ClinicalTrials.gov” (-block DataBank -if DataBankName -equals ClinicalTrials.gov). This will ensure that only ClinicalTrials.gov data is included, while data from non-ClinicalTrials.gov <DataBank> elements is excluded.
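The effect of the -if condition can also be checked on a small local file. The sketch below is an illustration under stated assumptions: the record is made up and trimmed, the second data bank name (“ExampleBank”) and all accession numbers are placeholders, and the file name sample_databank.xml is hypothetical. The xtract commands only run if EDirect is installed.

```shell
# Made-up record with two DataBank elements, only one from ClinicalTrials.gov.
cat > sample_databank.xml <<'EOF'
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation>
<PMID Version="1">11111111</PMID>
<Article>
<DataBankList CompleteYN="Y">
<DataBank>
<DataBankName>ClinicalTrials.gov</DataBankName>
<AccessionNumberList>
<AccessionNumber>NCT00000001</AccessionNumber>
</AccessionNumberList>
</DataBank>
<DataBank>
<DataBankName>ExampleBank</DataBankName>
<AccessionNumberList>
<AccessionNumber>AB000001</AccessionNumber>
</AccessionNumberList>
</DataBank>
</DataBankList>
</Article>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
EOF

if command -v xtract >/dev/null 2>&1; then
  # Without a condition: accession numbers from every DataBank block.
  xtract -pattern PubmedArticle -element MedlineCitation/PMID \
    -block DataBank -element AccessionNumber < sample_databank.xml
  # With the condition: only blocks whose DataBankName equals ClinicalTrials.gov.
  xtract -pattern PubmedArticle -element MedlineCitation/PMID \
    -block DataBank -if DataBankName -equals ClinicalTrials.gov \
    -element AccessionNumber < sample_databank.xml
else
  echo "xtract not found; install EDirect to run this comparison" >&2
fi
```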
-sep "|" -element AccessionNumber > clinicaltrials.txt
In the seventh line, xtract specifies that the fourth column should be populated with the <AccessionNumber> (i.e. NCT number) from the included <DataBank> elements (-element AccessionNumber). If a record has multiple NCT numbers attached, they will be separated by pipes (-sep "|").
Finally, the results of the script are saved to a file (> clinicaltrials.txt).
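Once the output is saved, standard Unix tools can summarize it, because xtract separates columns with tabs by default. The sketch below works on a small stand-in file rather than real results: every value in sample_trials.txt is a made-up placeholder in the same four-column layout, and both file names are hypothetical. To use it on real output, substitute clinicaltrials.txt for the sample file.

```shell
# Placeholder rows: PMID, journal abbreviation, first author, NCT number(s).
printf '11111111\tJ Example Med\tDoe J\tNCT00000001\n' > sample_trials.txt
printf '22222222\tExample Oncol\tRoe AB\tNCT00000002|NCT00000003\n' >> sample_trials.txt
printf '33333333\tJ Example Med\tPoe C\tNCT00000004\n' >> sample_trials.txt

# Count records per journal (column 2), most frequent first.
cut -f2 sample_trials.txt | sort | uniq -c | sort -rn

# List the PMIDs of records with more than one NCT number,
# i.e. rows whose fourth column contains a pipe.
awk -F'\t' 'index($4, "|") > 0 { print $1 }' sample_trials.txt > multi_nct.txt
cat multi_nct.txt
```

Column positions here assume that every record supplies all four values. If a record lacks one of them (for example, an article with no listed author), the later columns in that row may shift left, so it is worth spot-checking the real file before relying on column numbers.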