"EDirect for PubMed: Part 2: Extracting Data from XML" Sample Code

Below you will find sample code for the examples, in-class exercises and homework presented in the second session of the “EDirect for PubMed” Insider’s Guide class. These examples are written for use with EDirect in a Unix environment. If you need help installing and setting up EDirect, please see our “Installing EDirect” page.

For more examples, please see the sample code from the other parts of “EDirect for PubMed”.

The code below is lightly annotated to explain how it works, but if you are looking for more information, we suggest you review our EDirect documentation.

There are many different ways to answer the questions discussed in class. The sample code below provides some options, but by no means the only options. Feel free to modify, adapt, edit, re-use or completely discard any of the suggestions below when trying to find a solution that works best for you.

xtract Basics

For an introduction to the xtract command, see the xtract section of our EDirect documentation.

Retrieve the article titles for a list of PubMed records

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element ArticleTitle

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).
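The pipe and the trailing backslash work the same way for any Unix commands, not only EDirect. The following is a minimal sketch using standard tools and three placeholder lines of input (no PubMed data involved), showing that a pipeline written on one line and the same pipeline split with “\” give the same result:

```shell
# A backslash at the very end of a line continues the command on the next line.
# The two pipelines below are identical; only the layout differs.
one_line=$(printf 'b\na\nc\n' | sort | head -n 1)

two_lines=$(printf 'b\na\nc\n' | \
sort | head -n 1)

echo "$one_line"
echo "$two_lines"
```

Note that nothing, not even a space, may follow the “\” on its line; otherwise the shell does not treat it as a line continuation.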

The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row in our output table for every PubMed record (-pattern PubmedArticle). The -element argument indicates that the table should include a single column, containing the article title for the given record (-element ArticleTitle).

Retrieve the PMID and year of publication for a PubMed record

In order to retrieve the PMID and the year of publication for a PubMed record, we might try to use code such as the following:

efetch -db pubmed -id 27101380 -format xml | \
xtract -pattern PubmedArticle -element PMID Year

The first line of this code uses the efetch command to retrieve a record from PubMed (-db pubmed -id 27101380) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle; in this case, the table will only have a single row). The line then uses the -element argument to create two columns, one for PMID and one for Year (-element PMID Year). However, the output of this series of commands is not what we expect:

27101380        27619336        27619799        27746956        27747057        2016    2016    2016    2016    2015    2016    2016    2016    2016

Rather than getting a single PMID and a single year, we get 5 PMIDs and 9 Years. This is because, while the -element argument is designed to create a new column for each element or attribute specified, it populates each column with the contents of every occurrence of the specified element or attribute in the -pattern. This means that if there are multiple occurrences of the <PMID> or <Year> elements in a PubMed record, the contents of all occurrences will be displayed. As a result, we see not only the PMID for the record, but also the PMIDs used to link it to other records which contain related comments or corrections. Furthermore, in addition to the publication year, we also see the years for the other eight dates associated with the PubMed record.
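One way to anticipate this behavior is to count how many times an element occurs in a record before extracting it. The sketch below does this with grep on a hand-written, heavily abbreviated record. The PMIDs 111 and 222, the years, and the trimmed-down structure are placeholders for illustration, not a real PubMed record:

```shell
# Hypothetical, heavily abbreviated record; real PubMed XML has many more elements.
sample_xml=$(cat <<'EOF'
<PubmedArticle>
  <MedlineCitation>
    <PMID>111</PMID>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2016</Year></PubDate></JournalIssue></Journal>
    </Article>
    <CommentsCorrectionsList>
      <CommentsCorrections><PMID>222</PMID></CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
  <PubmedData>
    <History>
      <PubMedPubDate PubStatus="received"><Year>2015</Year></PubMedPubDate>
      <PubMedPubDate PubStatus="accepted"><Year>2016</Year></PubMedPubDate>
    </History>
  </PubmedData>
</PubmedArticle>
EOF
)

# grep -o prints each match on its own line, so wc -l counts occurrences.
# The pattern '<PMID[ >]' also allows for an attribute after the tag name.
pmid_count=$(printf '%s\n' "$sample_xml" | grep -o '<PMID[ >]' | wc -l | tr -d ' ')
year_count=$(printf '%s\n' "$sample_xml" | grep -o '<Year>' | wc -l | tr -d ' ')

echo "PMID elements: $pmid_count"
echo "Year elements: $year_count"
```

Any count above one is a hint that -element with the bare element name will return more than one value per row.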

We can avoid this by using Parent/Child construction to specify that we only want the contents of the <PMID> element that is a direct child of the <MedlineCitation> element, and that we only want the <Year> element that is a child of the <PubDate> element:

efetch -db pubmed -id 27101380 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID PubDate/Year

This version of the code gives us the output we expect:

27101380        2016
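To see conceptually what the Parent/Child restriction buys us, the sketch below mimics it with sed on a small hand-written fragment (placeholder values, not a real record). This is only an illustration of the idea using standard tools; it is not how xtract itself is implemented, and xtract remains the right tool for real PubMed XML:

```shell
# Hypothetical fragment with two <Year> elements under different parents.
sample_xml=$(cat <<'EOF'
<PubmedArticle>
  <PubDate>
    <Year>2016</Year>
  </PubDate>
  <PubMedPubDate PubStatus="received">
    <Year>2015</Year>
  </PubMedPubDate>
</PubmedArticle>
EOF
)

# Unrestricted: every <Year> in the record, joined onto one line.
all_years=$(printf '%s\n' "$sample_xml" \
  | sed -n 's:.*<Year>\(.*\)</Year>.*:\1:p' \
  | paste -s -d ' ' -)

# Restricted: first keep only the lines between <PubDate> and </PubDate>,
# then pull out <Year> from what remains.
pub_year=$(printf '%s\n' "$sample_xml" \
  | sed -n '/<PubDate>/,/<\/PubDate>/p' \
  | sed -n 's:.*<Year>\(.*\)</Year>.*:\1:p')

echo "All years: $all_years"
echo "PubDate year: $pub_year"
```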

Retrieve three data elements for a list of PubMed records

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation ArticleTitle

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle), and with three columns: one for PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one for the journal title abbreviation, and one for the article title (-element MedlineCitation/PMID ISOAbbreviation ArticleTitle).
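Because the table is plain tab-delimited text, it can be redirected to a file and then processed with ordinary Unix tools. The sketch below uses two invented rows as a stand-in for xtract output (the PMIDs, journal abbreviations, and titles are placeholders) and a temporary file created with mktemp:

```shell
# Placeholder rows standing in for three-column xtract output.
rows=$(printf '101\tJ Example\tFirst placeholder title\n102\tEx Rev\tSecond placeholder title\n')

# Save the table to a temporary file, as you might with "> results.txt".
outfile=$(mktemp)
printf '%s\n' "$rows" > "$outfile"

# cut -f selects tab-delimited columns; column 3 holds the article titles here.
titles=$(cut -f 3 "$outfile")
echo "$titles"

rm -f "$outfile"
```

With real data, the redirect would go at the end of the xtract line instead of the printf line.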

sort-uniq-count-rank and head

Sort a list of authors by the frequency they appear in your results set

esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern Author -element LastName,Initials | \
sort-uniq-count-rank | \
head -n 10

This series of commands searches PubMed for the string “traumatic brain injury athletes”, restricts results to those published in 2016 and 2017, retrieves the full XML records for each of the search results, extracts the last name and initials of every author on every record, sorts the authors by frequency of occurrence in the results set, and presents the ten most frequently occurring authors, along with the number of times each author appeared.

esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \

The first line of this command uses esearch to search PubMed (-db pubmed) for our search query (-query "traumatic brain injury athletes"). The line also restricts the search results to articles that were published in 2016 or 2017 (-datetype PDAT -mindate 2016 -maxdate 2017).

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern Author -element LastName,Initials | \

The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every author (-pattern Author). Even if there are multiple authors on a single citation, each author will be on a new line, rather than putting all authors for the same citation on the same line. The command then extracts each author’s last name and initials (-element LastName,Initials). This will output a list of authors’ names and initials, one author per line, and will pipe the list to the next line.

sort-uniq-count-rank | \

The fourth line uses a special EDirect function (sort-uniq-count-rank) to sort the list of authors received from the previous line, grouping together the duplicates. The function then counts how many occurrences there are of each unique author, removes the duplicate authors, and then sorts the list of unique authors by how frequently they occur, with the most frequent authors at the top. The function also returns the numerical count, making it easier to quantify how frequently each author occurs in the data set.
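The steps described above (sort, count duplicates, rank by count) can be approximated with standard Unix tools. The sketch below is a rough equivalent run on an invented author list; it is not EDirect’s own implementation of sort-uniq-count-rank, which may differ in details such as case handling and how ties are ordered:

```shell
# Made-up author list, one "LastName<TAB>Initials" per line (placeholders, not real results).
authors=$(printf 'Smith\tJ\nLee\tK\nSmith\tJ\nPatel\tR\nSmith\tJ\nLee\tK\n')

# 1. sort       groups identical lines together
# 2. uniq -c    collapses each group and prefixes it with its count
# 3. sort -nr   (on the first field) puts the largest counts first
# 4. awk        rewrites "   3 Smith<TAB>J" as "3<TAB>Smith<TAB>J"
ranked=$(printf '%s\n' "$authors" \
  | sort \
  | uniq -c \
  | sort -k1,1nr \
  | awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print count "\t" $0 }')

printf '%s\n' "$ranked"
```

In the sample list each author has a different count, so the ranking does not depend on how ties are broken.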

head -n 10

The fifth line, which is optional, shows us only the first ten rows from the output of the sort-uniq-count-rank function (head -n 10). Because this function puts the most frequently occurring authors first, this will show us only the ten most frequently occurring authors in our search results set:

14      Iverson     GL
11      Guskiewicz  KM
10      Meehan      WP
9       Kerr        ZY
9       Kontos      AP
9       Solomon     GS
9       Zuckerman   SL
8       Zafonte     R
7       Broglio     SP
7       Covassin    T

(Note: Your output may vary slightly, as additional citations are added to PubMed and the “most frequent” authors change.)

To show more or fewer rows, adjust the “10” up or down. If you want to see all of the authors, regardless of how frequently they appear, remove this line entirely. (If you do choose to remove this line, make sure you also remove the “|” and “\” characters from the previous line. Otherwise, the system will wait for you to finish entering your command.)
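As a small illustration of adjusting or dropping the head step, the sketch below runs on five invented ranked rows (“Author A” through “Author E” are placeholders):

```shell
# Placeholder ranked output: count, then author.
ranked=$(printf '5\tAuthor A\n4\tAuthor B\n3\tAuthor C\n2\tAuthor D\n1\tAuthor E\n')

# Keep only the first two rows.
top_two=$(printf '%s\n' "$ranked" | head -n 2)

# Without head, every row comes through.
all_count=$(printf '%s\n' "$ranked" | wc -l | tr -d ' ')

echo "$top_two"
echo "Rows without head: $all_count"
```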

In-class exercise solutions

Note: The first three exercises ask for an xtract command. The solutions below start with efetch commands that retrieve a sample set of PubMed records in XML, which are then piped into the xtract command. This allows us to test and verify the solutions using appropriate sample data.

Exercise 1

Write an xtract command that creates a table with one row per PubMed article. Each row should have two columns: volume number and issue number.

Solution:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element Volume Issue

This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), and with columns for volume number and issue number (-element Volume Issue).

Exercise 2

Write an xtract command that creates a table with one row per PubMed record. Each row should have three columns: PMID, journal ISSN, and citation status.

Solution:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID Journal/ISSN MedlineCitation@Status

This xtract command begins the same way as the solution for Exercise 1 (xtract -pattern PubmedArticle). When creating the first column, this command uses Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record, such as one that is a child of a <CommentsCorrections> element (-element MedlineCitation/PMID).

Similarly, the second column is also created using Parent/Child construction (Journal/ISSN). This is probably not strictly necessary, as the <ISSN> element only appears in one location in the PubMed XML structure. However, it demonstrates that there may be multiple valid EDirect solutions to a given question.

Finally, the citation status, which is found in the “Status” attribute of the <MedlineCitation> element, is placed in the third column (MedlineCitation@Status).

Exercise 3

Find out which authors have been writing about traumatic brain injuries in athletes, with publications in 2016 and 2017. The output should be a list of author names, one per line, with each author’s last name and initials.

Solution:

esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern Author -element LastName,Initials

This series of commands searches PubMed for the string “traumatic brain injury athletes”, restricts results to those published in 2016 and 2017, retrieves the full XML records for each of the search results, and extracts the last name and initials of every author on every record.

esearch -db pubmed -query "traumatic brain injury athletes" -datetype PDAT -mindate 2016 -maxdate 2017 | \

The first line of code uses esearch to search PubMed (-db pubmed) for our search query (-query "traumatic brain injury athletes"). The line also restricts the search results to articles that were published in 2016 or 2017 (-datetype PDAT -mindate 2016 -maxdate 2017).

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern Author -element LastName,Initials

The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every author (-pattern Author). Even if there are multiple authors on a single citation, each author will be on a new line, rather than putting all authors for the same citation on the same line.

The command then extracts each author’s last name and initials (-element LastName,Initials).

Homework solutions

Question 1

Using the efetch command below to retrieve PubMed XML, write an xtract command to extract specific elements and arrange them into a table. The table should have one PubMed record per row, with columns for PMID, Journal Title Abbreviation, Publication Year, Volume, Issue and Page Numbers.

efetch -db pubmed -id 12312644,12262899,11630826,22074095,22077608,21279770,22084910 -format xml

Solution:

efetch -db pubmed -id 12312644,12262899,11630826,22074095,22077608,21279770,22084910 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISOAbbreviation PubDate/Year Volume Issue MedlinePgn

This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle). When creating the first column, this command uses Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record, such as one that is a child of a <CommentsCorrections> element (-element MedlineCitation/PMID).

The second column is created without Parent/Child construction, as the <ISOAbbreviation> element is not repeated in a single PubMed XML record (ISOAbbreviation).

The third column also uses Parent/Child construction to retrieve the publication year (as opposed to other <Year> elements; PubDate/Year); the remaining elements only appear in one location in the PubMed XML structure, so Parent/Child construction is unnecessary (Volume Issue MedlinePgn).
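A six-column table is easier to read with a header row. xtract output does not include one by default, so the sketch below adds it with printf. The two data rows are invented placeholders standing in for xtract output, and the column names are simply labels chosen for this example:

```shell
# Placeholder rows: PMID, journal abbreviation, year, volume, issue, pages.
rows=$(printf '101\tJ Example\t1999\t12\t3\t45-50\n102\tEx Rev\t2001\t7\t1\t9-14\n')

# Print the header first, then the rows, and capture both as one table.
table=$(
  printf 'PMID\tJournal\tYear\tVolume\tIssue\tPages\n'
  printf '%s\n' "$rows"
)

printf '%s\n' "$table"
```

If a real record lacks one of these elements, its row may have fewer columns than the header, so it is worth checking the column counts before relying on the labels.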

Question 2

Create a table of the authors attached to PubMed record 28341696. The table should include each author’s last name, initials, and affiliation information (if listed).

Solution:

efetch -db pubmed -id 28341696 -format xml | \
xtract -pattern Author -element LastName Initials Affiliation

The first line of this solution uses the efetch command to retrieve a record from PubMed (-db pubmed). We specify that we will retrieve the record for PMID 28341696 (-id 28341696) and that we want the results in XML (-format xml).

xtract -pattern Author -element LastName Initials Affiliation

The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every author (-pattern Author). Even if there are multiple authors on a single citation, each author will be on a new line, rather than putting all authors for the same citation on the same line. The command then extracts each author’s last name, initials, and affiliation information (-element LastName Initials Affiliation).
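Because affiliation is only included “if listed”, and an author can have more than one affiliation, rows in a table like this may not all have exactly three columns. The sketch below flags such rows with awk, using invented rows as a stand-in for xtract output (the names and institutions are placeholders):

```shell
# Placeholder rows: one complete, one with no affiliation, one with two affiliations.
rows=$(printf 'Doe\tJ\tExample University\nRoe\tA\nPoe\tB\tSample Institute\tSecond Sample Institute\n')

# -F'\t' splits on tabs; NF is the number of columns and NR the row number.
irregular=$(printf '%s\n' "$rows" \
  | awk -F'\t' 'NF != 3 { print NR ": " NF " columns" }')

printf '%s\n' "$irregular"
```

The EDirect documentation describes a -def argument for xtract that supplies a placeholder for missing values; consult it for your installed version if you need every row to line up.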

Question 3

Write a series of commands to generate a table of PubMed records for review articles about the Paleolithic diet. The table should have one row per citation, and should include columns for the PMID, the citation status, and the article title.

Solution:

esearch -db pubmed -query "review[pt] paleolithic diet" | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID MedlineCitation@Status ArticleTitle

This series of commands searches PubMed for the string “review[pt] paleolithic diet”, retrieves the full XML records for each of the search results, and extracts the PMID, citation status, and article title of every record.

esearch -db pubmed -query "review[pt] paleolithic diet" | \

The first line of code uses esearch to search PubMed (-db pubmed) for our search query (-query "review[pt] paleolithic diet"). Note that the search query can include search field tags ([pt]) to help focus our search, just as we can in the web version of PubMed.

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern PubmedArticle -element MedlineCitation/PMID MedlineCitation@Status ArticleTitle

The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every PubMed record (-pattern PubmedArticle).

The command then extracts each record’s PMID (using Parent/Child construction; -element MedlineCitation/PMID), citation status (using “@” to retrieve the attribute value for “Status”; MedlineCitation@Status), and article title (ArticleTitle).
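Once the citation status is in its own column, the table can be filtered on it with standard tools. The sketch below keeps only the PMIDs of rows whose status is MEDLINE, using invented rows as a stand-in for xtract output (the PMIDs and titles are placeholders):

```shell
# Placeholder rows: PMID, citation status, article title.
rows=$(printf '101\tMEDLINE\tFirst placeholder title\n102\tPubMed-not-MEDLINE\tSecond placeholder title\n103\tMEDLINE\tThird placeholder title\n')

# Compare column 2 exactly, so "PubMed-not-MEDLINE" is not matched by accident.
medline_pmids=$(printf '%s\n' "$rows" \
  | awk -F'\t' '$2 == "MEDLINE" { print $1 }')

printf '%s\n' "$medline_pmids"
```

An exact comparison is used here rather than grep, since a plain text search for MEDLINE would also match the status PubMed-not-MEDLINE.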