"EDirect for PubMed: Part 3: Formatting Results and Unix Tools" Sample Code

Below you will find sample code for the examples, in-class exercises and homework presented in the third session of the “EDirect for PubMed” Insider’s Guide class. These examples are written for use with EDirect in a Unix environment. If you need help installing and setting up EDirect, please see our “Installing EDirect” page.

For more examples, please see the sample code from the other parts of “EDirect for PubMed”.

The code below is lightly annotated to explain how it works, but if you are looking for more information, we suggest you review our EDirect documentation.

There are many different ways to answer the questions discussed in class. The sample code below provides some options, but by no means the only options. Feel free to modify, adapt, edit, re-use or completely discard any of the suggestions below when trying to find a solution that works best for you.

xtract Formatting arguments

For an introduction to xtract Formatting arguments, see the Customizing separators section of our EDirect documentation.

Change the separators in an xtract output table

We can use the -tab and -sep arguments to modify the separators in an xtract output table. We will start with a basic xtract statement with no customized separators:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISSN LastName

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle), and with three columns: one for PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one for the journal ISSN, and one for author last name (-element MedlineCitation/PMID ISSN LastName). For articles with more than one author, we will see multiple author last names in the third column:

24102982        1742-4658       Wu      Doyle   Barry   Beauvais        Rozkalne        Piao    Lawlor    Kopin   Walsh   Gussoni
21171099        1097-4598       Wu      Gussoni
17150207        0012-1606       Yoon    Molloy  Wu      Cowan   Gussoni

By default, xtract separates columns in the output table with tabs (indicated in Unix as \t). Additionally, by default, xtract separates multiple values in the same column with tabs. So the following series of commands:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "\t" -sep "\t" -element MedlineCitation/PMID ISSN LastName

produces the same output as before, since we are telling xtract to use a tab to separate between columns (-tab "\t") and between multiple values in the same column (-sep "\t"), which xtract is already doing by default:

24102982        1742-4658       Wu      Doyle   Barry   Beauvais        Rozkalne        Piao    Lawlor    Kopin   Walsh   Gussoni
21171099        1097-4598       Wu      Gussoni
17150207        0012-1606       Yoon    Molloy  Wu      Cowan   Gussoni

We can modify the output by modifying the -sep argument:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "\t" -sep " " -element MedlineCitation/PMID ISSN LastName

This series of commands tells xtract to keep the separators between columns the same, but to separate multiple values in the same column (such as the multiple author last names in our third column) by spaces instead of tabs:

24102982        1742-4658       Wu Doyle Barry Beauvais Rozkalne Piao Lawlor Kopin Walsh Gussoni
21171099        1097-4598       Wu Gussoni
17150207        0012-1606       Yoon Molloy Wu Cowan Gussoni

We can further modify the output by modifying the -tab argument:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep " " -element MedlineCitation/PMID ISSN LastName

This time, the separators between columns have been changed from tabs to pipes (-tab "|"), while multiple values in the same column are still separated by spaces:

24102982|1742-4658|Wu Doyle Barry Beauvais Rozkalne Piao Lawlor Kopin Walsh Gussoni
21171099|1097-4598|Wu Gussoni
17150207|0012-1606|Yoon Molloy Wu Cowan Gussoni

The -tab and -sep arguments also allow you to specify separators of more than one character:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep ", " -element MedlineCitation/PMID ISSN LastName

This series of commands uses pipes to separate the columns (-tab "|"), but uses a comma followed by a space to separate the last names (-sep ", "):

24102982|1742-4658|Wu, Doyle, Barry, Beauvais, Rozkalne, Piao, Lawlor, Kopin, Walsh, Gussoni
21171099|1097-4598|Wu, Gussoni
17150207|0012-1606|Yoon, Molloy, Wu, Cowan, Gussoni
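Once the columns are separated by pipes, standard Unix tools can pick them apart. The following is a minimal sketch rather than part of the class materials: it assumes the sample output above has been copied into a shell variable (the name table is our own) instead of being fetched live, and it uses cut to keep only the PMID and author columns.

```shell
# Sample rows copied from the output above, so no network call is needed.
table='24102982|1742-4658|Wu, Doyle, Barry, Beauvais, Rozkalne, Piao, Lawlor, Kopin, Walsh, Gussoni
21171099|1097-4598|Wu, Gussoni
17150207|0012-1606|Yoon, Molloy, Wu, Cowan, Gussoni'

# cut splits each row on "|" and keeps the first and third columns.
pmid_authors=$(printf '%s\n' "$table" | cut -d '|' -f 1,3)
printf '%s\n' "$pmid_authors"
```

Because the author names are separated by ", " rather than "|", cut treats the whole author list as a single column.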

xtract Exploration arguments

For an introduction to xtract Exploration arguments, see the Exploration arguments section of our EDirect documentation.

Retrieve author names for a list of PubMed records

In order to retrieve the author names (including last name and initials) for all of the authors associated with each of several PubMed records, we might try to use code such as the following:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID LastName Initials

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle), and with three columns: one for PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one for author last name, and one for author initials (-element MedlineCitation/PMID LastName Initials). However, the output of this series of commands is not what we expect:

24102982        Wu      Doyle   Barry   Beauvais        Rozkalne        Piao    Lawlor  Kopin   Walsh     Gussoni MP      JR      B       A       A       X       MW      AS    CA      E
21171099        Wu      Gussoni MP      E
17150207        Yoon    Molloy  Wu      Cowan   Gussoni S       MJ      MP      DB      E

The PMID appears as we expect, as does the first author last name. However, rather than following the first author’s last name with the corresponding initials, our output lists all of the authors’ last names for a PubMed record first, before listing all of the authors’ initials.

To retain the relationship between last name and initials, we could use the following series of commands:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -element LastName Initials

The second line of this code creates a column for the PMID as before (xtract -pattern PubmedArticle -element MedlineCitation/PMID). However, the code then uses the -block argument to direct xtract to look for an <Author> element, then to look within that <Author> for <LastName> and <Initials> elements (-block Author -element LastName Initials). Because each author has only one last name and one set of initials, xtract outputs a corresponding pair of last name and initials, before moving on to find the next author. This process is then repeated for each author, giving us the output we expect:

24102982        Wu      MP      Doyle   JR      Barry   B       Beauvais        A       Rozkalne        A       Piao    X       Lawlor  MW      Kopin   AS   Walsh    CA      Gussoni E
21171099        Wu      MP      Gussoni E
17150207        Yoon    S       Molloy  MJ      Wu      MP      Cowan   DB      Gussoni E

Putting values from multiple elements in the same column

Separate author last name and initials with a space, while separating columns with a tab

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -element LastName,Initials

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to create a table, with one row for every PubMed record in our XML (xtract -pattern PubmedArticle), and creates a column for the PMID (xtract -pattern PubmedArticle -element MedlineCitation/PMID). As seen in previous examples, the code then uses the -block argument to direct xtract to look for an <Author> element (-block Author), then to look within that <Author> for <LastName> and <Initials> elements.

Because each author has only one last name and one set of initials, xtract outputs a corresponding pair of last name and initials, before moving on to find the next author. However, rather than putting the last name and initials in separate columns, this command uses a comma to group together both the last name and initials in the same column (-element LastName,Initials). This tells xtract to separate the last name and initials with the character we define in the -sep argument (which we have defined as a single space: -sep " "), instead of using the separator between columns (which is still the default tab), and gives us the output we desire:

24102982        Wu MP   Doyle JR        Barry B Beauvais A      Rozkalne A      Piao X  Lawlor MW       Kopin AS        Walsh CA      Gussoni E
21171099        Wu MP   Gussoni E
17150207        Yoon S  Molloy MJ       Wu MP   Cowan DB        Gussoni E
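With each author's last name and initials sharing one tab-separated column, a row can be reshaped with ordinary Unix tools. This is a small sketch under our own assumptions: one row of the output above is rebuilt with printf (so no network call is made), and the variable names row and authors are ours.

```shell
# One sample row from the output above; printf turns \t into real tabs.
row=$(printf '21171099\tWu MP\tGussoni E')

# cut -f 2- drops the PMID column (tab is cut's default delimiter),
# then tr puts each author on a line of their own.
authors=$(printf '%s\n' "$row" | cut -f 2- | tr '\t' '\n')
printf '%s\n' "$authors"
```

This only works cleanly because the space inside each name is not the column separator; had -sep been left at its default tab, last names and initials would land on separate lines.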

Working with files

Saving results to a file

efetch -db pubmed -id 24102982,21171099,17150207 -format xml > testfile.txt

This line of code uses the efetch command to retrieve records from PubMed (-db pubmed -id 24102982,21171099,17150207) in XML format (-format xml), and redirects the XML output to a file named “testfile.txt” (> testfile.txt).
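One detail worth knowing about redirection: ">" replaces the contents of an existing file, while ">>" (not used in the class examples) appends to it. The sketch below illustrates the difference, using printf as a stand-in for efetch and a scratch directory created with mktemp so that nothing of yours is overwritten; the variable names are our own.

```shell
# Work in a scratch directory so no existing file is touched.
workdir=$(mktemp -d)
outfile="$workdir/testfile.txt"

# ">" creates the file, or replaces its contents if it already exists.
printf 'first run\n' > "$outfile"
printf 'second run\n' > "$outfile"
after_overwrite=$(cat "$outfile")

# ">>" adds to the end of the file instead of replacing it.
printf 'third run\n' >> "$outfile"
line_count=$(wc -l < "$outfile" | tr -d ' ')

printf 'after overwrite: %s\n' "$after_overwrite"
printf 'lines after append: %s\n' "$line_count"

# Remove the scratch directory when done.
rm -r "$workdir"
```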

Using a search string saved in a file to search PubMed

esearch -db pubmed -query "$(cat searchstring.txt)"

This line of code uses the esearch command to search PubMed (-db pubmed). The search query is stored in a text file (“searchstring.txt”), and the cat command is used to access the contents of the file for use as a search query (-query "$(cat searchstring.txt)"). The dollar-sign and parentheses around cat searchstring.txt indicate that Unix should use the value of cat searchstring.txt (i.e. the contents of the file “searchstring.txt”), rather than simply the words “cat searchstring.txt”.
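To see what the dollar-sign and parentheses do without running a search, the following sketch prints both forms side by side. It is an illustration only: the search string is a hypothetical one, the file is written to a scratch directory, and printf takes the place of esearch.

```shell
# Save a hypothetical search string to a file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'zika virus AND microcephaly' > "$workdir/searchstring.txt"

# Without $( ), the shell passes along the literal words.
literal="cat searchstring.txt"

# With $( ), the shell runs cat and substitutes the contents of the file.
query="$(cat "$workdir/searchstring.txt")"

printf 'literal: %s\n' "$literal"
printf 'query:   %s\n' "$query"

# Remove the scratch directory when done.
rm -r "$workdir"
```

The double quotes around $( ) keep a search string that contains spaces together as a single -query value.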

epost

Post two PMIDs to the History server

epost -db pubmed -id 24102982,21171099

This line of code uses the epost command to post two unique identifiers (UIDs) to the History server (-id 24102982,21171099), indicating that the UIDs are for records in the PubMed database (i.e. that the UIDs are actually PMIDs; -db pubmed).

For more information about piping data from one EDirect command to another, please review the page on Making data pipelines with the History server in our EDirect overview.

Post two PMIDs to the History server and retrieve the corresponding PubMed records in abstract format

epost -db pubmed -id 24102982,21171099 | efetch -format abstract

This line of code uses the epost command to post two unique identifiers (UIDs) to the History server (-id 24102982,21171099), indicating that the UIDs are for records in the PubMed database (i.e. that the UIDs are actually PMIDs; -db pubmed). The line then pipes information to an efetch command (| efetch), which allows the efetch command to retrieve the correct PMIDs from the History server. The efetch command then retrieves the corresponding PubMed records in text abstract format (-format abstract). For more information about piping data from one EDirect command to another, please review the page on Making data pipelines with the History server in our EDirect overview.

Retrieve PubMed records in abstract format for a list of PMIDs contained in a CSV file

cat pmids.csv | epost -db pubmed | efetch -format abstract

This line of code uses cat to open a CSV file (“pmids.csv”) which contains a list of PMIDs (cat pmids.csv). Rather than displaying the contents of the file on the screen, this line of code pipes the contents of the file into an epost command (| epost). The epost command stores the PMIDs on the History server, indicating to the History server that they are PMIDs, and not UIDs from a different database (-db pubmed). Finally, the line pipes the information to an efetch command (| efetch), which allows the efetch command to retrieve the correct PMIDs from the History server. The efetch command then retrieves the corresponding PubMed records in text abstract format (-format abstract).

epost -db pubmed -input pmids.csv | efetch -format abstract

This line of code is another way of accomplishing the same task as the previous example. Rather than use cat to open the file “pmids.csv”, this line uses the epost command’s -input argument, which was added to EDirect in version 4.90 (released on September 14, 2016). The epost command stores the PMIDs on the History server, indicating to the History server that they are PMIDs, and not UIDs from a different database (-db pubmed). Finally, the line pipes the information to an efetch command (| efetch), which allows the efetch command to retrieve the correct PMIDs from the History server. The efetch command then retrieves the corresponding PubMed records in text abstract format (-format abstract).

For more information about piping data from one EDirect command to another, please review the page on Making data pipelines with the History server in our EDirect overview.
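The examples above do not show how “pmids.csv” is laid out. A list written by efetch -format uid has one PMID per line. If your own file instead holds its PMIDs on a single comma-separated line and you would like it in that one-per-line form, a sketch such as the following converts it. The file name pmids_by_line.csv and the scratch directory are our own choices for illustration.

```shell
# Build a single comma-separated line in a scratch directory,
# using PMIDs from the examples above.
workdir=$(mktemp -d)
printf '24102982,21171099,17150207\n' > "$workdir/pmids.csv"

# tr replaces every comma with a newline, giving one PMID per line.
tr ',' '\n' < "$workdir/pmids.csv" > "$workdir/pmids_by_line.csv"

pmid_lines=$(wc -l < "$workdir/pmids_by_line.csv" | tr -d ' ')
cat "$workdir/pmids_by_line.csv"
printf 'PMIDs found: %s\n' "$pmid_lines"

# Remove the scratch directory when done.
rm -r "$workdir"
```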

In-class exercise solutions

Note: The first two exercises ask for an xtract command. The solutions below start with efetch commands that retrieve a sample set of PubMed records in XML, which are then piped into the xtract command. This allows us to test and verify the solutions using appropriate sample data.

Exercise 1

Write an xtract command that generates a new row for each PubMed record, and has columns for PMID, journal title abbreviation, and author-supplied keywords. Each column should be separated by “|”. Multiple keywords in the last column should be separated with commas.

Sample Output:

26359634|Elife|Argonaute,RNA silencing,biochemistry,biophysics,human,microRNA,structural biology

Solution:

efetch -db pubmed -id 26359634,24102982,28194521,27794519 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep "," -element MedlineCitation/PMID ISOAbbreviation Keyword

This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), and with columns for PMID (specified using Parent/Child construction), journal title abbreviation, and author-supplied keywords (-element MedlineCitation/PMID ISOAbbreviation Keyword).

Instead of separating the columns by tabs, the command uses the -tab argument to specify pipe (“|”) as a separator (-tab "|"). Because each record could have multiple author-supplied keywords, the command uses the -sep argument to specify a separator between multiple values in a column (i.e. multiple author-supplied keywords in the third column; -sep ",").
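Choosing distinct separators for columns and for values within a column pays off when the table is processed further. As a rough sketch (not part of the exercise), the awk command below counts the keywords in the sample output row; the row is pasted in as a variable rather than fetched, and the names row and keyword_count are ours.

```shell
# The sample output row from the exercise, pasted in so no network call is needed.
row='26359634|Elife|Argonaute,RNA silencing,biochemistry,biophysics,human,microRNA,structural biology'

# -F '|' splits the row into columns; split() then breaks the third
# column on commas and returns how many keywords it found.
keyword_count=$(printf '%s\n' "$row" | awk -F '|' '{ n = split($3, kw, ","); print $1 "\t" n }')
printf '%s\n' "$keyword_count"
```

A keyword that itself contains a comma would be counted twice here, which is one reason to pick a -sep character that is unlikely to appear in the data.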

Exercise 2

Write an xtract command that creates a table with a new row for each PubMed record. Each row should have the record’s PMID, as well as a list of all the MeSH headings for the records, separated by “|”. If a MeSH heading has subheadings attached, separate the heading and subheadings with “/”. For example:

24102982|Cell Fusion|Myoblasts/cytology/metabolism|Muscle Development/physiology

Solution:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID \
-block MeshHeading -tab "|" -sep "/" -element DescriptorName,QualifierName

This xtract command begins the same as the solution for Exercise 1 (xtract -pattern PubmedArticle). The command then specifies a separator between columns (-tab "|") and the first column in the output table (using Parent/Child construction; -element MedlineCitation/PMID). The “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

The command continues on the next line, using -block to maintain the relationship between MeSH headings and related subheadings. The -block argument directs xtract to look for a <MeshHeading> element (-block MeshHeading) then to look within that <MeshHeading> for <DescriptorName> and <QualifierName> elements. Another argument is needed to respecify the separator between columns (-tab "|"), as the separators are reset to default by -block.

Each <MeshHeading> element contains one <DescriptorName>, but may contain zero or more <QualifierName> elements. For each -block, the -element argument populates a column with the <DescriptorName> and all of the <QualifierName> elements, if there are any (-element DescriptorName,QualifierName). For MeSH headings with subheadings, this will place multiple values in the same column (one <DescriptorName> and one or more <QualifierName> elements), so we establish “/” as a separator between multiple values in the same column (-sep "/").
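Because “/” appears only where a subheading is attached, the output of this solution is easy to filter afterwards. The sketch below, which is our own illustration and works on the example row pasted in as a variable, lists only the MeSH headings that carry at least one subheading.

```shell
# The example row from the exercise, pasted in so no network call is needed.
row='24102982|Cell Fusion|Myoblasts/cytology/metabolism|Muscle Development/physiology'

# cut drops the PMID, tr puts each heading on its own line,
# and grep keeps only the headings that contain a "/".
with_subheadings=$(printf '%s\n' "$row" | cut -d '|' -f 2- | tr '|' '\n' | grep '/')
printf '%s\n' "$with_subheadings"
```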

Exercise 3

How can we get the full XML of all articles about the relationship of Zika Virus to microcephaly in Brazil? Save your results to a file.

Solution:

esearch -db pubmed -query "zika virus microcephaly brazil" | \
efetch -format xml > zika.xml

This solution begins by using the esearch command to search PubMed (-db pubmed) for our search query (-query "zika virus microcephaly brazil"). The first line concludes by piping (|) the results of the esearch command into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The efetch command in the second line accepts the PMIDs piped from the previous line, and retrieves the PubMed records in full XML (-format xml). The results of the command are then redirected to a file (> zika.xml).

Homework solutions

Question 1

In the PubMed XML of each record, there is a <History> element, with one or more elements which provide dates for various stages in each article’s life cycle. These can include when the article was submitted to the publisher for review, when the article was accepted by the publisher for publication, when it was added to PubMed, and/or when it was indexed for MEDLINE, among others. Not all citations will include entries for each type of date.

For the following list of PMIDs

22389010,20060130,14678125,19750182,19042713,18586245

write a series of commands that retrieves each record and extracts all of these different dates, along with the labels that indicate which type of date is which.

Each PubMed record should appear on a separate line. Each line should start with the PMID, followed by a tab, followed by the list of dates. For each date, include the label, followed by a “:”, followed by the year, month and day, separated by slashes. Separate each date with a “|”.

Example output:

18586245        received:2008/01/21|revised:2008/05/05|accepted:2008/05/07|pubmed:2008/7/1|medline:2008/10/28|entrez:2008/7/1

Solution:

efetch -db pubmed -id 22389010,20060130,14678125,19750182,19042713,18586245 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block PubMedPubDate -tab ":" -sep "/" -element PubMedPubDate@PubStatus \
-tab "|" -element Year,Month,Day
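The label:date layout requested by the question makes individual dates easy to pull out later. As a sketch under our own assumptions (the example output line is rebuilt with printf instead of being fetched, and the variable names are ours), the following finds the date on which PMID 18586245 was accepted.

```shell
# The example output line from the question; printf turns \t into a real tab.
line=$(printf '18586245\treceived:2008/01/21|revised:2008/05/05|accepted:2008/05/07|pubmed:2008/7/1|medline:2008/10/28|entrez:2008/7/1')

# cut keeps the date column, tr puts each label:date pair on its own line,
# and awk prints the date whose label is "accepted".
accepted=$(printf '%s\n' "$line" | cut -f 2 | tr '|' '\n' | awk -F ':' '$1 == "accepted" { print $2 }')
printf '%s\n' "$accepted"
```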

Question 2

Identify your “working directory”. Write a series of commands that retrieve PubMed data, redirect the output to a file, and locate the file on your computer.

Solution:

The solution to this question may vary, depending on what type of Unix system you are using to run EDirect. One possible solution for identifying your “working directory” is:

pwd

The pwd command prints to the screen the name of your working directory. Depending on your system, this may give you all of the information you need to find your working directory. If not, please review the material presented in “EDirect for PubMed: Part 3: Formatting Results and Unix Tools”.

The second part of this question may also have many solutions. One possible solution is:

efetch -db pubmed -id 22389010,20060130,14678125,19750182,19042713,18586245 -format abstract > abstracts.txt

This solution uses a basic efetch command to retrieve the six specified PubMed records in text abstract format (-format abstract). The command then redirects the output to a text file (> abstracts.txt). Provided you have found your working directory, you can find your new file and open it in a text editor.
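If you are unsure where the file ended up, pwd and ls can confirm it from the command line. The following is a sketch only: it runs in a scratch directory created with mktemp, and a printf line stands in for the efetch output so that no network call is made. The variable names are our own.

```shell
# Move into a scratch directory, remembering where we started.
workdir=$(mktemp -d)
start_dir=$(pwd)
cd "$workdir" || exit 1

# Stand-in for the efetch output, redirected to a file as in the solution.
printf 'placeholder abstract text\n' > abstracts.txt

# pwd gives the working directory; adding the file name gives the full path.
full_path="$(pwd)/abstracts.txt"
printf '%s\n' "$full_path"

# ls -l confirms the file exists and shows its size and modification time.
ls -l "$full_path"
found=no
[ -f "$full_path" ] && found=yes

# Return to the starting directory and remove the scratch directory.
cd "$start_dir" || exit 1
rm -r "$workdir"
```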

Question 3

Write a series of commands that identifies the top ten agencies that have most frequently funded published research on diabetes and pregnancy over the last year and a half. Your script should start with a search for articles about diabetes and pregnancy that were published between January 1, 2016 and June 30, 2017, should extract the agencies listed as funders on each citation, and should output a list of the ten most frequently occurring agencies. Save the results to a file.

Note: This script may take some time to run. As you build it, consider testing with a small set of PubMed records, or with a search that has a smaller date range.

Solution:

esearch -db pubmed -query "diabetes AND pregnancy" -datetype PDAT -mindate 2016/01/01 -maxdate 2017/06/30 | \
efetch -format xml | \
xtract -pattern Grant -element Agency | \
sort-uniq-count-rank | \
head -n 10 > funders.txt

This series of commands searches PubMed for the string “diabetes AND pregnancy” with a publication date between January 1, 2016 and June 30, 2017; retrieves the full XML records for each of the search results; extracts the funding agency from every grant listed on every record; sorts the agencies by frequency of occurrence in the results set; and saves the ten most frequently occurring agencies, along with the number of times each agency appeared, to a file.

esearch -db pubmed -query "diabetes AND pregnancy" -datetype PDAT -mindate 2016/01/01 -maxdate 2017/06/30 | \

The first line of this command uses esearch to search PubMed (-db pubmed) for our search query (-query "diabetes AND pregnancy"). We use the -datetype, -mindate, and -maxdate arguments to add our date restriction (-datetype PDAT -mindate 2016/01/01 -maxdate 2017/06/30). Alternatively, we could include the date restriction in our search string, as part of our -query argument.

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern Grant -element Agency | \

The third line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every grant (-pattern Grant). Even if there are multiple grants on a single citation, each grant will be on a new line, rather than putting all grants for the same citation on the same line. The command then extracts the funding agency for each grant (-element Agency). This will output a list of agencies, one agency per line, and will pipe the list to the next line. Because there is one row per grant, an agency that funded several grants on the same citation is counted once for each grant.

sort-uniq-count-rank | \

The fourth line uses a special EDirect function (sort-uniq-count-rank) to sort the list of agencies received from the previous line, grouping together the duplicates. The function then counts how many occurrences there are of each unique agency, removes the duplicate agencies, and then sorts the list of unique agencies by how frequently they occur, with the most frequent agencies at the top. The function also returns the numerical count, making it easier to quantify how frequently each agency occurs in the data set.

head -n 10 > funders.txt

The fifth line keeps only the first ten rows from the output of the sort-uniq-count-rank function (head -n 10). Because this function puts the most frequently occurring agencies first, this leaves only the ten most frequently occurring agencies in our search results set. The line then redirects those ten rows to a file (> funders.txt), as the question asks; the file name is our choice, and any name will do.
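To get a feel for what the counting and ranking step does, it can be approximated with standard Unix tools. The sketch below is a rough equivalent built from sort, uniq and awk, not the actual sort-uniq-count-rank function, which may differ in details such as how it treats upper and lower case. The input is a handful of placeholder values, one per line, standing in for the list that xtract would produce.

```shell
# Placeholder input: one value per line, as xtract would emit.
values='beta
alpha
beta
gamma
beta
alpha'

# sort groups the duplicates, uniq -c counts each group,
# sort -k1,1nr ranks the groups by count (largest first),
# and awk rewrites each row as count<TAB>value.
ranked=$(printf '%s\n' "$values" | sort | uniq -c | sort -k1,1nr -k2 |
  awk '{ c = $1; $1 = ""; sub(/^ /, ""); print c "\t" $0 }')

# head keeps only the top rows, as in the final line of the solution.
printf '%s\n' "$ranked" | head -n 2
```

Testing a pipeline on a small, known list like this before running it on real search results makes it easier to confirm that the counts are what you expect.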

Question 4

Write a PubMed search strategy and save it to a file. Write a series of commands to search PubMed using the search string contained in the file and retrieve a list of PMIDs for all records which meet the search criteria.

Solution:

The solution to this may vary, based on your strategy and the name of the file to which you save it. For this example, our search strategy is saved to a file named “searchstring.txt”.

esearch -db pubmed -query "$(cat searchstring.txt)" | \
efetch -format uid

The first line uses esearch to search PubMed (-db pubmed). The line uses the Unix command cat to read the entire contents of a file (searchstring.txt) and use it as the search query (-query "$(cat searchstring.txt)").

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

efetch -format uid

The second line takes the esearch result from our first line and uses efetch to retrieve the PMIDs for all of the records in our results set (efetch -format uid).

Question 5

Save the following list of PMIDs in a .csv file:

22389010,20060130,14678125,19750182,19042713,18586245

Write a series of commands to retrieve the full PubMed XML records for all of the PMIDs in the file, and save the resulting XML to a .xml file.

Solution:

The solution to this may vary, based on how you choose to save your PMIDs to a file, and on the name of that file. To begin, you could save your PMIDs to a file using efetch:

efetch -db pubmed -id 22389010,20060130,14678125,19750182,19042713,18586245 -format uid > pmids.csv

Regardless of how you get the PMIDs into a .csv file, you can use epost -input and efetch to retrieve the records.

epost -db pubmed -input pmids.csv | \
efetch -format xml > records.xml

The first line of this solution uses epost to retrieve the numbers from the “pmids.csv” file (-input pmids.csv) and save them to the History server, along with the indication that the numbers are PMIDs, and refer to records in PubMed (-db pubmed).

The “|” character pipes the WebEnv and QueryKey output of our epost into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

The efetch command on the second line receives the WebEnv and QueryKey from the epost and uses the information to locate on the History server the specific set of PMIDs posted by our epost command. The efetch command then retrieves the full records for each of those PMIDs in full PubMed XML (-format xml), and saves the output to a new file (> records.xml).
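If the PMID file was typed or pasted by hand, it can be worth checking it before posting. The following is an optional sketch, not part of the homework solution: it assumes one PMID per line, works in a scratch directory, and includes one deliberately bad entry to show what the check catches. The variable names are our own.

```shell
# The six PMIDs from the question, one per line, plus one deliberately bad entry.
workdir=$(mktemp -d)
printf '%s\n' 22389010 20060130 14678125 19750182 19042713 18586245 'not-a-pmid' > "$workdir/pmids.csv"

# grep -c -E counts the lines made up only of digits;
# grep -v -E prints every line that is anything else.
good=$(grep -c -E '^[0-9]+$' "$workdir/pmids.csv")
bad=$(grep -v -E '^[0-9]+$' "$workdir/pmids.csv")

printf 'valid PMIDs: %s\n' "$good"
printf 'rejected: %s\n' "$bad"

# Remove the scratch directory when done.
rm -r "$workdir"
```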