"EDirect Office Hours -- April 2018: Variables in xtract" Sample Code

As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.

Below you will find sample code for the example presented in the April 2018 EDirect Office Hours. This example is written for use with EDirect in a Unix environment. If you need help installing and setting up EDirect, please see our “Installing EDirect” page.

There are many different ways to achieve the objectives discussed in the session. The sample code below provides one option, but by no means the only option, and not even necessarily the best option. Feel free to modify, adapt, edit, re-use or completely discard any of the suggestions below when trying to find the solution that works best for you.

Find a list of the funding agencies most commonly listed as supporting articles written by authors from a specific institution

Goal:

Find out which funding agencies have been credited on the articles produced by a specific institution (e.g. The Center for Translational Medicine at Thomas Jefferson University in Philadelphia, PA). Different authors affiliated with the same institution may refer to their institution differently (variant spellings, abbreviations, addresses, etc.), which makes precise searching by author affiliation difficult.

Solution:

esearch -db pubmed -query "translational medicine[ad] AND thomas jefferson[ad]" | \
efetch -format xml | \
xtract -pattern PubmedArticle -VAR1 MedlineCitation/PMID \
-block Affiliation -if Affiliation -contains "translational medicine" -and Affiliation -contains "thomas jefferson" \
-tab "\n" -element "&VAR1" | \
epost -db pubmed | \
efetch -format xml | \
xtract -pattern Grant -element Agency | \
sort-uniq-count-rank | \
head -n 20

This series of commands searches PubMed for two strings (“translational medicine” and “thomas jefferson”) in the author affiliation data, and retrieves the full XML records for each of the search results. This search is overly broad (sensitive), as it will also retrieve PubMed records where the two strings appear in separate affiliation data for separate authors. To narrow the results, the code uses a series of conditions to find only cases where both strings are present in the same <Affiliation> element, and uses a variable to output the PMID of the record when those conditions are met. This creates a list of PMIDs which more precisely reflects our desired results. The full XML records for each of these PMIDs are then retrieved, and the funding agency for every grant listed on every record is extracted. The code then sorts the funding agencies by frequency of occurrence in the narrowed results set, and presents the twenty most frequently occurring agencies, along with the number of times each agency appeared.

Discussion:

esearch -db pubmed -query "translational medicine[ad] AND thomas jefferson[ad]" | \

The first line of this command uses esearch to search PubMed (-db pubmed) for our search query (-query "translational medicine[ad] AND thomas jefferson[ad]"). Our search query will retrieve every PubMed record that has the phrase “translational medicine” in the affiliation data for one of the record’s authors, and has the phrase “thomas jefferson” in the affiliation data for one of the record’s authors.

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.
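The pipe-and-backslash pattern can be tried without EDirect. The following is a minimal sketch using only standard Unix tools; the fruit names are placeholder data and the variable name is illustrative, not part of the original example.

```shell
#!/bin/sh
# Three commands joined by pipes, written across three lines.
# Each "\" at the end of a line tells the shell the command continues below.
first=$(printf '%s\n' banana apple cherry | \
sort | \
head -n 1)

# Print the alphabetically first item.
printf '%s\n' "$first"
```

In POSIX-style shells, a line that ends with “|” already continues onto the next line, so the trailing “\” after a pipe is mainly a readability convention. It is required when a line breaks in the middle of a single command, as on the third and fourth lines of the solution.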

efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern PubmedArticle -VAR1 MedlineCitation/PMID \

The third line is the beginning of an xtract command. The -pattern argument indicates that we should start a new row for each PubMed record (-pattern PubmedArticle). However, “start a new row for each PubMed record” may be misleading here. In some scripts, we use an -element MedlineCitation/PMID argument to output the PMID for each record at the start of each line. This script instead uses a custom argument to save the PMID for the record to the variable “VAR1”, rather than outputting it immediately (-VAR1 MedlineCitation/PMID). The conditional statements on the following line mean that this xtract command might not output the contents of the “VAR1” variable (or any other data) for certain records. Additionally, due to the -tab argument on the fifth line, data from some records may be printed on several lines.

-block Affiliation -if Affiliation -contains "translational medicine" -and Affiliation -contains "thomas jefferson" \

The fourth line uses -block Affiliation to check through each <Affiliation> element on the record, one at a time, and checks whether the element contains both the phrase “translational medicine” and the phrase “thomas jefferson”. If both of these conditions are true for an <Affiliation> element, then it likely indicates an affiliation with The Center for Translational Medicine at Thomas Jefferson University, and that the record meets our search criteria. In this case, the xtract command will output data, as specified on the fifth line.

If either of these conditions is false for an <Affiliation> element, nothing will be output for that -block, and xtract will check the next <Affiliation> element on the record. If one or both of these conditions is false for every <Affiliation> element on a record, it does not meet our search criteria, and nothing will be output for the record at all. No row is created for the record, and xtract moves on to the next record.
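The difference between “both phrases somewhere on the record” and “both phrases in the same affiliation” can be illustrated with an analogy in standard Unix tools. This sketch does not use xtract. The affiliation strings are made up, and chaining two grep commands is only a stand-in for the -if ... -and ... test described above.

```shell
#!/bin/sh
# Made-up affiliation strings, one per line, standing in for the
# <Affiliation> elements of a single hypothetical record.
affiliations=$(cat <<'EOF'
Center for Translational Medicine, Thomas Jefferson University, Philadelphia, PA.
Department of Surgery, Thomas Jefferson University, Philadelphia, PA.
Institute for Translational Medicine, Example University.
EOF
)

# Chaining two greps keeps only the lines that contain BOTH phrases.
# -i is used so the lowercase phrases match the capitalized sample text.
matches=$(printf '%s\n' "$affiliations" | \
grep -i 'translational medicine' | \
grep -i 'thomas jefferson')

printf '%s\n' "$matches"
```

Only the first sample line passes both filters. The second and third lines each contain one of the two phrases, which is the situation that makes the initial search too broad.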

-tab "\n" -element "&VAR1" | \

The fifth line specifies the output for any <Affiliation> elements that meet the conditions on the line above. If both conditions are met for a given -block, the contents of the “VAR1” variable (i.e. the PMID for the record) will be output (-element "&VAR1"). However, if multiple <Affiliation> elements for a record meet both conditions (such as when multiple colleagues from the Center for Translational Medicine at Thomas Jefferson University co-author an article together), the contents of “VAR1” will be output multiple times. Ordinarily, this would create a row with the same PMID printed several times (once for each <Affiliation> element which meets both conditions), separated by tabs. However, the -tab "\n" argument specifies that these PMIDs should be separated not by tabs, but by newlines (“\n”). This will ensure that each PMID is printed on a separate line, though, in the case where multiple <Affiliation> elements for a given record meet our conditions, the same PMID will be printed repeatedly on successive lines.
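If you want to inspect or save the PMID list before posting it, the repeated PMIDs can also be collapsed locally. The sketch below assumes the xtract output is one PMID per line; the PMIDs shown are placeholders, not real search results. This step is optional in the solution above, because the history server de-duplicates the PMIDs after epost.

```shell
#!/bin/sh
# Placeholder PMIDs, repeated where several <Affiliation> elements
# on the same hypothetical record met both conditions.
pmids=$(printf '%s\n' 11111111 11111111 22222222 33333333 33333333 33333333)

# sort -n orders the IDs numerically; -u keeps one copy of each.
unique_pmids=$(printf '%s\n' "$pmids" | sort -n -u)

printf '%s\n' "$unique_pmids"
```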

epost -db pubmed | \

The sixth line uses the epost command to post the PMIDs output by the xtract command to the history server. The -db pubmed argument tells the history server that the numbers being piped in from the previous line are not just numbers, but are UIDs for the PubMed database (i.e. PMIDs). The “|” at the end of the line pipes a Web Environment identifier and Query Key to the next line. Taken together, the Web Environment identifier and Query Key identify a specific set of PMIDs on the history server. The history server automatically de-duplicates the PMIDs, so only unique PMIDs are included in the set.

efetch -format xml | \

The seventh line takes the Web Environment identifier and Query Key from the previous line and uses efetch to retrieve the full records in the XML format (-format xml) for each of the PMIDs from that specific set, and pipes the XML output to the next line. This XML output should contain fewer records than our initial set (retrieved in the second line), as it only includes records where both “translational medicine” and “thomas jefferson” occur in the same <Affiliation> element.

xtract -pattern Grant -element Agency | \

The eighth line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for every grant (-pattern Grant). Even if there are multiple grants on a single citation, information for each grant will be on a new line, rather than putting all grants for the same citation on the same line. The command then extracts each grant’s funding agency (-element Agency). This will output a list of agencies, one agency per line, and will pipe the list to the next line.

sort-uniq-count-rank | \

The ninth line uses a special EDirect function (sort-uniq-count-rank) to sort the list of agencies received from the previous line, grouping together the duplicates. The function then counts how many occurrences there are of each unique agency, removes the duplicate agencies, and then sorts the list of unique agencies by how frequently they occur, with the most frequent agencies at the top. The function also returns the numerical count, making it easier to quantify how frequently each agency occurs in the data set.
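The sort, count, and rank steps described above can be approximated with standard Unix tools. The sketch below is a rough equivalent for illustration, not the actual implementation of sort-uniq-count-rank, and its handling of letter case and ties may differ. The agency names are made up.

```shell
#!/bin/sh
# Made-up agency names standing in for the output of the eighth line.
agencies=$(printf '%s\n' 'Agency A' 'Agency B' 'Agency A' 'Agency C' 'Agency A' 'Agency B')

# sort groups identical lines together;
# uniq -c collapses each group and prefixes it with its count;
# sort -k1,1nr ranks the groups by count, highest first;
# awk rewrites each line as "count<TAB>agency".
ranked=$(printf '%s\n' "$agencies" | \
sort | \
uniq -c | \
sort -k1,1nr | \
awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }')

printf '%s\n' "$ranked"
```

With this sample data, “Agency A” (three occurrences) is listed first, followed by “Agency B” (two) and “Agency C” (one).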

head -n 20

The tenth line, which is optional, shows us only the first twenty rows from the output of the sort-uniq-count-rank function (head -n 20). Because this function puts the most frequently occurring agencies first, this will show us only the twenty most frequently occurring agencies in our filtered results set, which should only include records for articles written by authors affiliated with the Center for Translational Medicine at Thomas Jefferson University.

Last Reviewed: August 6, 2021

The Insider's Guide to Accessing NLM Data
