"EDirect for PubMed: Part 4: xtract Conditional Arguments" Sample Code

As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.

Below you will find sample code for the examples, in-class exercises and homework presented in the fourth session of the “EDirect for PubMed” Insider’s Guide class. These examples are written for use with EDirect in a Unix environment. If you need help installing and setting up EDirect, please see our “Installing EDirect” page.

For more examples, please see the sample code from the other parts of “EDirect for PubMed”.

The code below is lightly annotated to explain how it works, but if you are looking for more information, we suggest you review our EDirect documentation.

There are many different ways to answer the questions discussed in class. The sample code below provides some options, but by no means the only options. Feel free to modify, adapt, edit, re-use or completely discard any of the suggestions below when trying to find a solution that works best for you.

xtract Conditional arguments
Combining multiple Conditional arguments
xtract and the -position argument
- Include only the First Author in the output table
Dealing with blanks
- Specify a placeholder to replace blank spaces in the output table
In-class exercise solutions
Homework solutions

xtract Conditional arguments

For an introduction to the xtract Conditional arguments, see the Filtering output with Conditional arguments section of our EDirect documentation.

Include only authors with ORCID IDs in the output table

efetch -db pubmed -id 27460563,27298442,27392493,27363997,27298443 -format xml | \
xtract -pattern Author -if Identifier -sep " " -element LastName,Initials Identifier

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 27460563,27298442,27392493,27363997,27298443) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for each <Author> element (-pattern Author), but only if the <Author> element contains an <Identifier> element, which is where an author’s ORCID ID is stored (-if Identifier). If an author does not have an ORCID ID, the author will not have an <Identifier> element; no row is created for the author, and xtract skips to the next author.

The command creates two columns for each row: one with the author’s last name and initials (-element LastName,Initials) separated with a single space (-sep " "), and one with the author’s ORCID ID (Identifier).
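
To make the row-skipping behavior concrete, here is a minimal Python sketch that applies the same logic to a tiny hand-written XML fragment. It is not how xtract is implemented; the sample data and the function name authors_with_identifier are invented for illustration, and the tab-delimited row format is an assumption about the general shape of the output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not real PubMed data.
SAMPLE_XML = """
<AuthorList>
  <Author>
    <LastName>Rivera</LastName>
    <Initials>AB</Initials>
    <Identifier Source="ORCID">0000-0000-0000-0001</Identifier>
  </Author>
  <Author>
    <LastName>Chen</LastName>
    <Initials>C</Initials>
  </Author>
</AuthorList>
"""


def authors_with_identifier(xml_text):
    """Return one tab-delimited row per <Author> that contains an <Identifier>."""
    rows = []
    # One candidate row per <Author>, mirroring -pattern Author.
    for author in ET.fromstring(xml_text).iter("Author"):
        identifier = author.findtext(".//Identifier")
        if identifier is None:
            # No <Identifier>: skip this author, mirroring -if Identifier.
            continue
        name = f"{author.findtext('LastName', '')} {author.findtext('Initials', '')}"
        rows.append(f"{name}\t{identifier}")
    return rows


for row in authors_with_identifier(SAMPLE_XML):
    print(row)
```

Only the first sample author produces a row; the second has no <Identifier> element and is skipped.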

Include only articles from the journal JAMA in the output table

efetch -db pubmed -id 27460563,27532912,27392493,27363997,24108526 -format xml | \
xtract -pattern PubmedArticle -if ISOAbbreviation -equals JAMA -element Volume Issue

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 27460563,27532912,27392493,27363997,24108526) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The command creates two columns for each row: one with the article’s Volume number, one with the article’s Issue number (-element Volume Issue).

Include only MEDLINE articles in the output table

efetch -db pubmed -id 27460563,27532912,27392493,27363997,24108526 -format xml | \
xtract -pattern PubmedArticle -if MedlineCitation@Status -equals MEDLINE -element MedlineCitation/PMID

The command creates a single column for each row, containing the record’s PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element; -element MedlineCitation/PMID).

Include only authors whose affiliation mentions Japan in the output table

efetch -db pubmed -id 27460563,27532912,27392493,27363997,24108526 -format xml | \
xtract -pattern Author -if Affiliation -contains Japan -sep " " -element LastName,Initials Affiliation

The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for each <Author> element (-pattern Author), but only if the <Author> element contains an <Affiliation> element which includes the word “Japan”. If an author does not have affiliation data, or the author’s affiliation data does not contain Japan, no row is created for the author, and xtract skips to the next author (-if Affiliation -contains Japan).

Output a list of PMIDs and corresponding DOIs

efetch -db pubmed -id 16940437,16049336,11972038 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block ArticleId -if ArticleId@IdType -equals doi -element ArticleId

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 16940437,16049336,11972038) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

Beginning on the second line, the xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle). The first column of each row will contain the record’s PMID, using Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record (e.g. as a child of a <CommentsCorrections> element; -element MedlineCitation/PMID).

The xtract command continues on the third line by checking each <ArticleId> element in a PubMed record (-block ArticleId). If an <ArticleId> element contains a DOI (indicated by the “IdType” attribute for the <ArticleId> equaling “doi”; -if ArticleId@IdType -equals doi), then the command puts the DOI in the second column (-element ArticleId). If not, the second column is left blank.

The result of this command will be a two column table, where the first column is always a PMID, and the second column is either the corresponding DOI (if there is one), or is blank (if there is no DOI).

Combining multiple Conditional arguments

Output a list of PMIDs and corresponding DOIs and PMCIDs

efetch -db pubmed -id 16940437,16049336,11972038 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block ArticleId -if ArticleId@IdType -equals doi \
-or ArticleId@IdType -equals pmc -element ArticleId

The xtract command continues on the third line by checking each <ArticleId> element in a PubMed record (-block ArticleId). If an <ArticleId> element contains a DOI (indicated by the “IdType” attribute for the <ArticleId> equaling “doi”; -if ArticleId@IdType -equals doi) OR a PMC ID (indicated by the “IdType” attribute for the <ArticleId> equaling “pmc”; -or ArticleId@IdType -equals pmc), then the command puts the contents of the <ArticleId> element in the second column (-element ArticleId).

Because a PubMed record can have multiple <ArticleId> elements, and because the -block argument checks each <ArticleId> separately, this command may result in both a DOI and a PMC ID appearing in the second column of some rows.
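
The per-element check can be sketched in Python as follows. This is a rough illustration of the logic on invented sample records, not a reproduction of xtract's output format; the function name id_rows and its wanted_types parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not real PubMed data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>111</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">111</ArticleId>
      <ArticleId IdType="doi">10.1000/example.1</ArticleId>
      <ArticleId IdType="pmc">PMC0000001</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>222</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">222</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""


def id_rows(xml_text, wanted_types):
    """Return [PMID, matching ids...] for every record, matched or not."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        # The PMID is added before any condition is checked, so every
        # record gets a row (the condition lives inside the block).
        row = [article.findtext("MedlineCitation/PMID")]
        # Each <ArticleId> is tested on its own, mirroring -block ArticleId.
        for article_id in article.iter("ArticleId"):
            if article_id.get("IdType") in wanted_types:
                row.append(article_id.text)
        rows.append(row)
    return rows


print(id_rows(SAMPLE_XML, {"doi"}))         # a single condition
print(id_rows(SAMPLE_XML, {"doi", "pmc"}))  # two conditions joined by OR
```

Because the condition sits inside the block rather than on the pattern, the second sample record still gets a row containing only its PMID.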

Include only authors with the last name Kamal and with affiliation data in the output table

efetch -db pubmed -id 27798514,24372221,24332497,24307782 -format xml | \
xtract -pattern Author -if LastName -equals Kamal -and Affiliation \
-sep " " -element LastName,Initials Affiliation

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 27798514,24372221,24332497,24307782) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for each <Author> element (-pattern Author), but only if the <LastName> element for the <Author> is “Kamal”, AND the <Author> element contains an <Affiliation> element (-if LastName -equals Kamal -and Affiliation). If an author’s last name is not “Kamal” or the author does not have affiliation data, no row is created for the author, and xtract skips to the next author.

Include only PubMed records indexed with the MeSH heading “Microcephaly”, and with any MeSH heading containing the words “Zika Virus” in the output table

efetch -db pubmed -id 27582188,27417495,27409810,27306170,18142192 -format xml | \
xtract -pattern PubmedArticle -if DescriptorName -contains "Zika Virus" \
-and DescriptorName -equals Microcephaly -element MedlineCitation/PMID ArticleTitle

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 27582188,27417495,27409810,27306170,18142192) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

The second line uses the xtract command to retrieve only the elements we need from the XML output, and display those elements in a tabular format. The -pattern argument indicates that we should start a new row for each PubMed record (-pattern PubmedArticle), but only if the record has a <DescriptorName> element that contains the words “Zika Virus” (-if DescriptorName -contains "Zika Virus"), AND a <DescriptorName> element that equals “Microcephaly” (-and DescriptorName -equals Microcephaly). If a record does not have MeSH headings assigned that meet those criteria, no row is created for the record, and xtract skips to the next record. Note that, because of the use of -contains, both the MeSH heading “Zika Virus” and the MeSH heading “Zika Virus Infection” will satisfy the first condition in this command.

The command creates two columns for each row: one with the record’s PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one with the article’s title (-element MedlineCitation/PMID ArticleTitle).
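
Note that the two conditions can be satisfied by two different <DescriptorName> elements in the same record. The Python sketch below illustrates that reading on invented sample records; the function name matching_pmids is hypothetical, and the comparison here is case-sensitive, which is a simplifying assumption and may not match xtract's own rules.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not real PubMed data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Zika Virus Infection</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Microcephaly</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Zika Virus</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>333</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Microcephaly</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def matching_pmids(xml_text):
    """Return PMIDs of records that satisfy both heading conditions."""
    pmids = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        names = [d.text or "" for d in article.iter("DescriptorName")]
        # Each condition is checked against every heading in the record,
        # so different headings may satisfy different conditions.
        contains_zika = any("Zika Virus" in name for name in names)
        equals_microcephaly = any(name == "Microcephaly" for name in names)
        if contains_zika and equals_microcephaly:
            pmids.append(article.findtext("MedlineCitation/PMID"))
    return pmids


print(matching_pmids(SAMPLE_XML))
```

Only the first sample record passes: it meets the first condition through “Zika Virus Infection” and the second through “Microcephaly”.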

xtract and the -position argument

Include only the First Author in the output table

efetch -db pubmed -id 28594955,28594944,28594945,28594943,28594948,28594957 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -position first -sep " " -element LastName,Initials

The first line of this code uses the efetch command to retrieve records from PubMed (-db pubmed -id 28594955,28594944,28594945,28594943,28594948,28594957) in XML format (-format xml), and concludes by piping (|) the resulting XML into a command on the next line (the “\” character at the end of the line allows us to continue our command on the next line, for easier-to-read formatting).

In the third line, xtract looks through each PubMed record for an <Author> element (-block Author). When it finds the first <Author> (-position first), it populates the second column in the row with the first author’s last name and initials, separated by a space (-sep " " -element LastName,Initials).

Dealing with blanks

Specify a placeholder to replace blank spaces in the output table

efetch -db pubmed -id 28594955,28594944,28594945,28594943,28594948,28594957 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -position first -sep " " -def "N/A" -element LastName,Initials Identifier

This series of commands is largely the same as the “Include only the First Author in the output table” example presented above. However, in the third line, we have added the -def argument to specify the placeholder value (“N/A”) for any blank cells in the output table (-def "N/A").
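
As a rough illustration of how the first-author restriction and the placeholder work together, the Python sketch below applies the same idea to invented sample records. The function name first_author_rows and its default parameter are hypothetical, and the sketch only covers records that have at least one author.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not real PubMed data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <AuthorList>
        <Author>
          <LastName>Rivera</LastName><Initials>AB</Initials>
          <Identifier Source="ORCID">0000-0000-0000-0001</Identifier>
        </Author>
        <Author><LastName>Chen</LastName><Initials>C</Initials></Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <AuthorList>
        <Author><LastName>Okafor</LastName><Initials>D</Initials></Author>
        <Author>
          <LastName>Silva</LastName><Initials>E</Initials>
          <Identifier Source="ORCID">0000-0000-0000-0002</Identifier>
        </Author>
      </AuthorList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def first_author_rows(xml_text, default="N/A"):
    """Return [PMID, first author name, ORCID or placeholder] per record."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        row = [article.findtext("MedlineCitation/PMID")]
        authors = list(article.iter("Author"))
        if authors:
            first = authors[0]  # only the first <Author> is used
            name = f"{first.findtext('LastName', '')} {first.findtext('Initials', '')}"
            # Fall back to the placeholder when the value is missing.
            row += [name, first.findtext("Identifier") or default]
        rows.append(row)
    return rows


for row in first_author_rows(SAMPLE_XML):
    print("\t".join(row))
```

In the second sample record, the second author's ORCID ID is ignored because only the first author is examined, so the placeholder appears instead.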

In-class exercise solutions

Note: The first two exercises ask for an xtract command. The solutions below start with efetch commands that retrieve a sample set of PubMed records in XML, which are then piped into the xtract command. This allows us to test and verify the solutions using appropriate sample data.

Exercise 1

Write an xtract command that creates a table with one row per PubMed record, but that only includes PubMed records if they have MeSH headings. Each row should have two columns: PMID and citation status.

Solution:

efetch -db pubmed -id 26277396,29313986,19649173,21906097,25380814 -format xml | \
xtract -pattern PubmedArticle -if MeshHeading -element MedlineCitation/PMID MedlineCitation@Status

This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), but only if the record contains a <MeshHeading> element (-if MeshHeading). If a record does not have MeSH headings attached, no row is created for the record, and xtract skips to the next record.

For each row in the output, xtract creates two columns: one with the record’s PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one with the record’s citation status (-element MedlineCitation/PMID MedlineCitation@Status).

Exercise 2

Write an xtract command that creates a table with one row per PubMed record, but that only includes PubMed records for articles published in one of the JAMA journals (e.g. JAMA cardiology, JAMA oncology, etc.). Each row should have two columns: PMID and journal title abbreviation.

Solution:

efetch -db pubmed -id 27829097,27829076,19649173,21603067,25380814 -format xml | \
xtract -pattern PubmedArticle -if ISOAbbreviation -starts-with JAMA -element MedlineCitation/PMID ISOAbbreviation

This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), but only if the record has a journal title abbreviation that begins with “JAMA” (-if ISOAbbreviation -starts-with JAMA).

For each row in the output, xtract creates two columns: one with the record’s PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one with the article’s journal title abbreviation (-element MedlineCitation/PMID ISOAbbreviation).
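
To see why -starts-with is the right choice here rather than -equals or -contains, the Python sketch below compares the three kinds of test on a short list of journal title abbreviations. The function name matches is hypothetical, and the comparisons are case-sensitive as a simplifying assumption; xtract's own matching rules (for example, its handling of letter case) may differ.

```python
def matches(value, operator, target):
    """Apply one of three string tests, loosely modeled on the xtract arguments."""
    if operator == "equals":
        return value == target
    if operator == "contains":
        return target in value
    if operator == "starts-with":
        return value.startswith(target)
    raise ValueError(f"unsupported operator: {operator}")


abbreviations = ["JAMA", "JAMA Cardiol", "JAMA Oncol", "N Engl J Med"]

for operator in ("equals", "contains", "starts-with"):
    kept = [a for a in abbreviations if matches(a, operator, "JAMA")]
    print(operator, kept)
```

With this list, "equals" keeps only JAMA itself, while "starts-with" keeps JAMA and the two JAMA specialty journals. "contains" gives the same result here, but it would also keep any abbreviation with “JAMA” in the middle.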

Exercise 3

Write a series of commands that generates a list of the different affiliation data used by author BH Smith between 2012 and 2017. The script should output the PMID for each article published by BH Smith in that time frame, along with BH Smith’s affiliation data for each article.

Solution:

esearch -db pubmed -query "smith bh[Author]" -datetype PDAT -mindate 2012 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -if LastName -equals Smith -and Initials -equals BH -sep " " -element LastName,Initials Affiliation

This series of commands searches for publications by the author BH Smith that were published between 2012 and 2017, retrieves the full XML records for each of the search results, extracts the PMID and BH Smith’s affiliation data from each record, and displays the results in a table.

esearch -db pubmed -query "smith bh[Author]" -datetype PDAT -mindate 2012 -maxdate 2017 | \

The first line of code uses esearch to search PubMed (-db pubmed) for articles where “smith bh” is the author (-query "smith bh[Author]"). The line also restricts the search results to articles that were published between 2012 and 2017 (-datetype PDAT -mindate 2012 -maxdate 2017).

The “|” character pipes the results of our esearch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern PubmedArticle -element MedlineCitation/PMID \

Beginning on the third line, the xtract command creates a table with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), and with three columns. The first column of each row will contain the record’s PMID, using Parent/Child construction to specify that we want the <PMID> element that is the child of the <MedlineCitation> element, and not another <PMID> element elsewhere in the PubMed record (e.g. as a child of a <CommentsCorrections> element).

-block Author -if LastName -equals Smith -and Initials -equals BH -sep " " -element LastName,Initials Affiliation

The xtract command continues on the fourth line by checking each <Author> element in a PubMed record (-block Author). If a given author’s <LastName> is Smith AND <Initials> are BH (-if LastName -equals Smith -and Initials -equals BH), the xtract command populates the second column with the author’s last name and initials (separated by a space), and the third column with the author’s affiliation (-sep " " -element LastName,Initials Affiliation). Outputting the last name and initials into the second column is slightly redundant, as we know that they will always be “Smith BH”. However, it is helpful as a confirmation that our Conditional arguments are correct.

Homework solutions

Question 1

Fetch the records for the following list of PMIDs:

28197844,28176235,28161874,28183232,28164731,27937077,28118756,27845598,27049596,27710139

Write an xtract command that outputs the PMID and Article Title, but only for records that have a structured abstract. Hint: in PubMed records, structured abstracts are broken up into multiple AbstractText elements, each with their own “NlmCategory” attribute.

Solution:

efetch -db pubmed -id 28197844,28176235,28161874,28183232,28164731,27937077,28118756,27845598,27049596,27710139 -format xml | \
xtract -pattern PubmedArticle -if AbstractText@NlmCategory -element MedlineCitation/PMID ArticleTitle

The first line of this solution uses efetch to retrieve several records from PubMed in XML format.

The “|” character pipes the results of our efetch into our next command, and the “\” character at the end of the line allows us to continue our string of commands on the next line, for easier-to-read formatting.

This xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle), but only if the record has one or more “AbstractText” elements that contain an “NlmCategory” attribute (-if AbstractText@NlmCategory). This will ensure that only PubMed records with structured abstracts are included.

For each row in the output, xtract creates two columns: one with the record’s PMID (specifically, the contents of the <PMID> element that is a child of the <MedlineCitation> element), one with the article title (-element MedlineCitation/PMID ArticleTitle).

Question 2

Modify your command from Question 1 to display the “RESULTS” section of each structured abstract, if there is one, in place of the Article Title. If there is no “RESULTS” section, display just the PMID, leaving the second column blank. Hint: Use the “NlmCategory” attribute to determine whether a particular AbstractText element contains “RESULTS”.

Solution:

efetch -db pubmed -id 28197844,28176235,28161874,28183232,28164731,27937077,28118756,27845598,27049596,27710139 -format xml | \
xtract -pattern PubmedArticle -if AbstractText@NlmCategory -element MedlineCitation/PMID \
-block AbstractText -if AbstractText@NlmCategory -equals RESULTS -element AbstractText

This solution begins the same as the solution for Question 1. However, rather than including the article title in the first -element argument, the xtract command continues on the third line (with the “\” character at the end of the second line allowing us to continue our string of commands on the next line, for easier-to-read formatting).

-block AbstractText -if AbstractText@NlmCategory -equals RESULTS -element AbstractText

In the third line, the command uses -block to look for an <AbstractText> element (-block AbstractText), then looks within that <AbstractText> element to see if it has an “NlmCategory” attribute with the value “RESULTS” (-if AbstractText@NlmCategory -equals RESULTS). If it does, the command then outputs the contents of the <AbstractText> element in the second column. If the <AbstractText> element does not have an “NlmCategory” with the value “RESULTS”, the command proceeds to check the next <AbstractText> element in the record. The process repeats for each <AbstractText> element in the record.
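
The two levels of checking (an attribute that must exist somewhere in the record, then an attribute value tested on each element) can be sketched in Python as follows. The sample records and the function name results_rows are invented for illustration, and the sketch is an approximation of the logic rather than of xtract's exact output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not real PubMed data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>111</PMID><Article><Abstract>
      <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Context text.</AbstractText>
      <AbstractText Label="RESULTS" NlmCategory="RESULTS">Finding text.</AbstractText>
    </Abstract></Article></MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>222</PMID><Article><Abstract>
      <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Context text.</AbstractText>
      <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Closing text.</AbstractText>
    </Abstract></Article></MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>333</PMID><Article><Abstract>
      <AbstractText>Unstructured abstract text.</AbstractText>
    </Abstract></Article></MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def results_rows(xml_text):
    """Return [PMID, RESULTS text...] for records with a structured abstract."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        sections = list(article.iter("AbstractText"))
        # Record-level test: does any section carry an NlmCategory attribute?
        if not any("NlmCategory" in section.attrib for section in sections):
            continue
        row = [article.findtext("MedlineCitation/PMID")]
        # Element-level test: each section is checked separately.
        for section in sections:
            if section.get("NlmCategory") == "RESULTS":
                row.append(section.text)
        rows.append(row)
    return rows


print(results_rows(SAMPLE_XML))
```

The third sample record is dropped entirely because its abstract is unstructured, while the second keeps its row but has nothing in the second column.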

Question 3

When indexing a record for MEDLINE, indexers can assign MeSH headings (descriptors) to represent concepts found in an article, and MeSH subheadings (qualifiers) to describe a specific aspect of a concept. Indexers denote some of the MeSH headings as “Major Topics” (i.e. one of the primary topics of the article). When assigning a “Major Topic”, the indexer can determine that the heading itself is a major topic, or that a specific heading/subheading pair is a major topic. When a heading/subheading pair is assigned as a Major Topic, only the subheading will be labeled as Major in the PubMed XML.

Write an xtract command that outputs one PubMed record per row. Each row should have the record’s PMID and a pipe-delimited list of all of the MeSH Headings the indexers have determined are Major Topics. Note: the list should only include headings (descriptors), not subheadings (qualifiers). However, if a heading/subheading pair is assigned as major, the list should include that heading.

You can use the following efetch command to retrieve some sample records:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \

Solution:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block MeshHeading -if DescriptorName@MajorTopicYN -equals Y -or QualifierName@MajorTopicYN -equals Y \
-tab "|" -element DescriptorName

This solution uses the example efetch command to retrieve three PubMed records in XML, then outputs a table with one row per PubMed record. Each row begins with the record’s PMID, followed by a pipe-delimited list of all of the MeSH Headings that the indexers have determined are Major Topics.

xtract -pattern PubmedArticle -element MedlineCitation/PMID \

Beginning on the second line, the xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle) with the record’s PMID (-element MedlineCitation/PMID) and additional data, which is specified on the subsequent lines.

-block MeshHeading -if DescriptorName@MajorTopicYN -equals Y -or QualifierName@MajorTopicYN -equals Y \

In the third line, we start to check each <MeshHeading> element to determine if it has been labeled Major. The command uses -block to look for the first <MeshHeading> element in the record (-block MeshHeading). The command then looks within that <MeshHeading> element to see if its child <DescriptorName> element has a “MajorTopicYN” attribute with a value of “Y” (-if DescriptorName@MajorTopicYN -equals Y), or if any of its child <QualifierName> elements have a “MajorTopicYN” attribute with a value of “Y” (-or QualifierName@MajorTopicYN -equals Y). If either of these is true, the MeSH heading has been labeled as a Major Topic, and the command will continue on the next line (see below). If neither of these conditions is true, the command will proceed to the next <MeshHeading> element and repeat the process, looking for MeSH Headings which are Major Topics.

-tab "|" -element DescriptorName

The fourth line specifies that the DescriptorName will appear in the second column of our table (-element DescriptorName). Each indexed record will have at least one Major Topic assigned, and probably more than one. We use the -tab argument to specify a separator between the multiple MeSH descriptors (-tab "|"). It is important to place the -tab argument after the -block, as -block resets any -tab arguments that have been previously specified. We use -tab instead of -sep, as -block automatically creates a new column at the end of each block, so by specifying “|” in our -tab argument, we ensure that our blocks are pipe-delimited.
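
The Major Topic test can be sketched in Python on an invented sample record as follows. The function name major_headings is hypothetical, and the pipe-joined string only approximates the layout of the real output; the point is the heading-or-subheading logic.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not a real PubMed record.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Zika Virus Infection</DescriptorName>
          <QualifierName MajorTopicYN="Y">epidemiology</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="Y">Microcephaly</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Brazil</DescriptorName>
          <QualifierName MajorTopicYN="N">epidemiology</QualifierName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def major_headings(xml_text):
    """Return (PMID, pipe-delimited Major Topic descriptors) per record."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        majors = []
        # Each <MeshHeading> is tested on its own, mirroring -block MeshHeading.
        for heading in article.iter("MeshHeading"):
            descriptor = heading.find("DescriptorName")
            qualifiers = heading.findall("QualifierName")
            is_major = descriptor.get("MajorTopicYN") == "Y" or any(
                q.get("MajorTopicYN") == "Y" for q in qualifiers
            )
            if is_major:
                # Only the descriptor is reported, even when the "Y"
                # flag sits on a qualifier.
                majors.append(descriptor.text)
        rows.append((article.findtext("MedlineCitation/PMID"), "|".join(majors)))
    return rows


print(major_headings(SAMPLE_XML))
```

The first sample heading qualifies only through its subheading, yet it is the heading name that lands in the list, which is the behavior the question asks for.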

Question 4

Write a series of commands to search for articles reporting on clinical trials relating to tularemia and output a table of citations. Each row should include the PMID for an article, as well as the name and affiliation information (if any) for the last author. If the last author does not have affiliation information, put “Not Available” in the last column instead.

Solution:

esearch -db pubmed -query "tularemia clinical trial" | \
efetch -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -position last -sep " " -def "Not Available" -element LastName,Initials Affiliation

This series of commands searches PubMed for the string “tularemia clinical trial”, retrieves the full XML records, and outputs the PMID as well as the last name, initials, and affiliation information (if any) of each article’s last author.

esearch -db pubmed -query "tularemia clinical trial" | \

The first line of this command uses esearch to search PubMed (-db pubmed) for our search query (-query "tularemia clinical trial").

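efetch -format xml | \

The second line takes the esearch results from our first line and uses efetch to retrieve the full records for each of our results in the XML format (-format xml), and pipes the XML output to the next line.

xtract -pattern PubmedArticle -element MedlineCitation/PMID \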

Beginning on the third line, the xtract command creates a table, with each PubMed record in our input populating its own row (xtract -pattern PubmedArticle) with the record’s PMID (-element MedlineCitation/PMID) and additional data, which is specified on the subsequent lines.

-block Author -position last -sep " " -def "Not Available" -element LastName,Initials Affiliation

The fourth line uses the -block and -position arguments to identify the last <Author> element in each record (-block Author -position last). The last name and initials of the last author, separated by a space (-sep " "), are placed in the second column, and the last author’s affiliation information (if present) is placed in the third column (-element LastName,Initials Affiliation). If the last author has no affiliation information, the third column will contain the default value of “Not Available” instead of being left blank (-def "Not Available").

Last Reviewed: August 6, 2021

The Insider's Guide to Accessing NLM Data