xtract: Filtering output with Conditional arguments


This documentation reflects EDirect version 7.60, released on 11/14/2017.

We strive to keep this documentation up-to-date with the latest release. If you are looking for documentation on a more recent version of EDirect, or to find out more about new EDirect releases, please see the Release Notes of NCBI's EDirect documentation.


When using xtract to process XML data at the end of a data pipeline that includes an esearch or efilter command, you can use PubMed’s suite of search tools to limit your output by adding further search criteria to your esearch or efilter. However, there may be cases where you wish to restrict your output based on data that is not traditionally searchable in PubMed. To accommodate this, xtract offers a group of arguments which include or exclude data from the output table, based on whether or not specific conditions are met.

(Note that the xtract Conditional arguments have been redesigned and improved substantially in recent EDirect releases, starting with EDirect Version 5.20 in October 2016.)

Limiting output based on the presence or absence of an element or attribute

By default, the xtract command creates a new row for each occurrence of the XML element identified in the -pattern argument. You can use the -if argument to instruct xtract to only create a row for a given pattern if the pattern contains the element or attribute identified in the -if argument. Otherwise, xtract will skip the pattern and move on to the next pattern.

For example, the command

xtract -pattern Author -if Identifier -element LastName Initials Identifier

will output the last name, initials and <Identifier> (usually the author’s ORCID) of each <Author> in the XML input, if the <Author> contains an <Identifier> element. If the <Author> does not contain an <Identifier>, the <Author> is skipped.

The -unless argument functions exactly like the -if argument, but reversed: xtract will create a row for every pattern, unless the pattern contains the element or attribute identified in the -unless argument. For example, the command

xtract -pattern PubmedArticle -unless MeshHeading -element MedlineCitation/PMID

will output the PMID of each <PubmedArticle>, unless the <PubmedArticle> contains a <MeshHeading> element.
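The presence test behind -if and -unless can be sketched in Python. This is only an emulation of the behavior described above, not xtract's own implementation; the sample XML, the ORCID value and the helper name rows_if_present are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; real PubMed XML has many more elements.
SAMPLE = """
<AuthorList>
  <Author><LastName>Smith</LastName><Initials>BH</Initials>
    <Identifier Source="ORCID">0000-0000-0000-0001</Identifier></Author>
  <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
</AuthorList>
"""

def rows_if_present(root, pattern, required, columns, invert=False):
    """Return one row per `pattern` element that contains `required`.

    invert=True mimics -unless: keep only patterns that lack `required`.
    """
    rows = []
    for node in root.iter(pattern):
        present = node.find(f".//{required}") is not None
        if present == invert:
            continue  # condition failed: skip this pattern entirely
        rows.append([el.text for col in columns for el in node.iter(col)])
    return rows

root = ET.fromstring(SAMPLE)
# Like: -pattern Author -if Identifier -element LastName Initials Identifier
with_id = rows_if_present(root, "Author", "Identifier",
                          ["LastName", "Initials", "Identifier"])
# Like: -pattern Author -unless Identifier -element LastName
without_id = rows_if_present(root, "Author", "Identifier",
                             ["LastName"], invert=True)
```

In this sketch, a pattern that fails the condition produces no row at all, which matches the "skip the pattern and move on" behavior described above.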

Limiting output based on the value of an element or attribute

The previous example showed one way of excluding records that have been indexed with MeSH from your output. However, as in many cases with EDirect, there are several ways to accomplish this goal. All PubMed records that have been indexed with MeSH also have a citation status of “MEDLINE”.

Using -if or -unless, we can also include or exclude data based on the value/contents of a particular element or attribute by using the -equals argument.

-if Element -equals String
-unless Element -equals String

With this syntax, we could re-write our previous command to exclude all records that have a citation status of “MEDLINE”, rather than to exclude all records that contain a <MeshHeading> element:

xtract -pattern PubmedArticle -unless MedlineCitation@Status -equals MEDLINE -element MedlineCitation/PMID

We could also use this method to only include certain elements, based on the value of one of the element’s attributes:

xtract -pattern DescriptorName -if DescriptorName@MajorTopicYN -equals Y -element DescriptorName

This command creates a new row for each <DescriptorName> element that has a “MajorTopicYN” attribute with a value of “Y”, and outputs the <DescriptorName>. In effect, it lists the MeSH Headings in the input XML that are flagged as Major Topics, and excludes all other MeSH Headings.
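As a rough illustration of what the attribute test selects, the following Python sketch applies the same condition to a small, made-up fragment. It emulates the described behavior and is not how xtract is implemented; the sample headings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with one major-topic heading and one that is not.
SAMPLE = """
<MeshHeadingList>
  <MeshHeading><DescriptorName MajorTopicYN="Y">Asthma</DescriptorName></MeshHeading>
  <MeshHeading><DescriptorName MajorTopicYN="N">Humans</DescriptorName></MeshHeading>
</MeshHeadingList>
"""

root = ET.fromstring(SAMPLE)

# Like: -pattern DescriptorName -if DescriptorName@MajorTopicYN -equals Y
major_topics = [
    d.text
    for d in root.iter("DescriptorName")
    if d.get("MajorTopicYN") == "Y"  # Element@Attribute -equals String
]
```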

In general, strings specified in an -equals argument do not need to be enclosed in quotes. However, if the string includes a space, enclose the entire string in double quotes to ensure that xtract matches the entire string.

Alternatives to -equals

While -equals is useful if we know the exact value we want to match, -if and -unless also work with a series of other arguments that allow more flexibility in creating conditions:

  • -equals: The element/attribute must exactly match the string you specify.
  • -contains: The element/attribute must contain the string you specify.
  • -starts-with: The element/attribute must start with the string you specify.
  • -ends-with: The element/attribute must end with the string you specify.
  • -is-not: The element/attribute must not match the string you specify.
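The string tests listed above map onto ordinary string operations. The Python sketch below is an illustration only: the dictionary and the matches helper are hypothetical names, and the comparison here is case-sensitive, which is an assumption (this section does not say how xtract treats letter case).

```python
# Each flag from the list above paired with an equivalent string test.
# Assumption: comparisons are case-sensitive in this sketch.
STRING_TESTS = {
    "-equals": lambda value, s: value == s,
    "-contains": lambda value, s: s in value,
    "-starts-with": lambda value, s: value.startswith(s),
    "-ends-with": lambda value, s: value.endswith(s),
    "-is-not": lambda value, s: value != s,
}

def matches(value, flag, string):
    """Return True if `value` satisfies the condition `flag string`."""
    return STRING_TESTS[flag](value, string)

status = "PubMed-not-MEDLINE"
exact = matches(status, "-equals", "MEDLINE")
partial = matches(status, "-contains", "MEDLINE")
```

The two results differ because -equals requires the whole value to match, while -contains only needs the string to appear somewhere in the value.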

If the element or attribute specified in your -if or -unless argument contains numeric data (like Year, Volume, Issue, etc.), you may prefer to use Conditional arguments that treat data as numbers, not as strings:

  • -gt: The value of the element/attribute is greater than the number you specify.
  • -ge: The value of the element/attribute is greater than or equal to the number you specify.
  • -lt: The value of the element/attribute is less than the number you specify.
  • -le: The value of the element/attribute is less than or equal to the number you specify.
  • -eq: The value of the element/attribute is equal to the number you specify.
  • -ne: The value of the element/attribute is not equal to the number you specify.
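To see why the numeric arguments matter, compare how text and numbers are ordered. The following Python sketch is a hedged illustration: numeric_match is a hypothetical helper, and treating non-numeric text as a failed test is an assumption, not documented xtract behavior.

```python
import operator

# Each numeric flag from the list above paired with its comparison.
NUMERIC_TESTS = {
    "-gt": operator.gt,
    "-ge": operator.ge,
    "-lt": operator.lt,
    "-le": operator.le,
    "-eq": operator.eq,
    "-ne": operator.ne,
}

def numeric_match(value, flag, number):
    """Compare element text to `number` as integers."""
    try:
        return NUMERIC_TESTS[flag](int(value), number)
    except ValueError:
        return False  # assumption: non-numeric text fails the test

volume = "9"
as_number = numeric_match(volume, "-lt", 10)  # 9 < 10 as numbers
as_string = volume < "10"                     # "9" sorts after "1..." as text
```

Compared as text, "9" is not less than "10", because the comparison goes character by character; compared as numbers, it is.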

Using -if and -unless with Exploration arguments

The xtract Conditional arguments can be especially powerful when combined with Exploration arguments. By applying an -if or -unless argument to a -block argument, data from an individual -block can be included or excluded, based on whether the -block meets the conditions specified by -if or -unless.

For example, if we wanted the PMID of each PubMed record in our XML file, along with the article’s DOI (if one is provided), we could use the following command:

xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block ArticleId -if ArticleId@IdType -equals doi -element ArticleId

This command creates a new row for each PubMed record, and prints the record’s PMID in the first column. The command then looks at each occurrence of the <ArticleId> element within the “PubmedArticle” -pattern individually. If an occurrence of the <ArticleId> element has an “IdType” attribute of “doi”, the second -element argument will display the <ArticleId>. This will result in a two-column table, where the first column is always a record’s PMID and the second column is either a record’s DOI (if one is provided) or blank (if no DOI is provided).
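The difference between a condition on the -pattern and a condition on a -block can be sketched in Python. This emulates the behavior described above under stated assumptions: the sample records, PMIDs, DOI and the helper pmid_and_ids are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: the first has a DOI, the second does not.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>111</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">111</ArticleId>
      <ArticleId IdType="doi">10.1000/example.1</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>222</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">222</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""

def pmid_and_ids(root, wanted_types):
    """One row per record: the PMID, then any ArticleId of a wanted type."""
    rows = []
    for article in root.iter("PubmedArticle"):       # -pattern PubmedArticle
        row = [article.findtext("MedlineCitation/PMID")]
        for art_id in article.iter("ArticleId"):     # -block ArticleId
            if art_id.get("IdType") in wanted_types:  # -if ArticleId@IdType
                row.append(art_id.text)
        rows.append(row)  # the row is kept even if no block matched
    return rows

table = pmid_and_ids(ET.fromstring(SAMPLE), {"doi"})
```

Because the condition is attached to the block rather than the pattern, the record without a DOI still gets a row; only its second column is empty.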

Combining conditions with -and/-or

The xtract command allows you to restrict output based on multiple conditions, using Boolean AND/OR, with the -and and -or arguments.

When imposing multiple conditions in the same xtract command, the first condition is written as usual, using -if or -unless, with an optional -equals, -contains, etc. The syntax for subsequent conditions is almost identical, though we replace the -if or -unless with -or or -and.

We can include or exclude data if any one of multiple conditions is met by using -or. Modifying our previous example, we could use the -or argument to populate the second column of our output if the <ArticleId> element has an “IdType” attribute of “doi” or of “pmc”:

xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block ArticleId -if ArticleId@IdType -equals doi -or ArticleId@IdType -equals pmc -element ArticleId

This will result in a two-column table, where the first column is always a record’s PMID and the second column is either a record’s DOI (if one is provided), a record’s PMCID (if one is provided), both a DOI and a PMCID (if both are provided) or blank (if neither is provided).
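Logically, -or keeps a block when at least one condition holds. A minimal Python sketch of that rule follows; it is an emulation, not xtract's code, and the identifier values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical identifiers for a single record.
SAMPLE = """
<ArticleIdList>
  <ArticleId IdType="pubmed">111</ArticleId>
  <ArticleId IdType="doi">10.1000/example.1</ArticleId>
  <ArticleId IdType="pmc">PMC0000001</ArticleId>
</ArticleIdList>
"""

# -if ArticleId@IdType -equals doi -or ArticleId@IdType -equals pmc
conditions = [
    lambda e: e.get("IdType") == "doi",
    lambda e: e.get("IdType") == "pmc",
]

kept = [
    e.text
    for e in ET.fromstring(SAMPLE).iter("ArticleId")
    if any(test(e) for test in conditions)  # any one condition is enough
]
```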

We can also include or exclude data only if every one of multiple conditions is met by using -and:

xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -if LastName -equals Smith -and Initials -equals BH -element Affiliation

This command creates a new row for each PubMed record, and prints the record’s PMID in the first column. The command then looks at each occurrence of the <Author> element within the “PubmedArticle” -pattern individually. If an occurrence of the <Author> element has a <LastName> of “Smith” and an <Initials> of “BH”, the second -element argument will display the <Affiliation>. This will result in a two-column table, where the first column is always a record’s PMID and the second column is either affiliation data for the Author BH Smith (if BH Smith is one of the authors on the record) or blank (if BH Smith is not one of the record’s authors).
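With -and, both conditions must hold for the same block. The Python sketch below illustrates this on invented data; the author names, affiliations and the helper affiliations_for are hypothetical, and the code only emulates the behavior described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical authors: same last name, different initials.
SAMPLE = """
<AuthorList>
  <Author><LastName>Smith</LastName><Initials>BH</Initials>
    <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
  </Author>
  <Author><LastName>Smith</LastName><Initials>J</Initials>
    <AffiliationInfo><Affiliation>Other Institute</Affiliation></AffiliationInfo>
  </Author>
</AuthorList>
"""

def affiliations_for(root, last_name, initials):
    """Collect affiliations of authors matching both name parts."""
    found = []
    for author in root.iter("Author"):  # -block Author
        # -if LastName -equals ... -and Initials -equals ...
        if (author.findtext("LastName") == last_name
                and author.findtext("Initials") == initials):
            found.extend(a.text for a in author.iter("Affiliation"))
    return found

bh_smith = affiliations_for(ET.fromstring(SAMPLE), "Smith", "BH")
```

The second author shares the last name but fails the initials test, so that author's affiliation is left out.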

As with -if and -unless, -and and -or can be used with any of the alternatives to -equals (e.g. -contains, -starts-with, etc.) described above.

(Note that there is no -not argument for Boolean NOT, as -if and -unless are functional opposites, and make an additional -not argument unnecessary.)

Limiting output based on position

The -if and -unless arguments include data based on the presence, absence or value of an element or attribute. The -position argument can include data based on its position in an XML document.

The -position argument is used with one of the Exploration arguments, such as -block, and tells xtract to include only data from a single occurrence of a block element that is in a specific position relative to other occurrences of that block. For example:

xtract -pattern PubmedArticle -block Author -position 3 -element LastName

would output the LastName element from the third Author in a given PubMed record.

When using -position, you can specify an integer, as demonstrated above. However, you can also use -position first to include only the first occurrence of the block (equivalent to -position 1), or -position last to include only the last occurrence of the block. Using -position last can be especially useful when you are unsure how many occurrences of the block there will be in each pattern.
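The three forms of -position described above can be sketched as a selection over a list of block occurrences. This Python illustration uses a hypothetical helper, select_position, and assumes that a position beyond the end of the list simply selects nothing.

```python
def select_position(items, position):
    """Pick the occurrence at a 1-based position, or "first" / "last"."""
    if position == "first":
        return items[:1]
    if position == "last":
        return items[-1:]
    index = int(position) - 1  # positions count from 1, lists from 0
    if index < 0:
        return []
    return items[index:index + 1]  # assumption: out of range selects nothing

# Hypothetical author last names for one record, in document order.
authors = ["Adams", "Baker", "Clark", "Davis"]

third = select_position(authors, 3)
first = select_position(authors, "first")
last = select_position(authors, "last")
```

Note that "last" works without knowing how many authors the record has, which is the convenience the paragraph above points out.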