xtract: Exploration arguments


This documentation reflects EDirect version 7.60, released on 11/14/2017.

We strive to keep this documentation up to date with the latest release. For documentation on a more recent version of EDirect, or to learn more about new EDirect releases, please see the Release Notes of NCBI's EDirect documentation.


When using xtract, there may be times when you want to group multiple elements together in your output. This can be especially useful to link multiple child elements of the same parent together. The xtract command includes a series of arguments that help with this. These arguments, including -group, -block, and -subset, are referred to as Exploration arguments, because they help you explore subsections of your XML document.

When to use Exploration arguments

For example, suppose you want an xtract command that outputs article PMIDs and author names (last names and initials), with each row representing a different PubMed article and a “|” separating the columns. You might use the following command:

xtract -pattern PubmedArticle -tab "|" -sep "|" -element MedlineCitation/PMID LastName Initials

This command would work if each article had only one author. However, if an article has more than one author, the output may not be what you expect:

PMID1|LastName1.1|LastName1.2|LastName1.3|Initials1.1|Initials1.2|Initials1.3
PMID2|LastName2.1|LastName2.2|Initials2.1|Initials2.2
[...]

Because -element creates a column populated with every instance of the element or attribute in the -pattern, xtract creates one column containing every <LastName> element in the -pattern and another containing every <Initials> element in the -pattern. Because -sep is also set to "|" in this command, the values within each column are separated by the same character as the columns themselves, so all of the last names appear before all of the initials. If you want to group each individual <LastName> element with the individual <Initials> element that shares its parent <Author> element, you can use an Exploration argument.
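The column-by-column behavior described above can be sketched in Python. This is only an emulation written for illustration, not xtract's actual implementation: the sample XML is a hypothetical, heavily trimmed record, and the function name rows_without_block is an assumption introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample record; real PubMed XML has many more fields.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>PMID1</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>LastName1.1</LastName><Initials>Initials1.1</Initials></Author>
          <Author><LastName>LastName1.2</LastName><Initials>Initials1.2</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def rows_without_block(xml_text, tab="|", sep="|"):
    """Emulate -pattern PubmedArticle -element MedlineCitation/PMID LastName Initials."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        columns = []
        # Each requested element becomes one column holding EVERY match in the pattern.
        for path in ("MedlineCitation/PMID", ".//LastName", ".//Initials"):
            values = [node.text for node in article.findall(path)]
            if values:
                columns.append(sep.join(values))  # sep joins values inside a column
        rows.append(tab.join(columns))            # tab joins the columns of a row
    return rows

for row in rows_without_block(SAMPLE):
    print(row)
```

With the sample above, every last name is collected before any initials, which mirrors the ungrouped output shown earlier.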

How to use Exploration arguments

Continuing the previous example, we could modify our command to connect each individual <LastName> element with its corresponding <Initials> element by using the -block argument:

xtract -pattern PubmedArticle -tab "|" -sep "|" -element MedlineCitation/PMID -block Author -element LastName Initials

As with most xtract commands, this command scans through the XML input from the beginning. When it encounters an occurrence of the element specified in the -pattern argument, xtract will start a new row of output. The command will then scan through the -pattern until it encounters the first occurrence of the XML element specified in the -block argument (in this case, <Author>).

The command will then scan through the -block, looking for every instance of the first element or attribute specified in the -element argument. xtract will populate the first column with each instance of the element or attribute encountered within the -block.

When xtract reaches the end of the -block, it goes back to the beginning of the -block and begins looking for every instance of the second element or attribute specified in the -element argument (if there is one). This process repeats, creating new columns for each -element until all of the elements specified in -element have been returned.

The command will then look for the next occurrence of the element specified in the -block argument. If another occurrence of the -block element is found, the command repeats the above process, retrieving occurrences of each element within the -block, before moving on to look for a new -block.

The result of this process is an output that resembles the following:

PMID1|LastName1.1|Initials1.1|LastName1.2|Initials1.2|LastName1.3|Initials1.3
PMID2|LastName2.1|Initials2.1|LastName2.2|Initials2.2
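The scanning process described above can be sketched in Python as a pair of nested loops. This is an illustrative emulation only, not xtract's own code: the sample XML is a hypothetical, trimmed record set, and rows_with_block is a name introduced here for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample with two articles; real PubMed XML is much larger.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>PMID1</PMID><Article><AuthorList>
    <Author><LastName>LastName1.1</LastName><Initials>Initials1.1</Initials></Author>
    <Author><LastName>LastName1.2</LastName><Initials>Initials1.2</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>PMID2</PMID><Article><AuthorList>
    <Author><LastName>LastName2.1</LastName><Initials>Initials2.1</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""

def rows_with_block(xml_text, tab="|", sep="|"):
    """Emulate -element MedlineCitation/PMID -block Author -element LastName Initials."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):  # -pattern: one row each
        columns = [node.text for node in article.findall("MedlineCitation/PMID")]
        for author in article.iter("Author"):                      # -block: visit each Author
            for name in ("LastName", "Initials"):                  # -element: one column per name
                values = [node.text for node in author.iter(name)]
                if values:
                    columns.append(sep.join(values))
        rows.append(tab.join(columns))
    return rows

for row in rows_with_block(SAMPLE):
    print(row)
```

Because the inner loops finish one <Author> before moving to the next, each last name is followed immediately by the matching initials.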

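To make this command even more effective, you can combine multiple elements or attributes in a single column by separating them with a comma. By modifying the command slightly, listing LastName,Initials as a single comma-separated item and changing -sep to a space: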

xtract -pattern PubmedArticle -tab "|" -sep " " -element MedlineCitation/PMID -block Author -element LastName,Initials

the output will change to:

PMID1|LastName1.1 Initials1.1|LastName1.2 Initials1.2|LastName1.3 Initials1.3
PMID2|LastName2.1 Initials2.1|LastName2.2 Initials2.2
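The effect of the comma can also be sketched in Python. As before, this is an emulation under stated assumptions rather than xtract's real logic: the sample XML is hypothetical and trimmed, and rows_with_combined_column is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample record; real PubMed XML has many more fields.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>PMID1</PMID><Article><AuthorList>
    <Author><LastName>LastName1.1</LastName><Initials>Initials1.1</Initials></Author>
    <Author><LastName>LastName1.2</LastName><Initials>Initials1.2</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""

def rows_with_combined_column(xml_text, tab="|", sep=" "):
    """Emulate -block Author -element LastName,Initials with -sep set to a space."""
    rows = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        columns = [node.text for node in article.findall("MedlineCitation/PMID")]
        for author in article.iter("Author"):
            # The comma-separated names share ONE column, so their values
            # are joined with sep instead of being split across columns.
            values = [node.text
                      for name in ("LastName", "Initials")
                      for node in author.iter(name)]
            if values:
                columns.append(sep.join(values))
        rows.append(tab.join(columns))
    return rows

for row in rows_with_combined_column(SAMPLE):
    print(row)
```

Each author now occupies a single column, with the last name and initials separated by the -sep character.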

The Exploration hierarchy

All of the previous examples use only the -block argument. However, for advanced uses, there is a multi-level hierarchy of Exploration arguments, allowing you to explore sections, subsections, and sub-subsections of an XML document in the same xtract command. Technically, -pattern is itself an Exploration argument, as it subdivides the input XML into smaller sections and connects the elements within each section (by placing them in the same row). From largest to smallest, the Exploration arguments are:

-pattern
    -division
        -group
            -branch
                -block
                    -section
                        -subset
                            -unit
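The ordering above can be captured in a small Python lookup, shown here only as a memory aid. The helper names nesting_depth and can_nest are hypothetical, and the sketch encodes nothing beyond the largest-to-smallest list itself; it makes no claim about how xtract validates a command.

```python
# Exploration arguments from largest to smallest, copied from the list above.
EXPLORATION_LEVELS = [
    "-pattern", "-division", "-group", "-branch",
    "-block", "-section", "-subset", "-unit",
]

def nesting_depth(argument):
    """Return 0 for -pattern, 1 for -division, and so on (hypothetical helper)."""
    return EXPLORATION_LEVELS.index(argument)

def can_nest(outer, inner):
    """True if `inner` explores a smaller subdivision than `outer`."""
    return nesting_depth(outer) < nesting_depth(inner)

# Print the hierarchy with one extra level of indentation per step down.
for level in EXPLORATION_LEVELS:
    print("    " * nesting_depth(level) + level)

print(can_nest("-pattern", "-block"))  # -block sits below -pattern
print(can_nest("-subset", "-group"))   # -group is larger than -subset
```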

For most use cases, only -pattern and -block will be necessary. To explore deeply nested XML, you may wish to use -group and -subset as well. The remaining Exploration arguments are available for unusual cases. For more about using Exploration arguments, please visit NCBI’s EDirect documentation page, “Entrez Direct: E-utilities on the UNIX Command Line”.