xtract
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
The xtract command lets you choose specific data from an XML file, extracts that data, and displays it in a tabular format. The xtract command is highly customizable: you can choose which data is output, how it is arranged, and how the table is formatted.
Because xtract is so flexible, it can take some time to understand all of the options available and how they can best be used.
Input
The xtract command accepts as input structured XML data from a variety of sources. The XML data can be:
- piped in from an efetch -format xml command.
- piped in from a cat file.xml command (where file.xml is any XML file).
- or, supplied via the xtract argument -input:
xtract -input file.xml
(Note that the -input argument is a more recent addition to xtract. It was added in EDirect version 5.00, in September 2016.)
Output
Data extracted from an XML file, arranged in a tabular format.
Working with XML
In order to use xtract effectively, it is helpful to have a basic understanding of structured XML data. For a brief overview of XML, you may want to visit W3Schools’ XML Tutorial.
The xtract command has several arguments that identify particular portions of an XML document. xtract uses these arguments to select and arrange data for output.
To specify most XML elements, simply provide the name of the element. Unix and EDirect are case-sensitive, so be sure to check spelling and capitalization. The command below specifies the XML element <Author> in the -pattern argument, and the XML element <LastName> in the -element argument.
xtract -pattern Author -element LastName
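To make the roles of -pattern and -element concrete, the following is a rough Python analogue of the command above, using only the standard library. It is a sketch of the selection logic, not xtract's actual implementation; the sample XML and the helper name rows_for are hypothetical, invented for illustration. Each element named by the pattern yields one output row, and the values found inside it are joined with tabs.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modeled on a PubMed author list.
SAMPLE_XML = """<AuthorList>
  <Author><LastName>Smith</LastName><ForeName>Jane</ForeName></Author>
  <Author><LastName>Garcia</LastName><ForeName>Luis</ForeName></Author>
</AuthorList>"""


def rows_for(xml_text, pattern, element):
    """Return one tab-joined row per <pattern> element.

    Each row lists the text of every <element> found inside that pattern.
    Names are compared exactly, so capitalization matters.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        values = [node.text or "" for node in record.iter(element)]
        rows.append("\t".join(values))
    return rows


# Analogue of: xtract -pattern Author -element LastName
rows = rows_for(SAMPLE_XML, "Author", "LastName")
print("\n".join(rows))
```

Because names are matched exactly, asking for "author" instead of "Author" matches nothing in this sketch, which mirrors the case-sensitivity noted above.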
To specify an attribute of an XML element, provide the name of the element, followed by “@”, followed by the name of the attribute. The command below specifies the XML element <PubmedArticle> in the -pattern argument, and the attribute “Status” of the XML element <MedlineCitation> in the -element argument.
xtract -pattern PubmedArticle -element MedlineCitation@Status
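As an illustration of the Element@Attribute notation, the Python sketch below splits the selector at “@” and reads the named attribute from each matching element. This is an analogue under stated assumptions, not how xtract itself works: the helper name attribute_rows, the PMIDs, and the trimmed-down record structure are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed records; real PubMed XML has many more fields.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE"><PMID>111</PMID></MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation Status="PubMed-not-MEDLINE"><PMID>222</PMID></MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def attribute_rows(xml_text, pattern, selector):
    """Return one row per <pattern>, holding the attribute named in 'Element@Attribute'."""
    element, attribute = selector.split("@", 1)
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        values = [
            node.get(attribute)
            for node in record.iter(element)
            if attribute in node.attrib  # skip elements lacking the attribute
        ]
        rows.append("\t".join(values))
    return rows


# Analogue of: xtract -pattern PubmedArticle -element MedlineCitation@Status
status_rows = attribute_rows(SAMPLE_XML, "PubmedArticle", "MedlineCitation@Status")
print("\n".join(status_rows))
```

The sketch returns the attribute's value rather than the element's text, so each record contributes its Status and not its PMID.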
In some circumstances, an XML document may have multiple elements with the same name located in different parts of the XML hierarchy. To specify an XML element in a specific location in the document, you can use a slash (/) to indicate Parent/Child construction: provide the name of the parent element, followed by “/”, followed by the name of the child element. The command below specifies the XML element <PubmedArticle> in the -pattern argument, and, in the -element argument, the element <Year> that is a child of the element <PubDate>.
xtract -pattern PubmedArticle -element PubDate/Year
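The difference between a bare element name and the Parent/Child form can be sketched in Python as follows. The record below is a hypothetical, trimmed example in which <Year> appears under both <DateCompleted> and <PubDate>; the helper name child_rows is likewise invented for illustration, and the code is an analogue of the selection, not xtract's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with two <Year> elements in different locations.
SAMPLE_XML = """<PubmedArticle>
  <MedlineCitation>
    <DateCompleted><Year>2016</Year></DateCompleted>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2015</Year></PubDate></JournalIssue></Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def child_rows(xml_text, pattern, selector):
    """Return one row per <pattern>, keeping only children named in 'Parent/Child'."""
    parent, child = selector.split("/", 1)
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        values = [
            node.text or ""
            for parent_node in record.iter(parent)
            for node in parent_node.findall(child)  # direct children only
        ]
        rows.append("\t".join(values))
    return rows


# A bare name matches every <Year>, wherever it sits in the record.
all_years = [node.text for node in ET.fromstring(SAMPLE_XML).iter("Year")]

# Analogue of: xtract -pattern PubmedArticle -element PubDate/Year
pubdate_years = child_rows(SAMPLE_XML, "PubmedArticle", "PubDate/Year")
print(all_years, pubdate_years)
```

In this sketch the bare name picks up both years, while the Parent/Child selector keeps only the one under <PubDate>.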
Working with PubMed XML
When using xtract with PubMed XML, it is important to be familiar with the structure and contents of PubMed data. For more information on PubMed XML, please see MEDLINE/PubMed XML Data Elements.