xtract: Formatting arguments


This documentation reflects EDirect version 7.60, released on 11/14/2017.

We strive to keep this documentation up-to-date with the latest release. If you are looking for documentation on a more recent version of EDirect, or to find out more about new EDirect releases, please see the Release Notes of NCBI's EDirect documentation.


In addition to using xtract to specify which data to include in your output, you can also use xtract to customize the way your output displays.

Adjusting the separator between columns

By default, the separator between xtract columns is a tab (denoted in Unix as “\t”). This separator can be adjusted using the -tab argument. For example, the line of code:

xtract -pattern PubmedArticle -tab "|" -element PMID ArticleTitle

will output columns separated by the “|” character:

PMID1|ArticleTitle1
PMID2|ArticleTitle2
PMID3|ArticleTitle3
[...]
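
Once the columns are separated by a known character, standard Unix tools can split on that character further down the pipeline. The following is a minimal sketch, not part of EDirect: it uses hard-coded placeholder lines in place of real xtract output, and the function names sample_titles and second_column are hypothetical helpers introduced only for illustration. It assumes the separator character never appears inside the data itself.

```shell
# Placeholder lines standing in for xtract output (not real records).
sample_titles() {
  printf '%s\n' \
    'PMID1|ArticleTitle1' \
    'PMID2|ArticleTitle2' \
    'PMID3|ArticleTitle3'
}

# Hypothetical helper: keep only the second "|"-separated column.
second_column() {
  cut -d '|' -f 2
}

sample_titles | second_column
```

In practice, the output of your own xtract command would be piped in where sample_titles appears here.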

Adjusting the separator between multiple elements or attributes in the same column

Depending on your input XML, and on the arguments used in your xtract command, a single output column may contain multiple pieces of data. This can occur when you specify an element or attribute in the -element argument that is repeated in your -pattern, or when you put multiple different elements or attributes in the same column by using a comma.

By default, the separator between multiple elements or attributes in the same column is a tab (denoted in Unix as “\t”). This separator can be adjusted using the -sep argument. For example, the line of code:

xtract -pattern PubmedArticle -tab "|" -sep " " -element PMID LastName

will output columns separated by the “|” character, with multiple elements or attributes in the same column separated by a space:

PMID1|LastName1.1 LastName1.2 LastName1.3
PMID2|LastName2.1 LastName2.2
PMID3|LastName3.1 LastName3.2 LastName3.3 LastName3.4
[...]

It may be useful to use the -tab and -sep arguments together, defining one character to separate columns and a different character to separate multiple elements in the same column. For example:

xtract -pattern PubmedArticle -tab "|" -sep "/" -element MedlineCitation/PMID DescriptorName

will create a new row for each PubMed record, with two columns separated by “|”. The first column will have the record’s PMID. The second column will have a list of all of the MeSH headings attached to the record, with each heading separated by a “/”.
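
Using two different separators also makes the output easy to take apart again later. The sketch below is illustrative only: the sample lines are made-up placeholders rather than real xtract output, the function names sample_headings and one_heading_per_line are hypothetical, and it assumes that neither “|” nor “/” occurs inside the values themselves.

```shell
# Placeholder lines standing in for xtract output (not real records).
sample_headings() {
  printf '%s\n' \
    'PMID1|HeadingA/HeadingB/HeadingC' \
    'PMID2|HeadingD'
}

# Hypothetical helper: split the second column on "/" and print one
# tab-separated "PMID, heading" line for each heading.
one_heading_per_line() {
  awk -F '|' '{
    n = split($2, headings, "/")
    for (i = 1; i <= n; i++) print $1 "\t" headings[i]
  }'
}

sample_headings | one_heading_per_line
```

Each output line pairs a PMID with a single heading, which can be a more convenient shape for sorting or counting.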

Specifying a placeholder for blank cells

Depending on your input XML, and on the arguments used in your xtract command, some records may have no data for one or more of your columns. By default, xtract does not leave an empty cell for a missing value, so depending on the order of your columns, and on how you define your separators, the remaining values in that row may shift over and throw off the alignment of your table.

For example, if you wrote an xtract command to retrieve the PMID, journal title abbreviation, volume number, issue number, pagination, and article title:

xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID ISOAbbreviation Volume Issue MedlinePgn ArticleTitle

you might expect an output that looks like this:

PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|Issue2|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|Volume4|Issue4|MedlinePgn4|ArticleTitle4

However, if some of the citations are from journals that do not have issue numbers (or volume numbers), you may not get the output you want:

PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|Issue4|MedlinePgn4|ArticleTitle4

As you can see, in the second row, the pagination appears in the issue number column, while the article title is shifted over into the pagination column. In the fourth row, we have a similar problem, with the issue number in the volume column, and all of the subsequent columns misaligned.
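
Misaligned rows like these can be hard to spot by eye in a large file. One possible check, sketched here with the placeholder output above rather than real data, is to count the separator-delimited fields in each row and report any row that does not have the expected number. The function names misaligned_sample and rows_with_wrong_field_count are hypothetical, and the check assumes the “|” character never appears inside a value.

```shell
# Placeholder rows copied from the misaligned example (not real records).
misaligned_sample() {
  printf '%s\n' \
    'PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1' \
    'PMID2|ISOAbbreviation2|Volume2|MedlinePgn2|ArticleTitle2' \
    'PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3' \
    'PMID4|ISOAbbreviation4|Issue4|MedlinePgn4|ArticleTitle4'
}

# Hypothetical helper: report the row number and field count of every
# row whose number of "|"-separated fields differs from the first argument.
rows_with_wrong_field_count() {
  awk -F '|' -v expected="$1" 'NF != expected { print NR ": " NF " fields" }'
}

misaligned_sample | rows_with_wrong_field_count 6
```

With this sample, the second and fourth rows are reported, each with five fields instead of six.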

You can use the -def argument to define a placeholder (i.e. a “default” value) for cells that would otherwise be blank, which will ensure that the correct data ends up in the correct column.

xtract -pattern PubmedArticle -tab "|" -def "N/A" -element MedlineCitation/PMID ISOAbbreviation Volume Issue MedlinePgn ArticleTitle

This will give you a much more desirable output:

PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|N/A|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|N/A|Issue4|MedlinePgn4|ArticleTitle4

Our issues in the second row are now corrected, as the issue column has an “N/A” placeholder, ensuring that the subsequent columns are correctly aligned. Likewise, our fourth row is correctly aligned due to the “N/A” placeholder in the volume column.
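
A placeholder also gives later processing steps something concrete to search for. As a rough sketch, again using the placeholder rows above instead of real xtract output, the following lists the first column of every row whose chosen column holds the placeholder. The function names aligned_sample and rows_missing are hypothetical, and the approach assumes the placeholder text cannot occur as a genuine value in that column.

```shell
# Placeholder rows copied from the aligned example (not real records).
aligned_sample() {
  printf '%s\n' \
    'PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1' \
    'PMID2|ISOAbbreviation2|Volume2|N/A|MedlinePgn2|ArticleTitle2' \
    'PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3' \
    'PMID4|ISOAbbreviation4|N/A|Issue4|MedlinePgn4|ArticleTitle4'
}

# Hypothetical helper: print the first column of each row in which
# column number $1 is exactly equal to the placeholder text $2.
rows_missing() {
  awk -F '|' -v col="$1" -v placeholder="$2" '$col == placeholder { print $1 }'
}

# Column 4 is the issue number in this layout.
aligned_sample | rows_missing 4 "N/A"
```

Changing the first argument to 3 would instead list the rows that lack a volume number.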