xtract: Formatting arguments
As new Insider's Guide classes are no longer being offered, this site is not currently being updated. Please refer to NCBI's E-utilities documentation for more up-to-date information.
In addition to using xtract to specify which data to include in your output, you can also use xtract to customize the way your output displays.
Adjusting the separator between columns
By default, the separator between xtract columns is a tab (denoted in Unix as “\t”). This separator can be adjusted using the -tab argument. For example, the line of code:
xtract -pattern PubmedArticle -tab "|" -element PMID ArticleTitle
will output columns separated by the “|” character:
PMID1|ArticleTitle1
PMID2|ArticleTitle2
PMID3|ArticleTitle3
[...]
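Once the columns are separated by a known character, standard Unix tools can pick individual columns back out. The following is a minimal sketch rather than part of xtract itself: it does not call xtract, but works on a hard-coded sample in the same shape as the output above (the values are placeholders), and it assumes that no field contains a “|” character of its own.

```shell
# Sample rows in the shape produced by the command above (placeholder values).
sample_output='PMID1|ArticleTitle1
PMID2|ArticleTitle2
PMID3|ArticleTitle3'

# Keep only the second column (the article titles), splitting on "|".
titles=$(printf '%s\n' "$sample_output" | cut -d '|' -f 2)
printf '%s\n' "$titles"
```

In practice you would pipe the xtract output straight into cut instead of using a sample variable.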
Adjusting the separator between multiple elements or attributes in the same column
Depending on your input XML, and on the arguments used in your xtract command, a single output column may contain multiple pieces of data. This can occur when you specify an element or attribute in the -element argument that is repeated in your -pattern, or when you put multiple different elements or attributes in the same column by using a comma.
By default, the separator between multiple elements or attributes in the same column is a tab (denoted in Unix as “\t”). This separator can be adjusted using the -sep argument. For example, the line of code:
xtract -pattern PubmedArticle -tab "|" -sep " " -element PMID LastName
will output columns separated by the “|” character, with multiple elements or attributes in the same column separated by a space:
PMID1|LastName1.1 LastName1.2 LastName1.3
PMID2|LastName2.1 LastName2.2
PMID3|LastName3.1 LastName3.2 LastName3.3 LastName3.4
[...]
It may be useful to use the -tab and -sep arguments together, defining one character to separate columns and a different character to separate multiple elements in the same column. For example:
xtract -pattern PubmedArticle -tab "|" -sep "/" -element MedlineCitation/PMID DescriptorName
will create a new row for each PubMed record, with two columns separated by “|”. The first column will have the record’s PMID. The second column will have a list of all of the MeSH headings attached to the record, with each heading separated by a “/”.
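Using two different separators also makes the output easy to take apart again later. The sketch below is one possible approach, not an xtract feature: it uses awk on a hard-coded sample row (the PMID and heading values are placeholders) to split the row into columns on “|”, then split the second column on “/”, printing one PMID and heading pair per line. It assumes that neither character appears inside the data itself.

```shell
# One sample row in the shape described above (placeholder values).
sample_row='PMID1|Heading1/Heading2/Heading3'

# Split on "|" to get the columns, then split column 2 on "/" and
# print one tab-separated PMID/heading pair per line.
pairs=$(printf '%s\n' "$sample_row" | awk -F '|' '{
    n = split($2, headings, "/")
    for (i = 1; i <= n; i++) print $1 "\t" headings[i]
}')
printf '%s\n' "$pairs"
```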
Specifying a placeholder for blank cells
Depending on your input XML, and on the arguments used in your xtract command, one or more of the output rows for a given column may contain no data. Depending on the order of your columns, and on how you define your separators, these blank cells may throw off the alignment of your table.
For example, if you wrote an xtract command to retrieve the PMID, journal title abbreviation, volume number, issue number, pagination, and article title:
xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID ISOAbbreviation Volume Issue MedlinePgn ArticleTitle
You might expect an output that looks like this:
PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|Issue2|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|Volume4|Issue4|MedlinePgn4|ArticleTitle4
However, if some of the citations are from journals that do not have issue numbers (or volume numbers), you may not get the output you want:
PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|Issue4|MedlinePgn4|ArticleTitle4
As you can see, in the second row, the pagination appears in the issue number column, while the article title is shifted over into the pagination column. In the fourth row, we have a similar problem, with the issue number in the volume column, and all of the subsequent columns misaligned.
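In a large output, misaligned rows like these can be hard to spot by eye. One way to find them, sketched here under the assumption that no field contains a “|” character, is to count the fields in each row with awk and report any row that does not have the expected six. The sample reuses the placeholder rows shown above and does not call xtract.

```shell
# The misaligned sample output from above (placeholder values).
sample_output='PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|Issue4|MedlinePgn4|ArticleTitle4'

# Report the row number and field count of every row that lacks 6 fields.
short_rows=$(printf '%s\n' "$sample_output" |
    awk -F '|' 'NF != 6 { print NR ": " NF " fields" }')
printf '%s\n' "$short_rows"
```

For this sample, the check reports rows 2 and 4, each with five fields instead of six.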
You can use the -def argument to define a placeholder (i.e. a “default” value) for cells that would otherwise be blank, which will ensure that the correct data ends up in the correct column.
xtract -pattern PubmedArticle -tab "|" -def "N/A" -element MedlineCitation/PMID ISOAbbreviation Volume Issue MedlinePgn ArticleTitle
This will give you a much more desirable output:
PMID1|ISOAbbreviation1|Volume1|Issue1|MedlinePgn1|ArticleTitle1
PMID2|ISOAbbreviation2|Volume2|N/A|MedlinePgn2|ArticleTitle2
PMID3|ISOAbbreviation3|Volume3|Issue3|MedlinePgn3|ArticleTitle3
PMID4|ISOAbbreviation4|N/A|Issue4|MedlinePgn4|ArticleTitle4
Our issues in the second row are now corrected, as the issue column has an “N/A” placeholder, ensuring that the subsequent columns are correctly aligned. Likewise, our fourth row is correctly aligned due to the “N/A” placeholder in the volume column.