----------------------------------------------------------------------
The Insider's Guide to Accessing NLM Data: EDirect for PubMed
Part Three: Formatting Results and Unix tools
Course Materials
----------------------------------------------------------------------
NOTE: Solutions to all exercises are at the bottom of this document.
----------------------------------------------------------------------
REMINDERS FROM PART ONE

esearch: Searches a database and returns PMIDs
efetch: Retrieves PubMed records in a variety of formats
Use "|" (Shift + \, pronounced "pipe") to "pipe" the results of one
command into the next
----------------------------------------------------------------------
REMINDERS FROM PART TWO

xtract: Pulls data from XML and arranges it in a table
-pattern: Defines rows for xtract
-element: Defines columns for xtract
Identify XML elements by name (e.g. ArticleTitle)
Identify specific child elements with Parent/Child construction
(e.g. MedlineCitation/PMID)
Identify attributes with "@" (e.g. MedlineCitation@Status)
----------------------------------------------------------------------
TIPS FOR CYGWIN USERS:

Copy: Ctrl + Insert (NOT Ctrl + C!)
Paste: Shift + Insert (NOT Ctrl + V!)
----------------------------------------------------------------------
TIPS FOR ALL USERS:

Ctrl + C "cancels" and gets you back to a prompt
Up and Down arrow keys allow you to cycle through your recent commands
clear: clears your screen
----------------------------------------------------------------------
-tab and -sep

-tab defines the separator between columns
-sep defines the separator between multiple values in the same column
The default for both -tab and -sep is "\t" (the tab character)
Changes to -tab and -sep only affect subsequent -element/-first/-last
arguments

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID ISSN LastName

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "\t" -sep "\t" \
-element MedlineCitation/PMID ISSN LastName

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "\t" -sep " " \
-element MedlineCitation/PMID ISSN LastName

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep " " \
-element MedlineCitation/PMID ISSN LastName

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep "," \
-element MedlineCitation/PMID ISSN LastName

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep ", " \
-element MedlineCitation/PMID ISSN LastName
----------------------------------------------------------------------
With -tab/-sep, order matters!
-tab/-sep only affect subsequent -elements
Later -tab/-sep overwrite earlier ones
----------------------------------------------------------------------
EXERCISE 1

Write an xtract command that:
* has a new row for each PubMed record
* has columns for PMID, Journal Title Abbreviation, and
  Author-supplied Keywords

Each column should be separated by "|"
Multiple keywords in the last column should be separated with commas

Sample Output:

26359634|Elife|Argonaute,RNA silencing,biochemistry,biophysics,human,microRNA,structural biology

Use the following efetch as input:

efetch -db pubmed -id 26359634,24102982,28194521,27794519 -format xml | \

(ANSWERS TO ALL EXERCISES ARE AT THE BOTTOM OF THIS HANDOUT.)
----------------------------------------------------------------------
Authors: First Draft

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID LastName Initials
----------------------------------------------------------------------
-block

-block associates multiple child elements of the same parent element
in the results

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -element LastName Initials
----------------------------------------------------------------------
What we know so far...

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep ", " \
-element MedlineCitation/PMID ISSN LastName
----------------------------------------------------------------------
Putting two different elements in the same column

Separate multiple -element values with a comma instead of a space.

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block Author -sep " " -element LastName,Initials
----------------------------------------------------------------------
"-block" resets -tab/-sep to default

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID \
-block Author -sep " " -element LastName,Initials

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -element MedlineCitation/PMID \
-block Author -tab "|" -sep " " -element LastName,Initials
----------------------------------------------------------------------
EXERCISE 2

Write an xtract command that:
* Has a new row for each PubMed record
* Has a column for PMID
* Lists all of the MeSH headings, separated by "|"
* If a heading has multiple subheadings attached, separate the
  heading and subheadings with "/"

Sample Output:

24102982|Cell Fusion|Myoblasts/cytology/metabolism|Muscle Development/physiology

Use the following efetch as input:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \

(ANSWERS TO ALL EXERCISES ARE AT THE BOTTOM OF THIS HANDOUT.)
----------------------------------------------------------------------
Saving results to a file

Use ">" to save the output to a file

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml > testfile.txt

efetch -db pubmed -id 24102982,21171099,17150207 -format xml > testfile.xml
----------------------------------------------------------------------
But where is my file!?

Use "pwd" to "Print the Working Directory" (that is, display on the
screen the name of the directory you are working in). This is where
your file was saved.

CYGWIN USERS: Your working directory is probably a subfolder of the
folder where you installed Cygwin.
In Cygwin, try:

cygpath -w ~

MAC USERS: Your working directory is probably in your Users folder:
Users/
----------------------------------------------------------------------
Another way to find your files

COMMAND STRING:

efetch -db pubmed -id 24102982,21171099,25359968,17150207 -format uid > specialname.csv

Use "ls" to list the files in your current directory.
----------------------------------------------------------------------
EXERCISE 3: Retrieving XML

How can I get the full XML of all articles about the relationship of
Zika Virus to microcephaly in Brazil? Save your results to a file.

(ANSWERS TO ALL EXERCISES ARE AT THE BOTTOM OF THIS HANDOUT.)
----------------------------------------------------------------------
cat

Short for concatenate, "cat" opens files to display them on the screen.
"cat" can also combine/append files
----------------------------------------------------------------------
Reading a search string from a file

Use "$(cat filename)" to use the contents of a file in a command

COMMAND STRING:

esearch -db pubmed -query "$(cat searchstring.txt)"
----------------------------------------------------------------------
epost

Uploads a list of PMIDs to the history server

COMMAND STRING:

epost -db pubmed -id 24102982,21171099

epost -db pubmed -id 24102982,21171099 | efetch -format abstract
----------------------------------------------------------------------
An epost-efetch pipeline

cat specialname.csv | epost -db pubmed | efetch -format abstract

Using the -input argument

epost -db pubmed -input specialname.csv | efetch -format abstract

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
HOMEWORK FOR PART THREE
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
(Answers are available at:
https://dataguide.nlm.nih.gov/classes/edirect-for-pubmed/samplecode3.html)
-----------------------------------------------------------------------
Question 1:

In the PubMed XML of each record, there is a element, with one or
more elements which provide dates for various stages in each article's
life cycle. These can include when the article was submitted to the
publisher for review, when the article was accepted by the publisher
for publication, when it was added to PubMed, and/or when it was
indexed for MEDLINE, among others. Not all citations will include
entries for each type of date.

For the following list of PMIDs

22389010
20060130
14678125
19750182
19042713
18586245

write a series of commands that retrieves each record and extracts all
of these different dates, along with the labels that indicate which
type of date is which. Each PubMed record should appear on a separate
line. Each line should start with the PMID, followed by a tab,
followed by the list of dates. For each date, include the label,
followed by a ":", followed by the year, month and day, separated by
slashes. Separate each date with a "|".

Example output:

18586245 received:2008/01/21|revised:2008/05/05|accepted:2008/05/07|pubmed:2008/7/1|medline:2008/10/28|entrez:2008/7/1
-----------------------------------------------------------------------
Question 2:

Identify your "working directory". Write a series of commands that
retrieve PubMed data, redirect the output to a file, and locate the
file on your computer.
-----------------------------------------------------------------------
Question 3:

Write a series of commands that identifies the top ten agencies that
have most frequently funded published research on diabetes and
pregnancy over the last year and a half. Your script should start with
a search for articles about diabetes and pregnancy that were published
between January 1, 2016 and June 30, 2017, should extract the agencies
listed as funders on each citation, and should output a list of the
ten most frequently occurring agencies. Save the results to a file.

Note: This script may take some time to run.
As you build it, consider testing with a small set of PubMed records,
or with a search that has a smaller date range.
-----------------------------------------------------------------------
Question 4:

Write a PubMed search strategy and save it to a file. Write a series
of commands to search PubMed using the search string contained in the
file and retrieve a list of PMIDs for all records which meet the
search criteria.
-----------------------------------------------------------------------
Question 5:

Save the following list of PMIDs in a .csv file:

22389010
20060130
14678125
19750182
19042713
18586245

Write a series of commands to retrieve the full PubMed XML records for
all of the PMIDs in the file, and save the resulting XML to a .xml
file.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
EXERCISE SOLUTIONS:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
EXERCISE 1

Write an xtract command that:
* has a new row for each PubMed record
* has columns for PMID, Journal Title Abbreviation, and
  Author-supplied Keywords

Each column should be separated by "|"
Multiple keywords in the last column should be separated with commas

Sample Output:

26359634|Elife|Argonaute,RNA silencing,biochemistry,biophysics,human,microRNA,structural biology

SOLUTION:

efetch -db pubmed -id 26359634,24102982,28194521,27794519 -format xml | \
xtract -pattern PubmedArticle -tab "|" -sep "," \
-element MedlineCitation/PMID ISOAbbreviation Keyword
-=-=-=-=-=-=-=-=-=-=-=-=-
EXERCISE 2:

Write an xtract command that:
* Has a new row for each PubMed record
* Has a column for PMID
* Lists all of the MeSH headings, separated by "|"
* If a heading has multiple subheadings attached, separate the
  heading and subheadings with "/"

Sample Output:

24102982|Cell Fusion|Myoblasts/cytology/metabolism|Muscle Development/physiology

SOLUTION:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml | \
xtract -pattern PubmedArticle -tab "|" -element \
MedlineCitation/PMID \
-block MeshHeading -tab "|" -sep "/" -element DescriptorName,QualifierName
-=-=-=-=-=-=-=-=-=-=-=-=-
EXERCISE 3: Retrieving XML

How can I get the full XML of all articles about the relationship of
Zika Virus to microcephaly in Brazil? Save your results to a file.

SOLUTION:

esearch -db pubmed \
-query "zika virus microcephaly brazil" | \
efetch -format xml > zika.xml
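
-=-=-=-=-=-=-=-=-=-=-=-=-
The file-handling pieces covered in this part (">", "pwd", "ls", "cat",
and "$(cat filename)") can be tried without contacting PubMed. The
sketch below is only an illustration under stated assumptions: it uses
standard shell commands rather than EDirect, the file name
searchstring.txt follows the handout, and the scratch directory and the
variable names "workdir" and "query" are hypothetical choices made for
this example.

```shell
# Work in a scratch directory so no existing files are touched.
# (Assumption: mktemp is available; any empty folder works as well.)
workdir="$(mktemp -d)"
cd "$workdir"

# ">" saves output to a file instead of printing it on the screen.
echo "zika virus microcephaly brazil" > searchstring.txt

# "pwd" shows where the file was saved; "ls" lists the files there.
pwd
ls

# "cat" displays the contents of the file on the screen.
cat searchstring.txt

# "$(cat filename)" places the file contents inside another command.
# Here it fills a variable; in the handout it fills the -query argument.
query="$(cat searchstring.txt)"
echo "Search string: $query"
```

With EDirect installed, the same file can supply the search shown
earlier in this handout:
esearch -db pubmed -query "$(cat searchstring.txt)"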