Building practical solutions
Now that you are familiar with the basics of Unix and EDirect, you may be eager to put these tools to work searching PubMed, retrieving records in XML, and extracting data into custom tables. Perhaps you have reviewed some of our EDirect documentation to see how the individual commands work in detail.
However, if you are new to computer programming, building your first script can still be daunting. Where do you start? What tools should you use? Should you even be using EDirect to solve this problem?
As we mentioned in a previous section, there are almost always multiple ways to accomplish a task in Unix, so there are often many “right” answers. Below, we present some steps for you to think about when considering an EDirect project.
- Identify your goal
- Choose your tool
- Decide how much to automate
- Build one step at a time
1. Identify your goal
Before starting to build a script, it can be helpful to think about your goal, and make sure you have a good idea of what you’re trying to accomplish. You will want to consider the input for your script, the output for your script, and the desired format.
The first thing you want to determine is what information you will be feeding into this solution. In other words, what do you already know? Perhaps you have a PubMed search strategy. Perhaps you have a list of PMIDs for PubMed records that you want to retrieve and analyze. Perhaps you have an XML document that already contains the PubMed records you wish to work with. You will want to build the first steps in your script to accommodate the type of information you want to input into the script.
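The three starting points described above might look like this in practice. This is a minimal sketch, not a recommended layout: the function names (`from_search`, `from_pmid_file`, `from_xml_file`) are hypothetical, and the first two assume that EDirect is installed and that you have network access. Each function ends by writing PubMed XML to standard output, so the later steps of a script do not need to know which kind of input you started with.

```shell
#!/bin/sh
# Hypothetical helpers: each one turns a different kind of input into
# PubMed XML on standard output.

# Input is a PubMed search strategy, passed as a single argument.
from_search() {
  esearch -db pubmed -query "$1" | efetch -format xml
}

# Input is a text file with one PMID per line.
from_pmid_file() {
  epost -db pubmed < "$1" | efetch -format xml
}

# Input is an XML file you saved earlier; no request to NCBI is needed.
from_xml_file() {
  cat "$1"
}
```

For example, `from_search "your search strategy" > records.xml` would save the records for a search, and `from_xml_file records.xml` would replay them later without contacting NCBI again.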
Next, consider your output: what is it that you want to know? Do you want a list of PMIDs? Do you want specific data from within a PubMed record? Which fields specifically are you trying to output? Having a concrete output in mind will help shape the direction of your development.
Once you have identified the data you want to output, think about how you would like it arranged. Is an XML file acceptable, or would you prefer a tabular output? If you are trying to create a table, what order should the columns be in? How should the columns be separated? These details will help you refine and polish your script, to get the results in the exact format you need.
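As one illustration of these formatting choices, the sketch below assumes that EDirect's `xtract` command is installed and that PubMed XML arrives on standard input. The function name `make_table` and the choice of columns (PMID, publication year, title) are ours, picked only for the example. The order of the names after `-element` sets the column order, and `-tab` sets the column separator.

```shell
#!/bin/sh
# Hypothetical helper: read PubMed XML on standard input and write one
# row per record, with columns separated by "|" instead of a tab.
make_table() {
  xtract -pattern PubmedArticle -tab "|" \
    -element MedlineCitation/PMID PubDate/Year ArticleTitle
}
```

You could try it on a saved file with `make_table < records.xml`, where `records.xml` stands in for any PubMed XML you have already retrieved. Records that lack one of the requested elements may produce shorter rows, so it is worth checking a few lines of output by eye.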
2. Choose your tool
The tool you choose for a project should serve the goal, not the other way around. EDirect can be very useful, but may not always be the most practical way to solve a problem. For example, if the question you are asking could be answered more quickly and efficiently by going to www.pubmed.gov, using EDirect is a waste of time. Likewise, if you require an extremely large amount of data to answer your question, EDirect might not be the best option.
To avoid overloading the E-utilities servers (which are contacted by many of the EDirect commands), NCBI asks that you follow a few usage guidelines. Details of these guidelines can be found in NCBI’s “A General Introduction to E-utilities” in the Usage Guidelines and Requirements section. If your project requires accessing an exceptionally large amount of PubMed data (for example, the entire PubMed database), you may be better served by using the bulk download options for PubMed data, offered via the NLM Data Distribution program.
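If your script repeats an EDirect command many times, one simple courtesy is to pause between runs. The sketch below is only an illustration of that idea: the helper name `run_each_line` is hypothetical, and the length of the pause is yours to choose after reading the usage guidelines; it is not a figure taken from them.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per input line, pausing
# between runs.
# Usage: run_each_line PAUSE_SECONDS COMMAND [ARGS...] < lines.txt
run_each_line() {
  pause=$1
  shift
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip blank lines so they do not trigger empty requests.
    [ -n "$line" ] || continue
    # Redirect the command's input from /dev/null so that it cannot
    # consume the remaining lines the loop is reading.
    "$@" "$line" < /dev/null
    sleep "$pause"
  done
}
```

For example, `run_each_line 1 esearch -db pubmed -query < queries.txt` would run one search per line of a file named `queries.txt` (a placeholder name), waiting one second after each.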
3. Decide how much to automate
One of the reasons EDirect is so useful is its ability to automate repetitive and time-consuming processes. However, the process of developing an EDirect script may also take some time, especially for newer users. It is worth considering whether the time spent in development is worth the efficiencies gained through automation.
This consideration is especially important when determining when to stop development. There is almost always a way to accomplish 100% of your goal in a single script. However, there are usually also ways of accomplishing 90%, 75% or 50% of your goal in a single script, and doing the remaining 10%, 25% or 50% manually. You will need to decide whether the additional time and effort it will take to get from 90% to 100% is more or less efficient than simply doing the remaining 10% manually.
Additionally, it is important to consider whether your process requires human judgement. Automation can save time, but evaluation of results by a human is often necessary. Ideally, automating the repetitive processes will free up more time for you to perform the tasks that require human involvement.
4. Build one step at a time
Once you have completed the big-picture thinking, it is time to start building your script. Because of the modular nature of Unix, it is often possible to develop each step of your pipeline on its own, testing to make sure it works before integrating it into the larger solution. This can save you troubleshooting time later, as it allows you to identify problems earlier, without having to execute your entire script.
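One way to work like this is to save a copy of each stage's output as it passes through the pipeline. The sketch below is an assumption-laden example, not a prescribed method: `checkpoint` and `build_in_stages` are hypothetical names, the file names are placeholders, and `build_in_stages` requires EDirect and network access (it is only defined here, not run).

```shell
#!/bin/sh
# Hypothetical helper: save a copy of the data passing through a
# pipeline, report how many lines were saved, and pass the data on
# unchanged.
checkpoint() {
  tee "$1"
  printf '%s: %s lines\n' "$1" "$(wc -l < "$1" | tr -d ' ')" >&2
}

# Sketch of a three-stage pipeline with a checkpoint after the first
# two stages. The search strategy is passed as a single argument.
build_in_stages() {
  esearch -db pubmed -query "$1" |
    checkpoint stage1_search.xml |
    efetch -format xml |
    checkpoint stage2_records.xml |
    xtract -pattern PubmedArticle -element MedlineCitation/PMID ArticleTitle
}
```

With the intermediate files saved, you can inspect each one and keep refining a later stage against the saved copy, for example `xtract -pattern PubmedArticle -element MedlineCitation/PMID < stage2_records.xml`, without repeating the search or the download.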
If you are interested in learning more about EDirect in a hands-on environment, our EDirect for PubMed class introduces new users to working with EDirect commands to access PubMed data. Over the course of three 90-minute sessions, students will learn how to use EDirect commands in a Unix environment to access PubMed, design custom output formats, create basic data pipelines to get data quickly and efficiently, and develop simple strategies for solving real-world PubMed data-gathering challenges.
If you prefer to dive right in, take a look at our EDirect documentation to better understand the uses and syntax of the various EDirect commands, and get started creating practical solutions to PubMed problems with EDirect!