Epilepsy, leukemia, psoriasis and dry eye syndrome are among the diseases and conditions that received critical FDA approvals for new treatments in 2016. The scientists who develop new drugs to treat these and other diseases rely on published scientific literature to help guide their research efforts toward new insights and discoveries. But given the volume of content published each year, researchers fight an uphill battle to stay abreast of recent developments in their therapeutic areas.
Consider: more than 2.5 million peer-reviewed articles appear in scholarly journals in a single year. It’s simply not feasible for any one individual, or even a team, to keep up with and synthesize this amount of information.
Fortunately, with the right technology, such as text mining tools, researchers can meet this challenge head on.
Conducting comprehensive searches, ensuring relevant results
Simple web searches suffice for most personal—and many professional—purposes. For example, people who search the web for vacation destinations do not need to read through every website they discover. They inevitably view a few sites and make their vacation plans based upon what they’ve learned. But for researchers working to understand all the genes involved in a disease, this approach doesn’t work; perusing just the first few dozen search results is simply not good enough.
It’s close to impossible for researchers to manually conduct systematic, comprehensive searches that efficiently capture all the information they need and provide the means to gain deep insight into the content.
Searching the abstracts of studies rather than the full articles might seem like an efficient way to comb through large amounts of content, but that approach has its problems. By searching abstracts alone, researchers risk missing relevant text contained in the body, tables or figures of articles.
A recent study from the Technical University of Denmark in Kongens Lyngby, which analyzed more than 15 million scientific articles published in English between 1823 and 2016, supports the use of full-text articles over abstracts in text mining projects: the researchers found that mining full-text articles gave consistently better results than mining abstracts alone.
Searching through abstracts can provide valuable information, but researchers might not fully benefit without a view of the entire article. A good example is researching mutations in rare diseases to enable better treatment: often, specific mutations are not described in the abstract but appear deeper in the full-text paper, frequently within tables.
The next roadblock researchers face lies in the inconsistent formats and structures of scientific articles. Mixtures of unstructured and semi-structured content require standardization for optimal text mining: PDF, HTML and Word files must be converted to a single format (e.g., XML), and the naming conventions used for sections of scientific articles must be normalized. This conversion process is time-consuming and prone to error, often taking up more time than the text mining itself.
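To illustrate the normalization step, the sketch below maps the many heading variants publishers use onto a single canonical scheme. The mapping table, function names and canonical labels are illustrative assumptions, not the conventions of any particular pipeline or tool.

```python
import re

# Assumed mapping from common heading variants to canonical section labels.
# A production pipeline would use a far larger, curated table.
CANONICAL_SECTIONS = {
    r"^(materials?\s*(and|&)\s*)?methods$": "methods",
    r"^results?(\s*(and|&)\s*discussion)?$": "results",
    r"^(discussion|conclusions?)$": "discussion",
    r"^(background|introduction)$": "introduction",
}

def normalize_heading(raw: str) -> str:
    """Map a raw section heading to a canonical label, or 'other'."""
    # Strip numbering ("3. Results") and trailing punctuation/whitespace.
    cleaned = re.sub(r"^[\d.\s]+|[\d.\s]+$", "", raw.lower())
    for pattern, canonical in CANONICAL_SECTIONS.items():
        if re.match(pattern, cleaned):
            return canonical
    return "other"
```

With this kind of lookup, "Materials and Methods", "METHODS" and "2. Methods" all resolve to the same label, so downstream queries can target a section regardless of the source journal's style.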
What is a researcher to do?
Using text-mining tools to drive discovery
Technologies exist that specialize in deploying natural-language-processing-based text mining for high-value knowledge discovery and decision support. Such technologies help solve the above research challenges.
The most widely used solutions combine linguistic tools—which break down the data by subjects, objects and the relationships between the two—with tools that identify semantics, vocabularies or ontologies.
For example, such tools enable a researcher to search for words associated with a set of genes or a disease, together with phrases that express specific relationships, such as “is involved in,” “is a component of,” or “has a role in.” This approach ensures that the results indicate whether a gene is actually involved in a disease, rather than merely mentioned alongside it.
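A heavily simplified sketch of this idea is shown below: it pairs a gene vocabulary with a small set of relationship verbs and reports gene-relation matches from sentences that mention the disease. The gene list, verb list and function name are assumptions for illustration only; real NLP-based platforms use full linguistic parsing and curated ontologies rather than regular expressions.

```python
import re

# Illustrative vocabularies (assumed examples, not a tool's ontology).
GENES = ["BRCA1", "TP53", "EGFR"]
RELATIONS = ["is involved in", "is a component of", "has a role in",
             "is associated with"]

def find_gene_disease_relations(text, disease):
    """Return (gene, relation) pairs from sentences mentioning the disease."""
    hits = []
    # Naive sentence split on terminal punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if disease.lower() not in sentence.lower():
            continue
        for gene in GENES:
            for rel in RELATIONS:
                # Gene must precede the relation phrase in the sentence.
                if re.search(rf"\b{gene}\b.*\b{re.escape(rel)}\b", sentence):
                    hits.append((gene, rel))
    return hits
```

Even this toy version shows the benefit over plain keyword search: a sentence that merely lists a gene and a disease together produces no match unless a relationship phrase links them.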
Driving real outcomes for companies and consumers
The use of text and data mining technology helps researchers overcome the major hurdle of extracting information from the continually growing body of research and compiling it into a structured, digestible form. With tools to produce structured data, researchers can analyze information, visualize results and gain actionable insights more quickly than they would without such tools.
Technology solutions are vital to stay at the forefront of research and discovery and to overcome the challenges inherent to pharmacological research. The combination of curated databases with text-mining solutions can help researchers make incremental gains at each step of the research process, helping them to expedite the drug discovery cycle, reduce research costs, beat the competition and, most importantly, improve patient outcomes.
Jane Reed is the head of life science strategy at Linguamatics. She is responsible for the life-sciences business unit, developing the strategic vision for Linguamatics’ growing product portfolio—including its text mining platform, I2E, partner relationships and business development for pharma and biotech. Reed has extensive experience in life-sciences informatics, including roles at Instem, BioWisdom, Incyte and Hexagen.