Guest Commentary: Data visualization--New directions or just familiar routes?

There is a risk that visualizations could end up being used to confirm or justify our own hypotheses and biases, but could data visualizations bring to light patterns in our data, drive new hypotheses and show us things we weren’t expecting?
Data visualization tools make it very easy to represent our data graphically and present it in a way that clearly communicates patterns and trends. But there is a risk that visualizations may be used, in practice, to confirm or justify our own hypotheses and biases. Instead, can data visualizations bring to light patterns in our data, drive new hypotheses and show us things we weren’t expecting?
Given the efficiency with which we can process visual information, it is easy to explain the appeal of data visualization. At its best, a visualization can highlight patterns which numerical analyses might otherwise miss. Anscombe’s quartet (Anscombe, 1973) is a good example: four data sets whose summary statistics are nearly identical, but which show very different relationships when visualized. This sanity check can be invaluable, and yet it should be remembered that inappropriate choices of plot type, axis scale or direction, and color can result in visualizations which are uninformative or misleading at first glance. Our ability to spot visual patterns quickly can work against us when an inappropriate visualization is presented, whether or not the creator was deliberately attempting to mislead us.
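As a small numerical illustration of the Anscombe example, the sketch below computes the mean of x, the mean of y and the Pearson correlation for each of the four data sets, using only the Python standard library. The data values are those published in Anscombe (1973); the helper name `pearson_r` and the `summary` dictionary are our own naming.

```python
from statistics import mean

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Summary statistics per data set: (mean of x, mean of y, correlation).
summary = {
    name: (mean(xs), mean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"Set {name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four printed rows are practically indistinguishable, which is exactly why the scatter plots are needed to tell the data sets apart.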
If we consider a drug discovery project for which we have measured potency data, a common question to ask might be “On which compounds should I focus my attention?” We will illustrate this using an example set of 264 compounds, representing six different chemical series, for which 5-HT1A activities have been measured. The examples in Figure 1 show the importance of choosing an appropriate way to represent the range of potencies across the chemistries. The simple histogram shown in A gives a good overview but tells us nothing about how the potencies are distributed across the chemical series until we introduce some color, as in B. Even this isn’t very clear, given the different number of compounds in each chemical series. In the two-dimensional histograms in C and D, by contrast, the height of each bar shows the average potency of a chemical series. The choice of y-axis scale also influences our view of the data: in C the chemical series look almost identical, whereas in D there appear to be significant differences. But what is the “right” range? In E, having chosen a range based on potency levels we might consider to span “inactive to very active,” we can also see the importance of adding error bars to give some indication of the distribution of potencies. This highlights the value of representing these data using a box plot instead, as in F. Now it becomes clear that each series contains some potent compounds, but the indole-3-alkylamines certainly appear to be the most active.
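The difference between a bar showing average potency and a box plot can be sketched numerically. The example below uses two made-up series (the pKi values are hypothetical, not the 5-HT1A data discussed above) that have the same mean but very different spreads, and computes the five numbers a box plot would draw. The function name `five_number_summary` is our own.

```python
from statistics import mean, quantiles

# Hypothetical pKi values for two invented series (not the article's data).
series = {
    "series_A": [6.0, 6.1, 6.2, 6.3, 6.4],
    "series_B": [4.0, 4.5, 6.2, 7.9, 8.4],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values drawn by a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


for name, values in series.items():
    lo, q1, med, q3, hi = five_number_summary(values)
    print(
        f"{name}: mean = {mean(values):.2f}, "
        f"min = {lo}, Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}, max = {hi}"
    )
```

A bar chart of the means would show two bars of equal height, while the five-number summaries reveal that only the second invented series contains any highly potent compounds.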
Of course, potency is just one of the properties we would need to consider in order to identify a high-quality lead compound, and in our data set we also have predicted values for a number of typical absorption, distribution, metabolism and excretion (ADME) and physicochemical properties. While we could create a box plot for each property of interest, this would require us to look at and, most importantly, make sense of a large number of visualizations. We could attempt to put many dimensions of data into a single scatter plot (three dimensions plotted, with others represented by color, size, transparency, etc.), but unless there are some very obvious outliers, it is likely to be very hard to interpret.
It is worth considering, therefore, the kind of information with which we typically make decisions in drug discovery. All of the properties used to analyze and select compounds are derived from models of the ultimate human patient in which we are interested, whether those models are in vivo, in vitro or in silico. All measured data, however accurate, will contain some degree of uncertainty due to experimental variability, while in silico models will contain some statistical error. As an example, a good root-mean-square error for an aqueous solubility prediction represented as logS(µM) is approximately 0.6. In practice, this means that a predicted logS value of 1 (corresponding to 10 µM) suggests a fairly soluble compound, but all we know with 95-percent confidence is that its actual aqueous solubility lies somewhere between 630 nM and 0.16 mM, roughly two root-mean-square errors (1.2 log units) either side of the prediction. Knowing that the real value lies towards one end of the range or the other might have a significant impact upon a decision we make about selecting this compound.
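The arithmetic behind the solubility range can be checked with a few lines of Python. This is a minimal sketch that assumes the prediction error is normally distributed in log units, treats the root-mean-square error as a standard deviation, and uses two standard deviations as an approximation to a 95-percent interval. The function name `solubility_interval_uM` is hypothetical.

```python
def solubility_interval_uM(log_s, rmse, z=2.0):
    """Approximate confidence interval, in µM, for a predicted logS(µM).

    Assumes Gaussian error in log units; z=2.0 approximates 95 percent.
    """
    low = 10 ** (log_s - z * rmse)
    high = 10 ** (log_s + z * rmse)
    return low, high


low, high = solubility_interval_uM(log_s=1.0, rmse=0.6)
print(f"Predicted: {10 ** 1.0:.0f} µM")
print(f"Approx. 95% interval: {low * 1000:.0f} nM to {high / 1000:.2f} mM")
```

Because the uncertainty is symmetric in log units, the interval is strongly asymmetric on the linear scale: it extends about 16-fold below and 16-fold above the predicted 10 µM.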
And this is just a single property. It is common to base compound selection decisions on criteria for multiple properties. At their simplest, these criteria might just be cutoffs when we believe that an acceptable compound will have a property value on the “right” side of a threshold. When we consider the uncertainty around our data points, even a value on the “right” side of the threshold might have some probability of being unacceptable. In some cases, poor property values with high uncertainties may even represent better opportunities for optimization than very accurate values which fall just short of the necessary criteria.
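To make the threshold argument concrete, the sketch below estimates the probability that a compound's true value lies above a cutoff, assuming Gaussian uncertainty around the reported value. The function name `prob_above`, the cutoff and all property values and standard deviations are hypothetical, chosen only for illustration.

```python
from math import erf, sqrt


def prob_above(value, threshold, sd):
    """P(true value > threshold), assuming Gaussian error with std dev sd."""
    z = (value - threshold) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


threshold = 1.0  # hypothetical cutoff: logS(µM) must exceed 1

# A predicted value on the "right" side of the cutoff, with large uncertainty.
p_pass = prob_above(1.3, threshold, sd=0.6)

# A poor predicted value with high uncertainty...
p_uncertain = prob_above(0.4, threshold, sd=0.6)

# ...versus an accurate measurement that falls just short of the cutoff.
p_accurate = prob_above(0.9, threshold, sd=0.05)

print(f"1.3 +/- 0.6 : P(pass) = {p_pass:.3f}")
print(f"0.4 +/- 0.6 : P(pass) = {p_uncertain:.3f}")
print(f"0.9 +/- 0.05: P(pass) = {p_accurate:.3f}")
```

Under these assumptions, the value on the right side of the cutoff still has a substantial chance of failing, and the poor but uncertain value has a better chance of truly meeting the criterion than the accurate value that narrowly misses it.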
Adding this information about uncertainty into any data visualization will improve its representation of the true nature of the data, but this comes at the cost of interpretability. Just adding error bars to our plots isn’t likely to solve this problem.
One approach to dealing with this is to use multiparameter optimization (MPO) to generate a score that encapsulates multiple properties. There are several approaches to MPO (Segall, 2012), but by using one that explicitly considers the uncertainty, we can significantly reduce the number of dimensions we need to visualize. Applying this to the set of 5-HT1A compounds, a single visualization can now represent all of the underlying data, giving a more comprehensive picture of which chemical series have the greatest potential. In Figure 2, the compounds have been scored and plotted from left to right in order of descending score. We can see error bars which indicate when we can select between compounds with confidence, but the two highlighted series (arylpiperazines and aminotetralines) are represented by green and pink points, which dominate the left-hand side of the plot where the highest scores lie. This is despite these series not including the most potent compounds, as shown in the histograms below the representative compound displayed for each series.
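One simple way to fold uncertainty into a single score is to multiply, across properties, the probability that each criterion is met. The sketch below is our own simplified illustration of that idea, not the specific scoring method used to produce Figure 2 or any method defined in Segall (2012). It assumes Gaussian, independent errors, and every compound name, property value, standard deviation and threshold is hypothetical.

```python
from math import erf, sqrt


def prob_above(value, threshold, sd):
    """P(true value > threshold), assuming Gaussian error with std dev sd."""
    return 0.5 * (1.0 + erf((value - threshold) / (sd * sqrt(2.0))))


# Hypothetical criteria: property -> (threshold, True if higher is better).
criteria = {"pKi": (7.0, True), "logS": (1.0, True), "logP": (3.5, False)}

# Hypothetical compounds: property -> (value, standard deviation).
compounds = {
    "cmpd_1": {"pKi": (8.5, 0.3), "logS": (0.2, 0.6), "logP": (4.2, 0.4)},
    "cmpd_2": {"pKi": (7.4, 0.3), "logS": (1.8, 0.6), "logP": (2.5, 0.4)},
}


def mpo_score(props):
    """Product of the probabilities of meeting each criterion (0 to 1)."""
    score = 1.0
    for name, (threshold, higher_is_better) in criteria.items():
        value, sd = props[name]
        p = prob_above(value, threshold, sd)
        score *= p if higher_is_better else 1.0 - p
    return score


# Rank compounds by descending score, as in a score-ordered plot.
ranked = sorted(compounds, key=lambda c: mpo_score(compounds[c]), reverse=True)
for name in ranked:
    print(f"{name}: score = {mpo_score(compounds[name]):.3f}")
```

In this invented example the less potent compound ranks first, because the more potent one is unlikely to meet the solubility and lipophilicity criteria. This mirrors the observation that the highest-scoring series need not contain the most potent compounds.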
The visualization of any complex data set is problematic, but even much smaller sets can present challenges. If we consider a subset of the 5-HT1A compounds, comprising the drug Buspirone and a small number of analogs, we might hope to determine relationships between their structures and two important properties, potency and metabolic stability. The goal in this case is to identify structure-activity relationships (SAR) in order to design a potent, stable compound. Carrying out an R-group analysis and creating an SAR table for each property (Figure 3 A and B), we can quickly see that the combinations of R-groups which result in the highest potency do not give the best stability, and vice versa (in both cases, the greenest circles represent the best values). On the other hand, if we create an activity neighborhood diagram (Figure 3 C), we can visualize the relationships between the compounds in a different way. A representative compound with average potency and stability values is placed at the center, and the other compounds are organized in a spiral with the most structurally similar closest to the middle. The cards are colored by their metabolic stabilities, with green being the highest. The links show the difference in potency, with green showing the greatest difference and the arrow showing the direction in which the potency increases. Therefore, a green arrow pointing towards a green card, indicating a more potent and stable compound, easily stands out. Using this approach quickly highlights the original problem: simply moving a single functional group can change a compound from being “stable and not potent” to “unstable and potent.” The compound that stands out, however, which is both potent and stable but not easily apparent from the SAR table, results from a combination of changes. This is an important point that may otherwise have been overlooked.
At each decision point, the choice of visualization can be pivotal. The way we perceive a compound depends upon those around it: Have we explored the surrounding chemical space thoroughly enough to adequately evaluate a series? Do we have data of sufficient quality to confidently distinguish the good compounds from the rest? When it comes to making decisions about compounds, it is often the relationships between compounds that will influence the way we choose the next compound. Any visualizations which simplify or hide these relationships have the potential to bias our perception of the data.

Edmund J. Champness is chief scientific officer of Optibrium Ltd. With a background in mathematics, he joined GlaxoWellcome in 1995, working as part of a pioneering team building predictive pharmaceutical tools. He developed the first graphical user interfaces for working with predictive models, which were adopted globally within GlaxoWellcome. He was a core member of the team which established the U.K. operation of Camitro in 2001 and remained with that company through a series of mergers and acquisitions (ArQule, Inpharmatica and BioFocus DPI) until 2008. During this time, he designed and built the StarDrop software and, in 2009, co-founded Optibrium.
References
  • Anscombe, F. (1973). Graphs in Statistical Analysis. American Statistician, 27(1), 17-21.
  • Segall, M. (2012). Multi-Parameter Optimization: Identifying high quality compounds with a balance of properties. Current Pharmaceutical Design, 18(9), 1292-1310. http://www.optibrium.com/
