Lloyd Smith, a biochemist at the University of Wisconsin-Madison, known for his extensive expertise in mass spectrometry, began his research journey studying genetics. He invented the first automated DNA sequencer during his postdoctoral research and joined the University of Wisconsin-Madison in 1987, focusing on capillary electrophoresis. The emergence of matrix-assisted laser desorption/ionization and electrospray ionization in the 1990s soon caught his attention. These technologies allowed him to ionize and separate DNA molecules in a mass spectrometer instead of a gel.
“During that time, we ran into an issue with analyzing DNA mixtures,” Smith said. “We couldn’t detect larger fragments as well as we could detect smaller ones, which caused bias in the mass spectrometry data.” Recognizing this was caused by instrumental limitations, Smith developed a technique called charge reduction mass spectrometry that solved the issue. His technique was just as effective for analyzing proteins as for analyzing nucleic acids.
Since then, Smith has shifted his focus to proteomics, pioneering mass spectrometry-based methods that enable the identification of proteoforms, variations of proteins arising from a single gene that play crucial roles in health and disease but are challenging to detect with conventional methods. His team also led the development of advanced software tools to enhance the speed, accuracy, and depth of proteomic analysis. These tools have become instrumental resources for researchers to visualize and interpret complex proteomics data, offering valuable insights into biological systems.
Can you explain the basics of bottom-up proteomics?
In bottom-up proteomics, we usually want to analyze the whole proteomes of cells. We start by lysing the cells to release their contents, spinning out insoluble materials, and isolating the proteins. We then digest these proteins into smaller peptides using enzymes, usually trypsin and sometimes other enzymes. This step converts the protein mixture into a complex peptide mixture. The next step involves liquid chromatography-mass spectrometry to analyze the peptides. We may include intermediate steps like fractionation, where peptides go through a chromatography column and come out separated from other peptides. The peptides are then ionized via electrospray ionization and enter the mass spectrometer.
Mass spectrometry occurs in two steps, known as tandem mass spectrometry. The first step is called MS1, which determines the masses of the peptides as they elute from the column. The instrument software then automatically selects the most intense peptide ions for further analysis. In the second step, called MS2, these selected peptide ions fragment inside the mass spectrometer, generating smaller ions that allow us to identify the peptides.
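As a rough illustration of the MS1-to-MS2 handoff described above, the sketch below (in Python, with hypothetical data structures and an illustrative dynamic-exclusion list; not code from Smith’s group) shows how data-dependent acquisition might pick the most intense precursor ions from an MS1 scan for fragmentation.

```python
from dataclasses import dataclass

@dataclass
class Peak:
    mz: float         # mass-to-charge ratio observed in the MS1 scan
    intensity: float  # ion abundance

def select_precursors(ms1_peaks, top_n=10, excluded_mz=None):
    """Pick the top-N most intense MS1 peaks for MS2 fragmentation,
    skipping m/z values on a (hypothetical) dynamic-exclusion list."""
    excluded_mz = excluded_mz or set()
    candidates = [p for p in ms1_peaks if round(p.mz, 2) not in excluded_mz]
    candidates.sort(key=lambda p: p.intensity, reverse=True)
    return candidates[:top_n]

# Example: from three MS1 peaks, send the two most intense on to MS2.
scan = [Peak(445.12, 1.2e6), Peak(622.84, 8.9e5), Peak(501.33, 3.4e6)]
for peak in select_precursors(scan, top_n=2):
    print(f"fragment precursor at m/z {peak.mz:.2f}")
```

On a real instrument, this selection also accounts for charge state, isolation windows, and dynamic exclusion of recently fragmented ions.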
What are the advantages and limitations of bottom-up proteomics compared to top-down proteomics?
Bottom-up proteomics is more widely used than top-down proteomics, which skips the digestion step and analyzes intact proteins, because peptides are less complex molecules than whole proteins. This means we get simpler spectra that are easier to generate, understand, and interpret. However, the downside of bottom-up proteomics is the loss of the context of what else was in the protein. For example, two different forms of a protein, called proteoforms, might produce the same peptide upon digestion, making it difficult to distinguish between them during the analysis. This loss of context is a key difference between bottom-up and top-down proteomics, where the latter maintains the intact protein, providing more detailed information about the protein’s molecular forms.
How should researchers select appropriate enzymes for protein digestion?
Having more enzymes is generally better. Cutting proteins with trypsin can generate both great peptides and less informative peptides that are too long, too short, or too hydrophobic. Using multiple enzymes can create more overlapping sets of peptides, where one enzyme may help reveal sequences missed by another. This overlap helps us stitch together a more complete sequence.
My student Rachel Miller developed a software tool to help researchers select which enzymes to use to digest their samples. It simulates the digestion process, predicts the peptide fragments generated from the digestion, and assesses their length, hydrophobicity, and potential protein coverage. This tool helps researchers identify which enzymes will yield optimal results before conducting wet lab experiments, enhancing efficiency without unnecessary experimentation.
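Miller’s tool itself is not reproduced here, but the core idea of in silico digestion can be sketched simply: apply an enzyme’s cleavage rule to a protein sequence, collect the predicted peptides, and flag those whose length falls in a range likely to be observed. The cleavage rule, length cutoffs, and example sequence below are illustrative simplifications.

```python
def digest_trypsin(sequence):
    """Simplified trypsin rule: cleave after K or R, but not before P."""
    sites = [0] + [i + 1 for i in range(len(sequence) - 1)
                   if sequence[i] in "KR" and sequence[i + 1] != "P"] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(sites, sites[1:])]

def usable(peptide, min_len=7, max_len=30):
    """Crude length filter for peptides likely to be observed by LC-MS/MS."""
    return min_len <= len(peptide) <= max_len

protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVK"
peptides = digest_trypsin(protein)
kept = [p for p in peptides if usable(p)]
coverage = sum(map(len, kept)) / len(protein)
print(f"{len(peptides)} peptides predicted, {len(kept)} usable, ~{coverage:.0%} sequence coverage")
```

A fuller simulation would also model missed cleavages, alternative enzymes, and peptide hydrophobicity, which is what makes tools like the one described above useful before committing to wet lab work.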
What else can help improve data quality during sample preparation?
One effective way to improve data quality is through multidimensional separation. For example, when dealing with a complex sample, using just reverse-phase liquid chromatography generates a lot of co-eluting peptides, which could lead to missed peptide identifications and errors. In mass spectrometry, more separations mean higher data quality, but they also take more time. So, it’s always a trade-off between time and output, especially with complex samples.
What are the approaches for protein identification?
In bottom-up proteomics, there are two main approaches. The first and most common approach involves generating theoretical mass spectra of peptides from an in silico proteome and then matching these with experimental spectra. This helps identify the most probable theoretical peptides from the experimental data. John Yates’ group at Scripps Research developed this strategy and wrote a program called SEQUEST in the 1990s, which became the paradigm under which we operate. The other approach, called de novo sequencing, involves identifying proteins without pre-existing sequence knowledge. However, this method carries a higher risk of protein or peptide misidentification. With this approach, researchers often need to combine multiple enzymes with trypsin to gather more data.
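To make the first, database-search approach concrete, the toy example below (not SEQUEST or any production search engine) scores a few candidate peptides against a single experimental MS2 spectrum by counting the theoretical b- and y-ion masses that fall within a small tolerance of an observed peak; the candidate list, spectrum, and tolerance are all hypothetical, and real engines use far more sophisticated scoring.

```python
# Monoisotopic residue masses (Da) for a few amino acids; a real tool covers all twenty.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
           "L": 113.08406, "K": 128.09496, "E": 129.04259, "F": 147.06841}
PROTON, WATER = 1.00728, 18.01056

def theoretical_fragments(peptide):
    """Singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y

def count_matches(theoretical, experimental, tol=0.02):
    """Number of theoretical fragments with an experimental peak within tol (Da)."""
    return sum(any(abs(t - e) <= tol for e in experimental) for t in theoretical)

# Toy search: score each candidate peptide against one experimental MS2 spectrum.
spectrum = [148.08, 277.12, 376.19, 147.11, 246.18, 375.22]
for candidate in ["FEVK", "LSAGK", "GAVSK"]:
    frags = theoretical_fragments(candidate)
    print(candidate, count_matches(frags, spectrum), "of", len(frags), "fragments matched")
```

Here the first candidate accounts for every observed peak and would be reported as the best peptide-spectrum match.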
In bottom-up proteomics, the results are typically probabilistic rather than absolute. When analyzing a large number of molecules, we rarely get a definitive answer. Instead, we obtain results with a high probability of correctness, which we call a false discovery rate. This rate accounts for the likelihood that some identifications may be incorrect. Peptide identification methods allow researchers to make informed guesses about which proteoforms are present in the sample. However, these conclusions are based on the data at hand and are not definitive.
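A common way to put a number on that uncertainty is the target-decoy strategy: search the spectra against real (target) sequences plus reversed or shuffled (decoy) sequences, and estimate the false discovery rate at a score cutoff as the ratio of decoy to target hits above it. A minimal sketch, with made-up scores:

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (# decoy hits) / (# target hits) at or above a score threshold.
    Each PSM is a (score, is_decoy) pair from searching targets plus decoys."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def threshold_for_fdr(psms, max_fdr=0.01):
    """Find the lowest score cutoff whose estimated FDR stays under max_fdr."""
    chosen = None
    for cutoff in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, cutoff) <= max_fdr:
            chosen = cutoff
        else:
            break
    return chosen

# Hypothetical peptide-spectrum matches: (search-engine score, matched a decoy sequence?)
psms = [(42.1, False), (38.7, False), (35.2, False), (33.0, True), (30.5, False), (28.9, True)]
print("1% FDR score cutoff:", threshold_for_fdr(psms, 0.01))
```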
What tools do you use to analyze mass spectrometry data?
We’ve developed an open source search engine named MetaMorpheus, which we built based on a tool called Morpheus, created by Craig Wenger in the Joshua Coon group at the University of Wisconsin-Madison. Both MetaMorpheus and Morpheus use a scoring function to match experimental and theoretical spectra for peptide identification. It’s very important that software tools are open source because that allows everyone to understand exactly what they’re doing and how they’re doing it. With open source code, others can look at the code and improve it without needing to reinvent it. We made all our source code available on GitHub, and we assist users who encounter problems. We have also implemented industrial-strength software robustness policies to test the code before new releases. Typically, a student leads a project to develop a new idea. They transform the idea into an algorithm and write the code to execute it. We then integrate this new capability into MetaMorpheus.
What are the applications of MetaMorpheus?
A particularly useful tool in MetaMorpheus is G-PTM-D, which stands for global post-translational modification discovery. This tool helps us identify new post-translational modifications (PTMs). Traditionally, researchers use a variable modification search to find PTMs. This approach requires creating a large database with all possible PTM sites and searching for the best match. This process increases analysis time and error rate because of the large number of incorrect entries in the database.
To address this, we implemented a method that creates smaller databases based on just the observed mass shifts. This allows us to search for all modifications simultaneously and identify PTMs without prior knowledge of what they are or their location. In a recent paper published in the Journal of Proteome Research, we used G-PTM-D to discover previously unknown modifications in the human immunodeficiency virus capsid and matrix proteins that were biologically meaningful.
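The G-PTM-D algorithm itself lives in MetaMorpheus, but the kernel of a mass-shift-driven search can be sketched simply: compute the difference between the observed precursor mass and the unmodified peptide mass, and compare it against a table of known modification masses; matching peptides can then be added, with the candidate modification annotated, to a smaller second-pass database. The modification table, masses, and tolerance below are an illustrative subset.

```python
# Monoisotopic mass shifts (Da) for a few common modifications; illustrative subset only.
KNOWN_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "oxidation": 15.99491,
    "methylation": 14.01565,
}

def annotate_mass_shift(observed_mass, unmodified_mass, tol=0.01):
    """Match the precursor-vs-peptide mass difference to a known modification,
    in the spirit of an open, G-PTM-D-style search (greatly simplified)."""
    delta = observed_mass - unmodified_mass
    for name, shift in KNOWN_SHIFTS.items():
        if abs(delta - shift) <= tol:
            return name, delta
    return None, delta

# A peptide observed ~80 Da heavier than its unmodified mass suggests phosphorylation;
# the annotated candidate would then go into the smaller second-pass database.
mod, delta = annotate_mass_shift(observed_mass=1032.486, unmodified_mass=952.520)
print(f"mass shift {delta:+.3f} Da -> {mod or 'unannotated'}")
```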
What are some recent enhancements you’ve made to your data analysis tools?
We’ve added many capabilities over time. For example, we’ve developed a tool called FlashLFQ, with LFQ standing for label-free quantitation. We integrated FlashLFQ into MetaMorpheus, allowing it to work with G-PTM-D to quantify modified peptides accurately. Currently, we’re making algorithmic modifications to improve the accuracy of these tools for single-cell proteomic data. This is important because single-cell proteomic data is often of poor quality, and standard software designed for high-quality data isn’t effective for low-quality, noisy data.
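FlashLFQ’s actual algorithms are part of the MetaMorpheus ecosystem; as a bare-bones illustration of label-free quantitation, a peptide’s abundance can be estimated by integrating its MS1 intensity across the chromatographic elution peak, as in the hypothetical extracted-ion chromatogram below.

```python
def integrate_elution_peak(points):
    """Trapezoidal integration of MS1 intensity over retention time
    to get a label-free abundance estimate for one peptide."""
    area = 0.0
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        area += (i0 + i1) / 2 * (t1 - t0)
    return area

# Hypothetical extracted-ion chromatogram: (retention time in min, MS1 intensity).
xic = [(20.1, 0.0), (20.2, 4.0e5), (20.3, 1.1e6), (20.4, 9.0e5), (20.5, 2.0e5), (20.6, 0.0)]
print(f"label-free abundance ~ {integrate_elution_peak(xic):.3g}")
```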
What advice would you give to researchers new to bottom-up proteomics?
One of the first decisions is choosing the chromatography method. Our group specializes in nanoflow chromatography. It’s very sensitive but also finicky. Alternatively, many facilities use microflow chromatography, which employs larger columns and is less sensitive but simpler to set up. The next critical step is mastering the instrument’s software and operation. There are a lot of decisions to make, such as choosing between data-dependent acquisition and data-independent acquisition, each with its own strengths and weaknesses. Once data starts accumulating, the challenge shifts to data analysis. Converting spectra into protein lists requires specialized software, and understanding isotope-labeling and label-free quantitation methods is essential for quantitative analysis. Finally, extracting biological insights from identified proteins involves using a whole host of tools for network and pathway analysis and protein-protein interaction studies. Each step requires attention and patience. For beginners, I recommend learning from colleagues or visiting a specialized lab to get hands-on experience. Using user-friendly software, such as MetaMorpheus, can also be beneficial.
What are the current challenges and future directions you see in proteomics?
One of the key challenges we face is the complexity of our samples. Often, multiple peptide variants co-elute from the chromatographic column and enter the mass spectrometer simultaneously, resulting in messy spectra. Current software typically identifies the most prominent peptide variants, leaving others unidentified. Our goal is to develop software capable of identifying all co-eluting peptide variants, even when they overlap, which could significantly increase the throughput of bottom-up proteomics. That’s an exciting opportunity to advance the field.
In broader terms, bottom-up proteomics is relatively mature, benefiting from substantial advancements in instrumentation and methodology over the years. The real frontier in proteomics is top-down proteomics, particularly in proteoform analysis. These areas are exciting and challenging, with many new things to do and a lot of room to improve.
This interview has been condensed and edited for clarity.