NEW YORK—Researchers at the Icahn School of Medicine at Mount Sinai have developed a tool that speeds up the analysis and publication of biomedical data from months or years to minutes, potentially transforming the way researchers can communicate the results of their studies.
Until now, the primary way to share biomedical research data has been print publication in scientific journals. This new tool, BioJupies, relies on cloud technologies to analyze and visualize large amounts of data, such as that acquired by genome sequencing. The tool was described in an article in the November issue of Cell Systems.
RNA sequencing is the most common experimental method used to profile cells in biomedical research. In recent years, sequencing technology has revolutionized the way scientists examine genetic data, and this advancement plays a crucial role in drug discovery and development. Traditionally, RNA sequencing analysis has required extensive computer programming skills and access to local high-performance computing facilities, slowing the pace at which biomedical data can be analyzed, shared and published.
“BioJupies is an online software system that guides a user through a step-by-step process. The first step asks the user to upload their raw RNA-sequencing data to our server. Once the data is uploaded, we align the raw reads to the reference genome using a very efficient and low-cost pipeline that is running on the Amazon cloud. This step is currently a bottleneck for many researchers because of cost, access to high-performance computing and required knowledge of computer programming,” says Dr. Avi Ma’ayan, director of the Mount Sinai Center for Bioinformatics, a professor in the Department of Pharmacological Sciences as part of the Icahn Institute for Data Science and the senior author of the publication.
“Once the raw RNA-sequencing data are aligned, the user selects the visualization and analysis tools they would like to apply to the processed data,” Ma’ayan adds. “Once these tools are selected, BioJupies generates a Jupyter notebook analysis report from the data. The analysis report contains interactive figures, tables and text that describes the analysis. The online report also has the source code of the analysis so others can rerun the analysis or modify it and customize it. It is completely free and available to anyone. It is also an open-source project.”
“BioJupies enables the generation of Jupyter Notebooks from RNA-seq data in both raw and processed forms. In case of processed RNA-seq data, the user uploads numeric gene counts in a tabular format,” the authors wrote in the paper. “This can be an Excel spreadsheet or a comma-separated text file containing gene symbols as row names, samples as the column names, and gene counts as values. In addition, metadata that describes the samples can be uploaded in a separate Excel spreadsheet or a comma-separated text file. A detailed explanation of the format to upload the data, including links to download example datasets, is provided on the BioJupies website’s help section.
“In the case of raw RNA-seq data, the user is provided with a user interface that enables them to upload FASTQ files through an HTML form. The user is required to specify the organism, and whether the RNA-seq data were generated using single-end or paired-end sequencing. Once this information is collected, gene expression levels for each gene are quantified by launching parallel jobs in the cloud using the kallisto pseudoaligner (Bray et al., 2016). We have benchmarked kallisto with other aligners and found it to produce comparable count accuracy at a significant lower cost (Lachmann et al., 2018) ... Once the quantification step is complete, which may take up to 15 min, sample counts are merged to generate a gene count matrix. From that point on, the user follows the same steps to generate notebooks as with processed uploaded data (gene counts matrix), by adding sample metadata and selecting the analysis tools they wish to employ.”
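For readers who want a concrete picture of the data layout the authors describe, the following is a minimal Python sketch, not BioJupies source code. It merges per-sample gene counts into a single matrix with gene symbols as row names and samples as column names, writes that matrix as comma-separated text, reads it back, and checks that each sample has a metadata entry. All function names, the "gene_symbol" header label, the example counts, and the choice to fill a gene missing from a sample with 0 are assumptions made for illustration.

```python
import csv
import io


def merge_sample_counts(per_sample):
    """Merge {sample: {gene: count}} into (genes, samples, matrix).

    Rows are genes and columns are samples. A gene absent from a
    sample is filled with 0 (an assumption made for this sketch).
    """
    samples = sorted(per_sample)
    genes = sorted({g for counts in per_sample.values() for g in counts})
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix


def to_csv(genes, samples, matrix):
    """Serialize the matrix as comma-separated text, one gene per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # "gene_symbol" is a hypothetical label for the row-name column.
    writer.writerow(["gene_symbol"] + samples)
    for gene, row in zip(genes, matrix):
        writer.writerow([gene] + row)
    return buf.getvalue()


def read_counts_csv(text):
    """Parse comma-separated counts back into (genes, samples, matrix)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    samples = header[1:]
    genes, matrix = [], []
    for row in reader:
        genes.append(row[0])
        matrix.append([int(value) for value in row[1:]])
    return genes, samples, matrix


def samples_missing_metadata(samples, metadata):
    """Return the samples that have no entry in the metadata mapping."""
    return [s for s in samples if s not in metadata]


if __name__ == "__main__":
    # Invented example counts for two samples.
    per_sample = {
        "ctrl_1": {"TP53": 120, "MYC": 30},
        "treated_1": {"TP53": 95, "GAPDH": 900},
    }
    genes, samples, matrix = merge_sample_counts(per_sample)
    print(to_csv(genes, samples, matrix))
```

Keeping the sample metadata in a separate mapping mirrors the separate metadata file described in the paper; the check simply flags matrix columns that lack a description.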
Through BioJupies, users reportedly can upload and analyze their RNA sequencing data in a fraction of the time. The platform utilizes a cloud computing pipeline that reduces the cost of RNA-sequencing data processing to less than one cent per sample. BioJupies also produces a complete, open-source, interactive report from the processed data, allowing for 300,000 publicly available RNA sequencing datasets to be fetched, reanalyzed and reused to bolster biomedical research.
“As the amount of biomedical data generated continues to climb exponentially, so should the tools used to analyze and share them,” Ma’ayan continues. “BioJupies not only accelerates the manner in which we analyze and interpret data, but it also provides a completely new way to share results with the global research community.”
As new genomic technologies have allowed for the collection of massive amounts of biomedical information that can be harnessed for precision medicine efforts, the accessibility, interoperability and reusability of these data have become crucial to scientific research. BioJupies paves the way for researchers with no computational background to perform RNA sequencing analysis without the need to collaborate with bioinformaticians, enabling more medical and scientific advancements to flourish in our data-rich world.
“The ultimate purpose of the tool is to make it much easier for experimentalists to analyze and share their data. Ultimately, the notebooks generated by BioJupies are similar to a research publication but they also contain source code and interactive figures ... So BioJupies, and tools like it, may change the way researchers publish their work,” adds Ma’ayan. “The approach can be expanded to handle other data types. We hope that the community will contribute analysis plug-ins so users will be able to have a greater selection of analysis tools to choose from. We also plan to have a place for people to publish their notebooks in an online journal so notebooks can be cited by other notebooks and other forms of publication.”
BioJupies is freely available as a web-based application at http://biojupies.cloud.