New bioinformatics methods unlock analysis of major public genomics projects

Dr. Levi Waldron, an investigator with the ISPH and a professor at the CUNY Graduate School of Public Health and Health Policy (CUNY SPH), has published two bioinformatics papers in the journals Cancer Research and Nature Methods. These papers present novel databases and bioinformatic methods that make major cancer and human microbiome datasets practical to analyze for a much broader range of researchers than could previously use these publicly generated resources. The methods are implemented as components of the free R and Bioconductor software for statistical analysis of high-throughput biological data. “These works bridge gaps between big and difficult to manage data, and the public health researchers with ideas to pursue,” explains Waldron.

The first of these papers, published in Cancer Research, was led by an alumnus of the CUNY SPH biostatistics master’s program, Marcel Ramos. The paper presents a novel data structure for representing and analyzing multi-omics experiments, which collect multiple types of observations, such as DNA mutations and the abundance of RNA and proteins, from the same biological specimens. These kinds of experiments generate comprehensive molecular portraits of tumors and other biological tissues but can be extremely complex to analyze. The published method introduces a network representation linking each observation to its patient and associated clinical data, providing an integrative representation for any number of heterogeneous kinds of measurements. This harmonized representation gives researchers and methods developers a simpler interface for previously complicated and error-prone analysis procedures.
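The idea of linking each observation to its patient can be illustrated with a minimal sketch. This is not the published Bioconductor implementation, which is written in R; it is a Python illustration under stated assumptions, and every name and value in it (`clinical`, `assays`, `sample_map`, `observations_for`, the patient and specimen IDs) is hypothetical.

```python
from collections import defaultdict

# Hypothetical patient-level clinical records, keyed by patient ID.
clinical = {
    "P1": {"age": 61, "stage": "II"},
    "P2": {"age": 48, "stage": "III"},
}

# Hypothetical assays: each stores measurements keyed by its own observation ID.
# Assays need not cover the same patients, and a patient may have replicates.
assays = {
    "mutations": {"P1-mut-a": 35, "P2-mut-a": 120},
    "rna": {"P1-rna-a": 7.2, "P1-rna-b": 7.5},
}

# The links between the two: (assay name, observation ID, patient ID).
sample_map = [
    ("mutations", "P1-mut-a", "P1"),
    ("mutations", "P2-mut-a", "P2"),
    ("rna", "P1-rna-a", "P1"),
    ("rna", "P1-rna-b", "P1"),
]


def observations_for(patient_id):
    """Collect the clinical record and all linked observations for one patient."""
    found = defaultdict(dict)
    for assay, obs_id, pid in sample_map:
        if pid == patient_id:
            found[assay][obs_id] = assays[assay][obs_id]
    return {"clinical": clinical[patient_id], "assays": dict(found)}


if __name__ == "__main__":
    print(observations_for("P1"))
```

Keeping the links in one table, rather than assuming every assay has the same patients in the same order, is what lets a single lookup handle missing assays and replicate specimens.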

The method and its software implementation are applicable across numerous diseases and data types. The team integrated 12 types of molecular data with clinical and pathological information from over 11,000 patients with 33 different cancer types from The Cancer Genome Atlas (TCGA), a nationwide project of the National Cancer Institute, and made these integrated data publicly available. Whereas other software has provided downloading capabilities for these data, or integrated small subsets of them, this work represents the first comprehensive integration of the TCGA data. The authors demonstrate how previously laborious analyses, such as correlating the rates of DNA copy number alterations with the rates of somatic mutation in breast and colorectal cancer, can be accomplished in a few lines of code.
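To give a sense of why such an analysis becomes short once the data are linked by patient, here is a minimal Python sketch of correlating two per-patient measurements. It is not the authors' code, which runs in R/Bioconductor. The values are invented purely for illustration and say nothing about real tumors, and the names `mutation_count`, `cna_fraction`, and `pearson` are hypothetical.

```python
from math import sqrt

# Invented per-patient values, keyed by patient ID (illustration only).
mutation_count = {"P1": 40, "P2": 95, "P3": 210, "P4": 60}
cna_fraction = {"P1": 0.30, "P2": 0.35, "P3": 0.05, "P5": 0.41}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Keep only patients measured in both assays, in a fixed order.
shared = sorted(mutation_count.keys() & cna_fraction.keys())

r = pearson(
    [mutation_count[p] for p in shared],
    [cna_fraction[p] for p in shared],
)

if __name__ == "__main__":
    print(shared, round(r, 3))
```

The step that is laborious without an integrated representation is the matching of patients across assays, which here reduces to a single set intersection on shared patient IDs.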

The second paper, published in Nature Methods, is the result of a collaboration that started during Waldron’s period as a Fulbright scholar to Italy in 2016, and is co-led by another alumnus of the CUNY SPH biostatistics master’s program, Lucas Schiffer. This project provides an integrated database of publicly available human microbiome profiles that were generated by whole-metagenome “shotgun” sequencing. This method involves sequencing the combined DNA of all microbes present at various sites of the human body to determine which microbes are present and their potential for metabolic function based on the microbial genes that are present.

The team developed a pipeline to download, process, integrate, and redistribute six data types from over 6,000 human subjects suffering from 34 types of disease in 28 countries. The database includes the Human Microbiome Project, an initiative of the National Institutes of Health to characterize the healthy human oral, skin, vaginal, gut, and nasal/lung microbiomes. The team downloaded 63 TB of raw sequencing data and consumed over 100,000 CPU hours at the CUNY high-performance computing center to align these sequencing reads to marker genes from ~17,000 microbial reference genomes, determining which microbial species are present and which metabolic functions they may be performing. Sharing these data required developing a novel method for cloud-based distribution of such large databases directly into the R/Bioconductor analysis environment. The same method is used to distribute the cancer data described above.

The group used this new database to refute the existence of previously reported “enterotypes” of the human gut microbiome, and to show that the presence of certain diseases, such as liver cirrhosis, colorectal cancer, and type-2 diabetes, could be predicted equally accurately by machine learning techniques using any of the data types provided.

Discussing the significance of both studies, Waldron explains, “Our hope is that patients who donate their specimens and their DNA to help find preventions and cures for their disease, will see the importance of their contributions as newly empowered researchers from around the world are able to set to work turning their data into discoveries.”

In addition to Waldron, Ramos, and Schiffer, a number of CUNY SPH faculty, students, and alumni contributed to the studies: Dr. Jennifer Dowd, a research associate professor; current student Audrey Renson; and alumni Carmen Rodriguez, Tiffany Chan, and Hanish Kodali.

1. Ramos M, Schiffer L, Re A, Azhar R, Basunia A, Rodriguez C, Chan T, Chapman P, Davis SR, Gomez-Cabrero D, Culhane AC, Haibe-Kains B, Hansen KD, Kodali H, Louis MS, Mer AS, Riester M, Morgan M, Carey V, Waldron L: Software for the Integration of Multiomics Experiments in Bioconductor. Cancer Research 2017, 77:e39–e42.

2. Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong DT, Beghini F, Malik F, Ramos M, Dowd JB, Huttenhower C, Morgan M, Segata N, Waldron L: Accessible, curated metagenomic data through ExperimentHub. Nature Methods 2017, 14:1023–1024.