Dr. Levi Waldron, professor at the CUNY Graduate School of Public Health and Health Policy, recently published an article and software package for screening transcriptome databases for duplicates. The work appeared in the Journal of the National Cancer Institute.
[Photo: Dr. Levi Waldron]
Whole-genome analysis of cancer specimens is common, and investigators frequently share or re-use specimens in later studies. Re-use of tissue specimens can create a “doppelgänger effect” in publicly available datasets: hidden duplicates that, if left undetected, can inflate the statistical significance or apparent accuracy of genomic models when data from different studies are combined.
Sufficient germ-line sequence markers provide a “fingerprint” that can be matched uniquely in a database of genotypes; to protect patient privacy, publicly available human genomic data are therefore normally summarized at a level at which individuals cannot be uniquely identified. The research team proposed a method that exploits distinctive alterations to allow highly accurate matching of cancer transcriptome profiles, even in summarized form when nucleotide-level sequence data are unavailable, and even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. They demonstrated the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles, and in matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). They identified probable duplicates among more than 50 percent of studies, including studies that originated on different continents, used different technologies, or were published years apart, and even within TCGA itself.
The research team provides the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelgänger checking should be a standard step when combining multiple genomic datasets.
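To make the idea of matching summarized expression profiles concrete, here is a minimal Python sketch. It is not the doppelgangR implementation or the published method: it is a simplified stand-in that assumes a duplicated specimen shows up as an unusually high correlation between two profiles from different studies. The function names, the simulated data, and the cutoff are all hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def center_genes(dataset):
    """Subtract each gene's within-study mean so study-wide offsets drop out."""
    n_samples = len(dataset)
    n_genes = len(dataset[0])
    means = [sum(s[g] for s in dataset) / n_samples for g in range(n_genes)]
    return [[s[g] - means[g] for g in range(n_genes)] for s in dataset]


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_doppelgangers(study_a, study_b, cutoff=6.0):
    """Return (i, j, r) for cross-study pairs with outlying correlation.

    Assumption for this sketch: correlations are Fisher z-transformed and a
    pair is flagged when it lies more than `cutoff` robust standard
    deviations (median/MAD based) above the median of all pairs.
    """
    a = center_genes(study_a)
    b = center_genes(study_b)
    pairs = []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r = pearson(x, y)
            r_clipped = max(-0.999999, min(0.999999, r))  # keep atanh finite
            pairs.append((i, j, r, math.atanh(r_clipped)))
    zs = [p[3] for p in pairs]
    center = statistics.median(zs)
    mad = statistics.median([abs(z - center) for z in zs])
    scale = 1.4826 * mad  # MAD rescaled to estimate a normal standard deviation
    return [(i, j, r) for i, j, r, z in pairs if (z - center) / scale > cutoff]


# Simulated example: two studies of 20 samples and 200 genes each.
rng = random.Random(0)
n_genes = 200
study_a = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(20)]
study_b = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(20)]

# Hide one re-used specimen: sample 7 of study A reappears as sample 3 of
# study B, re-measured with some noise.
study_b[3] = [v + rng.gauss(0, 0.3) for v in study_a[7]]

# Mimic a different platform by shifting every gene in study B by a fixed amount.
platform_shift = [rng.gauss(0, 2) for _ in range(n_genes)]
study_b = [[v + s for v, s in zip(sample, platform_shift)] for sample in study_b]

flagged = flag_doppelgangers(study_a, study_b)
for i, j, r in flagged:
    print(f"study A sample {i} looks like study B sample {j} (r = {r:.2f})")
```

Centering each gene within its own study is what lets the hidden pair stand out despite the simulated platform shift. Real data would need more careful handling of batch effects and of the outlier threshold than this sketch attempts.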