  • Publication
    A field guide for the compositional analysis of any-omics data
    (BioMed Central Ltd, 2019-09)
    Quinn, Thomas P; Erb, Ionas; Gloor, Greg; Notredame, Cedric; Richardson, Mark F

    Background: Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared.
    Results: Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data.
    Conclusions: In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, "Relative to some important activity of the cell, what is changing?"

  • Publication
    Benchmarking differential expression analysis tools for RNA-Seq: normalization-based vs. log-ratio transformation-based methods
    (BioMed Central Ltd, 2018-07-18)
    Quinn, Thomas P; Richardson, Mark F

    Background: Count data generated by next-generation sequencing assays do not measure absolute transcript abundances. Instead, the data are constrained to an arbitrary "library size" by the sequencing depth of the assay, and typically must be normalized prior to statistical analysis. The constrained nature of these data means one could alternatively use a log-ratio transformation in lieu of normalization, as often done when testing for differential abundance (DA) of operational taxonomic units (OTUs) in 16S rRNA data. Therefore, we benchmark how well the ALDEx2 package, a transformation-based DA tool, detects differential expression in high-throughput RNA-sequencing data (RNA-Seq), compared to conventional RNA-Seq methods such as edgeR and DESeq2.
    Results: To evaluate the performance of log-ratio transformation-based tools, we apply the ALDEx2 package to two simulated, and two real, RNA-Seq data sets. One of the latter was previously used to benchmark dozens of conventional RNA-Seq differential expression methods, enabling us to directly compare transformation-based approaches. We show that ALDEx2, widely used in meta-genomics research, identifies differentially expressed genes (and transcripts) from RNA-Seq data with high precision and, given sufficient sample sizes, high recall too (regardless of the alignment and quantification procedure used). Although we show that the choice in log-ratio transformation can affect performance, ALDEx2 has high precision (i.e., few false positives) across all transformations. Finally, we present a novel, iterative log-ratio transformation (now implemented in ALDEx2) that further improves performance in simulations.
    Conclusions: Our results suggest that log-ratio transformation-based methods can work to measure differential expression from RNA-Seq data, provided that certain assumptions are met. Moreover, these methods have very high precision (i.e., few false positives) in simulations and perform well on real data too. With previously demonstrated applicability to 16S rRNA data, ALDEx2 can thus serve as a single tool for data from multiple sequencing modalities.

  • Publication
    Solving for X: Evidence for sex-specific autism biomarkers across multiple transcriptomic studies
    (John Wiley & Sons, Inc, 2019-09)
    Lee, Samuel C; Quinn, Thomas P; Lai, Jerry; Kong, Sek Won; Hertz-Picciotto, Irva; Glatt, Stephen J; Venkatesh, Svetha; Thin Nguyen

    Autism spectrum disorder (ASD) is a markedly heterogeneous condition with a varied phenotypic presentation. Its high concordance among siblings, as well as its clear association with specific genetic disorders, both point to a strong genetic etiology. However, the molecular basis of ASD is still poorly understood, although recent studies point to the existence of sex-specific ASD pathophysiologies and biomarkers. Despite this, little is known about how exactly sex influences the gene expression signatures of ASD probands. In an effort to identify sex-dependent biomarkers and characterize their function, we present an analysis of a single paired-end postmortem brain RNA-Seq data set and a meta-analysis of six blood-based microarray data sets. Here, we identify several genes with sex-dependent dysregulation, and many more with sex-independent dysregulation. Moreover, through pathway analysis, we find that these sex-independent biomarkers have substantially different biological roles than the sex-dependent biomarkers, and that some of these pathways are ubiquitously dysregulated in both postmortem brain and blood. We conclude by synthesizing the discovered biomarker profiles with the extant literature, by highlighting the advantage of studying sex-specific dysregulation directly, and by making a call for new transcriptomic data that comprise large female cohorts.

  • Publication
    Understanding sequencing data as compositions: an outlook and review
    (ASFRA B V, 2018-08)
    Quinn, Thomas P; Erb, Ionas; Richardson, Mark F

    Motivation: Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models.

    Results: The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study.
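
Several of the abstracts above propose a log-ratio transformation as an alternative to normalization. As a rough illustration only, the following is a minimal sketch of one such transformation, the centered log-ratio (CLR), assuming a single sample with strictly positive counts. The abstracts do not say which log-ratio transformation is meant, and the function name `clr` is hypothetical: it is not the ALDEx2 API or the authors' implementation.

```python
import math


def clr(counts):
    """Centered log-ratio of one sample (illustrative helper, not a library API).

    Each value is log-transformed, then the mean of the logs is subtracted,
    which is the same as dividing each count by the sample's geometric mean
    before taking logs.
    """
    if any(c <= 0 for c in counts):
        # Assumption of this sketch: zeros must be handled before calling.
        raise ValueError("clr() needs strictly positive values")
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# The same composition sequenced at two depths (library sizes 100 and 1000).
shallow = [10, 20, 70]
deep = [100, 200, 700]

# The transformed values agree up to floating-point error, because a common
# scaling factor cancels when the mean of the logs is subtracted.
print(clr(shallow))
print(clr(deep))
```

The two printed vectors agree because the library size enters every count as a common factor, which the subtraction removes. Real count data usually contain zeros, which this sketch rejects rather than handles.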