Abstract
The benefit of integrating batches of genomic data to increase statistical power is often hindered by batch effects, or unwanted variation in data caused by differences in technical factors across batches. It is therefore critical to address batch effects effectively in genomic data. Many existing methods for batch effect adjustment assume that the data follow a continuous, bell-shaped Gaussian distribution. However, in RNA-seq studies the data are typically skewed, over-dispersed counts, so this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used previously to better capture the properties of counts. We developed a batch correction method, ComBat-seq, using a negative binomial regression model that retains the integer nature of count data in RNA-seq studies, making the batch-adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that ComBat-seq-adjusted data result in better statistical power and control of false positives in differential expression than data adjusted by other available methods. We further demonstrate in a real data example that ComBat-seq successfully removes batch effects and recovers the biological signal in the data.
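To make the phrase "skewed, over-dispersed counts" concrete, the following is a minimal sketch of why a Gaussian or Poisson assumption can misfit such data. It is not the ComBat-seq procedure itself. It assumes the common mean-dispersion parameterization of the negative binomial, in which the variance is mu + phi * mu^2, and it draws counts through the standard gamma-Poisson mixture. The function names and the values of `mu` and `phi` are hypothetical choices for illustration only.

```python
import math
import random
from statistics import mean, variance


def nb_variance(mu, phi):
    """Variance of a negative binomial with mean mu and dispersion phi
    (assumed parameterization: var = mu + phi * mu^2)."""
    return mu + phi * mu ** 2


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method.
    Suitable only for modest lam, which is all this sketch needs."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_negative_binomial(mu, phi, size, rng):
    """Draw integer counts via the gamma-Poisson mixture:
    lam ~ Gamma(shape=1/phi, scale=mu*phi), then count ~ Poisson(lam)."""
    shape, scale = 1.0 / phi, mu * phi
    return [sample_poisson(rng.gammavariate(shape, scale), rng)
            for _ in range(size)]


# Illustrative values only (not taken from the paper).
rng = random.Random(0)
counts = sample_negative_binomial(mu=20.0, phi=0.5, size=10000, rng=rng)

print("sample mean:     ", mean(counts))
print("sample variance: ", variance(counts))
print("theoretical var: ", nb_variance(20.0, 0.5))
```

Under these assumptions the sample variance is far larger than the sample mean, whereas a Poisson model would force the two to be equal. The simulated values are also integers throughout, which is the property that differential expression tools expecting raw counts rely on.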
Publication Info
- Year: 2020
- Type: article
- Volume: 2
- Issue: 3
- Pages: lqaa078-lqaa078
- Citations: 1486
- Access: Closed
Identifiers
- DOI: 10.1093/nargab/lqaa078
- PMID: 33015620
- PMCID: PMC7518324