Abstract
Abstract Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well‐established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10–30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change‐in‐estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p ‐values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low‐dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.
Keywords
MeSH Terms
Affiliated Institutions
Related Publications
MCMC Methods for Multi-Response Generalized Linear Mixed Models: The<b>MCMCglmm</b><i>R</i>Package
Generalized linear mixed models provide a flexible framework for modeling a range of data, although with non-Gaussian response variables the likelihood cannot be obtained in clo...
Latent Class Model Diagnosis
Summary. In many areas of medical research, such as psychiatry and gerontology, latent class variables are used to classify individuals into disease categories, often with the i...
Collinearity: a review of methods to deal with it and a simulation study evaluating their performance
Collinearity refers to the non independence of predictor variables, usually in a regression‐type analysis. It is a common feature of any descriptive ecological data set and can ...
Regression Shrinkage and Selection Via the Lasso
SUMMARY We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients b...
Analysis of Longitudinal Data
1. Introduction 2. Design considerations 3. Exploring longitudinal data 4. General linear models 5. Parametric models for covariance structure 6. Analysis of variance methods 7....
Publication Info
- Year
- 2018
- Type
- review
- Volume
- 60
- Issue
- 3
- Pages
- 431-449
- Citations
- 1460
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1002/bimj.201700067
- PMID
- 29292533
- PMCID
- PMC5969114