Abstract

Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that that choice be justified. To date, justification has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel.
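The abstract describes the selection rule only in words, so the following is a minimal sketch of one way such a rule could look, not the DT-ModSel implementation. It assumes the standard BIC (-2 ln L + k ln n), BIC-based weights as approximate posterior model probabilities, and Euclidean distance between branch-length vectors as the loss. All function names and the numbers in the usage example are hypothetical.

```python
import math


def bic(log_lik, n_params, n_sites):
    """Standard Bayesian information criterion: -2 ln L + k ln n."""
    return -2.0 * log_lik + n_params * math.log(n_sites)


def bic_weights(bics):
    """Approximate posterior model probabilities from BIC scores.

    Subtracting the minimum BIC before exponentiating avoids underflow.
    """
    best = min(bics)
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def expected_losses(branch_lengths, weights):
    """Expected loss (risk) of choosing each model.

    ASSUMPTION: the loss of using model i when model j is preferred is the
    Euclidean distance between their estimated branch-length vectors.
    """
    risks = []
    for b_i in branch_lengths:
        risk = sum(w * math.dist(b_i, b_j)
                   for b_j, w in zip(branch_lengths, weights))
        risks.append(risk)
    return risks


def select_model(names, log_liks, n_params, branch_lengths, n_sites):
    """Return the name of the minimum-risk model and all risks."""
    bics = [bic(l, k, n_sites) for l, k in zip(log_liks, n_params)]
    weights = bic_weights(bics)
    risks = expected_losses(branch_lengths, weights)
    best = min(range(len(names)), key=risks.__getitem__)
    return names[best], risks


# Hypothetical inputs: three candidate models fitted to one fixed tree.
names = ["JC", "HKY", "GTR"]
log_liks = [-1000.0, -990.0, -989.0]   # made-up maximized log-likelihoods
n_params = [0, 4, 8]                   # made-up free-parameter counts
branch_lengths = [[0.10, 0.20], [0.11, 0.21], [0.12, 0.22]]
chosen, risks = select_model(names, log_liks, n_params, branch_lengths, 1000)
print(chosen, risks)
```

Because every candidate model contributes to every risk, all models are compared at once rather than through a sequence of pairwise tests, and the BIC term supplies the penalty for overfitting.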

Keywords

Overfitting, Model selection, Pairwise comparison, Nested set model, Selection (genetic algorithm), Bayesian probability, Series (stratigraphy), Bayesian information criterion, Information Criteria, Computer science, Mathematics, Statistics, Algorithm, Data mining, Machine learning

Publication Info

Year: 2003
Type: article
Volume: 52
Issue: 5
Pages: 674-683
Citations: 423 (OpenAlex)
Access: Closed

Cite This

Vladimir N. Minin, Zaid Abdo, Paul Joyce et al. (2003). Performance-Based Selection of Likelihood Models for Phylogeny Estimation. Systematic Biology, 52(5), 674-683. https://doi.org/10.1080/10635150390235494

Identifiers

DOI: 10.1080/10635150390235494