Abstract

The National Health and Nutrition Examination Survey (NHANES) is one of the largest curated repositories of population-level health indicators, including physical examinations, blood and urine biochemistry, self-reported surveys, and dietary intake. It offers rich resources for oral health research but presents challenges for machine learning analysis because of its heterogeneity, missing data, and complexity. Dental caries, the most prevalent chronic disease worldwide, is multifactorial and varies in clinical manifestation, calling for advanced analytical approaches for deeper understanding. Here, we develop an integrated data-cleaning and subtype discovery pipeline using unsupervised machine learning for comprehensive analysis and visualization of data patterns in the NHANES database. Our multidimensional pipeline declutters and optimizes the NHANES dataset by addressing missingness and outliers, streamlining data integration and producing a machine learning–ready version. Applying this pipeline revealed data patterns that led to the discovery of previously unrecognized subtypes and variables associated with the clinical heterogeneity of dental caries. We observed diverging patterns of similarity across age groups and variable subsets, identifying distinct clusters particularly in children (<5 y) and senior adults (>65 y). We also discovered unexpected associations involving lead exposure and specific laboratory markers and, importantly, identified novel dietary signatures by linking food type and co-occurring consumption patterns to caries. Altogether, we report a comprehensive data-processing and data-analysis approach that reveals significant dental caries heterogeneity in NHANES data and can support the development of more precise and robust machine learning models for dental caries and other health conditions.
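The abstract does not specify the individual steps of the pipeline, so the following is only a minimal sketch of what a cleaning-then-clustering workflow of this kind might look like. It is not the authors' implementation. The function names (`clean_matrix`, `kmeans`), the 50% missingness threshold, the median imputation, the Tukey-fence clipping, and the choice of k-means as the unsupervised method are all assumptions made for illustration.

```python
import numpy as np


def clean_matrix(X, max_missing=0.5, iqr_k=1.5):
    """Hypothetical cleaning step: drop sparse columns, impute, clip, scale.

    max_missing and iqr_k are illustrative defaults, not values from the paper.
    Returns the cleaned matrix and a boolean mask of the columns that were kept.
    """
    X = np.asarray(X, dtype=float)
    # Keep only columns whose fraction of missing (NaN) values is acceptable.
    keep = np.isnan(X).mean(axis=0) <= max_missing
    X = X[:, keep]
    # Fill the remaining gaps with each column's median.
    medians = np.nanmedian(X, axis=0)
    X = np.where(np.isnan(X), medians, X)
    # Clip values that fall outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR).
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    X = np.clip(X, q1 - iqr_k * iqr, q3 + iqr_k * iqr)
    # Standardize each column; constant columns are left at zero.
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd, keep


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means with farthest-first initialization (assumed method)."""
    rng = np.random.default_rng(seed)
    # Farthest-first init: random first center, then the point farthest
    # from all centers chosen so far. Simple, but sensitive to outliers.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every row to its nearest center (squared Euclidean distance).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

In practice, `clean_matrix` would be applied to a numeric participant-by-variable table and its output passed to `kmeans`. Real NHANES analyses would also need to handle survey weights, categorical variables, and the choice of the number of clusters, none of which this sketch addresses.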

Publication Info

Year: 2025
Type: article
Pages: 220345251398027-220345251398027
Citations: 0
Access: Closed

Cite This

Alena Orlenko, Justin D Mure, Joan I. Gluch et al. (2025). Uncovering Dental Caries Heterogeneity in NHANES Using Machine Learning. Journal of Dental Research, 220345251398027-220345251398027. https://doi.org/10.1177/00220345251398027

Identifiers

DOI: 10.1177/00220345251398027