Abstract
This study aimed to evaluate the diagnostic accuracy of multimodal large language models in classifying superior labial frenulum attachments from intraoral photographs, using expert consensus as the reference standard. Five experts (two periodontists and three orthodontists) established the consensus standard by classifying the frenulum attachments in 117 intraoral images as mucosal, gingival, papillary, or papilla-penetrating. The same photographs were then presented to three multimodal large language models (ChatGPT 4o, Gemini 2.5 Pro, and Microsoft Copilot GPT-4), and their diagnostic performance was evaluated using accuracy, sensitivity, specificity, and F1 score. Reliability was assessed using Fleiss' and Cohen's kappa, and diagnostic performance was compared across models using Cochran's Q test. Human raters demonstrated almost perfect agreement (κ = 0.838, p < 0.001), whereas the large language models showed poor inter-model agreement (κ = -0.124, p < 0.001). ChatGPT achieved slight but statistically significant agreement with the consensus (κ = 0.114, p = 0.019), although this agreement was of negligible clinical relevance. Gemini (κ = 0.099) and Copilot (κ = 0.027) showed no significant agreement with the consensus (p > 0.05). Copilot yielded the highest overall accuracy (46.2%), followed by Gemini (44.5%) and ChatGPT (35.0%). The performance of the large language models varied across frenulum types. Current multimodal large language models demonstrate inconsistent and clinically insufficient accuracy in classifying superior labial frenulum attachments from photographs. Domain-specific training is essential before large language models can be considered reliable diagnostic tools in dentistry.
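The agreement and performance measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `cohen_kappa` and `one_vs_rest`, the class label strings, and the eight toy labels are all assumptions made for illustration, and none of the numbers come from the study.

```python
from collections import Counter

# Hypothetical label strings for the four attachment types named in the abstract.
CLASSES = ["mucosal", "gingival", "papillary", "papilla_penetrating"]


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 0.0 by convention.
        return 0.0
    return (observed - expected) / (1.0 - expected)


def one_vs_rest(reference, predicted, positive):
    """Sensitivity, specificity, and F1 for one class against all others."""
    tp = fp = fn = tn = 0
    for ref, pred in zip(reference, predicted):
        if pred == positive and ref == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif ref == positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return sensitivity, specificity, f1


# Toy data only: eight invented cases, not drawn from the 117 study images.
reference = ["mucosal", "mucosal", "gingival", "gingival",
             "papillary", "papillary",
             "papilla_penetrating", "papilla_penetrating"]
predicted = ["mucosal", "gingival", "gingival", "gingival",
             "papillary", "mucosal",
             "papilla_penetrating", "papillary"]

accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
kappa = cohen_kappa(reference, predicted)

if __name__ == "__main__":
    print(f"accuracy = {accuracy:.3f}, kappa = {kappa:.3f}")
    for label in CLASSES:
        sens, spec, f1 = one_vs_rest(reference, predicted, label)
        print(f"{label}: sensitivity={sens:.3f} specificity={spec:.3f} F1={f1:.3f}")
```

On this toy data the raw accuracy is 0.625 but kappa is 0.5, because a quarter of the agreement would be expected by chance alone. The same correction explains how a model in the study can reach an accuracy above 40% while its kappa against the consensus stays close to zero.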
Publication Info
- Year: 2025
- Type: article
- Citations: 0
- Access: Closed
Identifiers
- DOI: 10.1007/s10266-025-01283-2
- PMID: 41369714