Abstract

This study aimed to evaluate the diagnostic accuracy of multimodal large language models in classifying superior labial frenulum attachments from intraoral photographs, using expert consensus as the reference standard. Five experts (two periodontists and three orthodontists) established the consensus standard by classifying the frenulum attachments in 117 intraoral images as mucosal, gingival, papillary, or papilla-penetrating. The same photographs were then presented to three multimodal large language models (ChatGPT 4o, Gemini 2.5 Pro, and Microsoft Copilot GPT-4), and their diagnostic performance was evaluated using accuracy, sensitivity, specificity, and F1 score. Reliability was assessed using Fleiss' and Cohen's kappa, and diagnostic performance was compared across models using Cochran's Q test. Human raters demonstrated almost perfect agreement (κ = 0.838, p < 0.001), whereas the large language models showed poor inter-model agreement (κ = -0.124, p < 0.001). ChatGPT achieved slight but statistically significant agreement with the consensus (κ = 0.114, p = 0.019), although its clinical relevance was negligible. Gemini (κ = 0.099) and Copilot (κ = 0.027) showed no significant agreement (p > 0.05). Copilot yielded the highest overall accuracy (46.2%), followed by Gemini (44.5%) and ChatGPT (35.0%). The performance of the large language models varied across frenulum types. Current multimodal large language models demonstrate inconsistent and clinically insufficient accuracy in classifying superior labial frenulum attachments from photographs. Domain-specific training is essential before large language models can be considered reliable diagnostic tools in dentistry.
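
The agreement and performance measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names (`cohen_kappa`, `one_vs_rest`), the class label strings, and the eight example labels are hypothetical and chosen only to show how Cohen's kappa and the one-vs-rest sensitivity, specificity, and F1 score are typically computed against a reference standard.

```python
from collections import Counter

# Hypothetical label strings for the four attachment types named in the abstract.
CLASSES = ["mucosal", "gingival", "papillary", "papilla_penetrating"]


def cohen_kappa(reference, predicted):
    """Cohen's kappa between a reference rating and one other rater."""
    n = len(reference)
    # Observed agreement: fraction of items given the same label.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    labels = set(reference) | set(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


def one_vs_rest(reference, predicted, positive):
    """Sensitivity, specificity, and F1 for one class treated as 'positive'."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    tn = sum(r != positive and p != positive for r, p in pairs)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan,
    }


if __name__ == "__main__":
    # Made-up labels for eight images; NOT data from the study.
    consensus = ["mucosal", "mucosal", "gingival", "gingival",
                 "papillary", "papillary",
                 "papilla_penetrating", "papilla_penetrating"]
    model = ["mucosal", "gingival", "gingival", "gingival",
             "papillary", "mucosal",
             "papilla_penetrating", "papillary"]

    accuracy = sum(r == p for r, p in zip(consensus, model)) / len(consensus)
    print("accuracy:", accuracy)
    print("kappa:", cohen_kappa(consensus, model))
    for label in CLASSES:
        print(label, one_vs_rest(consensus, model, label))
```

Because kappa subtracts the agreement expected by chance, a model can reach a moderate raw accuracy while its kappa stays close to zero, which is the pattern the abstract reports for Copilot and Gemini.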

Keywords

Diagnostic accuracy; Labial frenulum; Large language models

Publication Info

Year: 2025
Type: article
Citations: 0
Access: Closed

Cite This

Mehmet Gümüş Kanmaz, Genta Agani Sabah (2025). Diagnostic accuracy of large language models in the classification of superior labial frenulum attachments. Odontology. https://doi.org/10.1007/s10266-025-01283-2

Identifiers

DOI: 10.1007/s10266-025-01283-2
PMID: 41369714
