@Article{info:doi/10.2196/70566,
  author   = "AlFarabi Ali, Sarah and AlDehlawi, Hebah and Jazzar, Ahoud and Ashi, Heba and Esam Abuzinadah, Nihal and AlOtaibi, Mohammad and Algarni, Abdulrahman and Alqahtani, Hazzaa and Akeel, Sara and Almazrooa, Soulafa",
  title    = "The Diagnostic Performance of Large Language Models and Oral Medicine Consultants for Identifying Oral Lesions in Text-Based Clinical Scenarios: Prospective Comparative Study",
  journal  = "JMIR AI",
  year     = "2025",
  month    = "Apr",
  day      = "24",
  volume   = "4",
  pages    = "e70566",
  keywords = "artificial intelligence; ChatGPT; Copilot; diagnosis; oral medicine; diagnostic performance; large language model; lesion; oral lesion",
  abstract = "Background: The use of artificial intelligence (AI), especially large language models (LLMs), is increasing in health care, including in dentistry. There has yet to be an assessment of the diagnostic performance of LLMs in oral medicine. Objective: We aimed to compare the effectiveness of ChatGPT (OpenAI) and Microsoft Copilot (integrated within the Microsoft 365 suite) with oral medicine consultants in formulating accurate differential and final diagnoses for oral lesions from written clinical scenarios. Methods: Fifty comprehensive clinical case scenarios including patient age, presenting complaint, history of the presenting complaint, medical history, allergies, intra- and extraoral findings, lesion description, and any additional information including laboratory investigations and specific clinical features were given to three oral medicine consultants, who were asked to formulate a differential diagnosis and a final diagnosis. Specific prompts for the same 50 cases were designed and input into ChatGPT and Copilot to formulate both differential and final diagnoses. The diagnostic accuracy was compared between the LLMs and oral medicine consultants. Results: ChatGPT exhibited the highest accuracy, providing the correct differential diagnoses in 37 of 50 cases (74{\%}). There were no significant differences in the accuracy of providing the correct differential diagnoses between AI models and oral medicine consultants. ChatGPT was as accurate as consultants in making the final diagnoses, but Copilot was significantly less accurate than ChatGPT (P=.015) and one of the oral medicine consultants (P<.001) in providing the correct final diagnosis. Conclusions: ChatGPT and Copilot show promising performance for diagnosing oral medicine pathology in clinical case scenarios to assist dental practitioners. ChatGPT-4 and Copilot are still evolving, but even now, they might provide a significant advantage in the clinical setting as tools to help dental practitioners in their daily practice.",
  issn     = "2817-1705",
  doi      = "10.2196/70566",
  url      = "https://ai.jmir.org/2025/1/e70566"
}