TY  - JOUR
AU  - Pastrak, Mila
AU  - Kajitani, Sten
AU  - Goodings, Anthony James
AU  - Drewek, Austin
AU  - LaFree, Andrew
AU  - Murphy, Adrian
PY  - 2025
DA  - 2025/03/12
TI  - Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study
JO  - JMIR AI
SP  - e67696
VL  - 4
KW  - artificial intelligence
KW  - ChatGPT-4
KW  - medical education
KW  - emergency medicine
KW  - examination
KW  - examination preparation
AB  - Background: The ever-evolving field of medicine has highlighted the potential for ChatGPT as an assistive platform. However, its use in medical board examination preparation and completion remains unclear. Objective: This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (Anki flashcard deck), compared to its default version and previous iteration (3.5). The goal was to assess the accuracy of ChatGPT-4 answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation. Methods: A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. The subjects of the study included three versions of ChatGPT: the Default, a Custom, and ChatGPT-3.5. The accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed. Results: The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with the mean (SD) values being 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in the performance across different medical subdisciplines between the versions (P>.05 in all cases). Both the versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing while ChatGPT-3.5 had an 85% probability. Conclusions: The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form.
SN  - 2817-1705
UR  - https://ai.jmir.org/2025/1/e67696
UR  - https://doi.org/10.2196/67696
DO  - 10.2196/67696
ID  - info:doi/10.2196/67696
ER  -