Published in Vol 3 (2024)

Feasibility of Multimodal Artificial Intelligence Using GPT-4 Vision for the Classification of Middle Ear Disease: Qualitative Study and Validation


Original Paper

1Department of Otolaryngology, Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan

2Department of Otolaryngology - Head and Neck Surgery, Shinshu University, Matsumoto, Japan

3College of Transdisciplinary Sciences for Innovation, Kanazawa University, Kanazawa, Japan

Corresponding Author:

Masao Noda, MBA, MD, PhD

Department of Otolaryngology, Head and Neck Surgery

Jichi Medical University

3311-1 Yakushiji

Shimotsuke, 329-0498


Phone: 81 285442111


Background: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis.

Objective: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane.

Methods: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients’ data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images.

Results: The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trailing behind that of otolaryngologists at more than 95%. The model’s disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making.

Conclusions: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.

JMIR AI 2024;3:e58342



The emergence of artificial intelligence (AI) has altered the landscape of medical technology, particularly in diagnosis, which leverages the identification of features based on imaging and physiological data [1-3]. In the field of otolaryngology, AI and deep learning models are being used for imaging; ongoing efforts focus on classifying diseases based on tympanic membrane images of middle ear disease [4-6]. Technological advancements, including deep learning and transfer learning using pretrained models, have resulted in an accuracy range of 70%-90% in models for analyzing otoscopic images [7]. There have also been advancements in its application, such as implementing smartphone-based point-of-care diagnostics [8]. However, these models rely on trained image data, require large image data sets, and do not consider patient information or clinical context. Consequently, the universality of these models is limited, and their optimal application in clinical practice remains unclear.

Recently, large-scale language-processing models have become available for general use. One such model, GPT-4, has demonstrated specialist-level medical knowledge through its language-processing abilities [9-11]. Since October 2023, GPT-4 Vision (GPT-4V) has made it possible to evaluate image data, enabling a multimodal diagnostic approach that incorporates both language processing and image analysis [12]. GPT-4V enables the integration of patient information analysis and image-based deep learning models, providing valuable support in diagnosis and treatment, similar to decisions made in a clinical setting [13]. Multimodal AI, which bases diagnosis on multiple pieces of information, has been reported to be more effective than methods that rely on a single type of information. This has been demonstrated in various medical applications, including the combination of pathology images with genomic information [14] and the integration of imaging information in liver cancer [15] and cervical cancer [16]. In otorhinolaryngology, there have been few such reports; however, efforts to incorporate AI for otoscopic images could further improve the quality of care.

In this study, we aimed to investigate the effectiveness of a multimodal approach using GPT-4V to diagnose middle ear disease. This approach was designed to integrate patient-specific data (age, sex, and chief complaint) with tympanic membrane images to assess the accuracy of the versatile GPT-4V. The model’s accuracy was compared with physicians’ diagnoses to validate its effectiveness in image-based deep learning. The potential future development of the multimodal AI approach for classifying middle ear diseases is also discussed.

Study Design

GPT-4V has been available as an image recognition model since September 25, 2023. This study’s design was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images (Figure 1).

Figure 1. Overview of this study. The model was divided into two phases: (1) establishment and (2) tuned model validation. TM: tympanic membrane.

Collection of Otoscopic Images and Patient Information

This study included 305 otoscopic images of middle ear disease obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The endoscopes used were the Olympus ENF-VH and ENF-V3 (Olympus), and the video system was an Olympus VISERA ELITE OTV-S190. Only 1 image was obtained from each patient. We excluded images with poor quality, images in which multiple diseases were suspected, and images taken after otologic surgery. The remaining images were classified into 4 disease categories: acute otitis media (AOM), middle ear cholesteatoma (chole), chronic otitis media (COM), and otitis media with effusion (OME). The final diagnoses were based on the judgment of the otolaryngologists who treated the patients. These images were accompanied by patient-specific information, such as age, sex, and chief complaint (eg, fever, otalgia, otorrhea, ear fullness, deafness, facial palsy, dizziness, and tinnitus).

GPT-4V Settings and Prompt Tuning

The GPT-4V settings were established using prompts reported in previous studies [17,18]. Briefly, conditions and prompts for providing answers were verified using 10 images for each disease. According to a report on prompts [19], image data or patient information were manually input into GPT-4V, and the generated results were evaluated by the physicians (MN and HY).

Accuracy Verification of GPT-4V Using the Optimal Prompt Model

The model with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images (37 in AOM, 53 in chole [6 in congenital, 47 in acquired], 51 in COM, and 49 in OME), which were different from those used for tuning prompts. To account for the variability in responses, each image was submitted 3 times, and the answer given 2 or more times was taken as the final response.
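The 2-of-3 rule described above amounts to a majority vote over repeated runs. The following is a minimal sketch of that rule, not the authors' code; the function name `majority_answer` and the handling of the no-majority case (returning `None`) are assumptions, since the paper does not state what was done when all 3 answers differed.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_answer(responses: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the label given at least `min_votes` times, or None if no label qualifies.

    Returning None when no label reaches the threshold is an assumption;
    the source does not describe that case.
    """
    if not responses:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, votes = Counter(responses).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical answers from 3 repeated submissions of the same image.
runs = ["AOM", "OME", "AOM"]
final = majority_answer(runs)
print(final)
```

With 3 runs and a threshold of 2, at most one label can qualify, so the result is unambiguous whenever it is not `None`.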

Comparison of AI Accuracy With Physician Accuracy

To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images.

The web-based survey included tympanic membrane images and patient information (age, sex, and chief complaint) in a 4-choice question format. The respondents included 8 certificated pediatricians, 8 otolaryngology residents, 8 certificated otolaryngologists, and 6 experts in otolaryngology (more than 15 years of experience).

To examine how accuracy varied with question difficulty, the questions were divided into 3 levels (easy, normal, and hard) according to the overall percentage of correct responses by physicians. The percentage of correct responses at each level was then compared between GPT-4V and all doctors, otolaryngologists, and pediatricians.
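The stratification can be sketched as a simple thresholding step. The cutoffs below (above 95% for easy, below 85% for hard) are those listed in Table 1; how questions falling exactly on 85% or 95% were assigned is not stated, so placing them in the normal level is an assumption, as is the function name.

```python
def difficulty_level(physician_correct_pct: float) -> str:
    """Assign a difficulty level from the overall physician correct rate (in %).

    Cutoffs follow Table 1 (easy >95%, hard <85%). Treating exactly 85% and
    exactly 95% as "normal" is an assumption made for this sketch.
    """
    if physician_correct_pct > 95:
        return "easy"
    if physician_correct_pct >= 85:
        return "normal"
    return "hard"


# Hypothetical per-question physician correct rates (in %).
for pct in (100.0, 90.0, 60.0):
    print(pct, difficulty_level(pct))
```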

Ethical Considerations

Patient information was anonymized to protect privacy and used only with the approval of the Ethics Committee of the Shinshu University School of Medicine (6088).

Statistical Analysis

Groups were compared by 1-way ANOVA. Subsequently, multiple comparison tests (the Bonferroni method) were used to compare groups. Statistical significance was set at P<.05. A 1-sample proportion test was used to compare the performance of the physician with that of GPT-4V in terms of the correct response rate.
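As an illustration of the 1-sample proportion test, the sketch below uses the normal (z) approximation, treating the physicians' mean correct rate as the reference proportion and the GPT-4V result as the observed sample. This is an assumption: the paper does not state which variant of the test was used (for example, z approximation or exact binomial), so P values from this sketch need not match the reported ones. The example counts (66 correct of 77 easy questions, reference rate 97.8%) are derived from the percentages in Table 1.

```python
import math
from typing import Tuple


def one_sample_proportion_z_test(successes: int, n: int, p0: float) -> Tuple[float, float]:
    """Two-sided 1-sample proportion test using the normal approximation.

    successes / n is the observed proportion; p0 is the reference proportion.
    Returns (z statistic, two-sided P value).
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# GPT-4V on easy questions (about 85.7% of 77) vs the all-doctor mean of 97.8%.
z, p = one_sample_proportion_z_test(successes=66, n=77, p0=0.978)
print(f"z = {z:.2f}, P = {p:.3g}")
```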

Establishment of Optimal Prompts

In the initial stage, we sought an optimal input method using 10 images for each disease (AOM, chole, COM, or OME; 40 images total). First, we input only images; GPT-4V mostly requested clinical information, such as patient history and symptoms, and generated no response regarding the disease (Figure 1 and Multimedia Appendix 1). Second, the names of the 4 diseases were added as candidate answers, but again, no response regarding the disease was generated. When detailed patient information, such as age, sex, and main symptoms, was input, GPT-4V provided answers, indicating that inputting images together with patient data was the optimal prompt for testing the accuracy of GPT-4V.
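To make the final input format concrete, the sketch below assembles the text part of such a prompt from the three patient fields and the 4 candidate diagnoses. The wording, the `PatientInfo` class, and the `build_prompt` function are hypothetical and are not the authors' prompt (their actual prompt is shown in Multimedia Appendix 1); the image itself would be attached separately when the prompt is entered.

```python
from dataclasses import dataclass

# The 4 diagnostic categories used in the study.
DISEASE_OPTIONS = (
    "acute otitis media",
    "middle ear cholesteatoma",
    "chronic otitis media",
    "otitis media with effusion",
)


@dataclass(frozen=True)
class PatientInfo:
    """Patient-specific fields that accompanied each image in the study."""
    age: int
    sex: str
    chief_complaint: str


def build_prompt(patient: PatientInfo) -> str:
    """Build illustrative prompt text; the phrasing here is hypothetical."""
    options = "\n".join(
        f"{i}. {name}" for i, name in enumerate(DISEASE_OPTIONS, start=1)
    )
    return (
        "The attached image shows a tympanic membrane.\n"
        f"Patient: {patient.age} years old, {patient.sex}.\n"
        f"Chief complaint: {patient.chief_complaint}.\n"
        "Choose the single most likely diagnosis from these options:\n"
        f"{options}"
    )


# Hypothetical patient, for illustration only.
example = build_prompt(PatientInfo(age=5, sex="male", chief_complaint="fever and otalgia"))
print(example)
```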

Accuracy Validation of the Multimodal AI Approach

The performance of the multimodal AI approach in this study for classifying middle ear diseases was validated, with an overall diagnostic accuracy of 82.1% for the GPT-4V-based analysis. Disease-specific accuracy rates were 89.2% for AOM (true positives [TP]=33, false positives [FP]=1, false negatives [FN]=4, precision=0.97, recall=0.89, F1-score=0.93), 76.5% for COM (TP=39, FP=7, FN=12, precision=0.85, recall=0.76, F1-score=0.8), 79.3% for cholesteatoma (TP=42, FP=13, FN=11, precision=0.76, recall=0.79, F1-score=0.78), and 85.7% for OME (TP=42, FP=10, FN=7, precision=0.81, recall=0.86, F1-score=0.83; Figure 2).

These results indicate high discrimination among various disease types; however, there were also some incorrect responses. Representative images of correct and incorrect GPT-4V classifications for each disease are shown in Figure 3.
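The reported precision, recall, and F1-score values can be reproduced from the TP, FP, and FN counts given above, as the following sketch shows. Note that the disease-specific accuracy rates coincide with recall (TP divided by the number of images of that disease). The function and variable names are illustrative.

```python
from typing import Dict, Tuple

# (TP, FP, FN) per disease, as reported in the text.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "AOM": (33, 1, 4),
    "COM": (39, 7, 12),
    "chole": (42, 13, 11),
    "OME": (42, 10, 7),
}


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard per-class metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


for disease, (tp, fp, fn) in COUNTS.items():
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{disease}: precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")

# Overall accuracy: correctly classified images over all images (TP + FN per class).
total_correct = sum(tp for tp, _, _ in COUNTS.values())
total_images = sum(tp + fn for tp, _, fn in COUNTS.values())
overall_accuracy = total_correct / total_images
print(f"overall accuracy = {100 * overall_accuracy:.1f}% ({total_correct}/{total_images})")
```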

Figure 2. Confusion matrix of GPT-4V for classifying 4 middle ear diseases. AOM: acute otitis media; chole: middle ear cholesteatoma; COM: chronic otitis media; GPT-4V: GPT-4 Vision; OME: otitis media with effusion.
Figure 3. Representative images of correct and incorrect GPT-4V classifications for 4 middle ear diseases. The left side shows the correct images for GPT-4V classification, and the right side shows the incorrect images for GPT-4V. AOM: acute otitis media; chole: middle ear cholesteatoma; COM: chronic otitis media; GPT-4V: GPT-4 Vision; OME: otitis media with effusion.

Comparison of Diagnostic Accuracy by Physicians and GPT-4V

The same images with patients’ information used by GPT-4V were evaluated by pediatricians (n=8), otolaryngology residents (n=8), certificated otolaryngologists (n=8), and experts in otolaryngology (n=6), and the diagnostic accuracy of each group was compared. The mean diagnostic accuracy was 70.6% (SE 4.2%) for pediatricians, 95.5% (SE 1%) for otolaryngology residents, 97.3% (SE 0.8%) for certificated otolaryngologists, and 98.2% (SE 0.4%) for experts in otolaryngology. ANOVA revealed significant differences among the 4 groups (F1=13.43, P<.001). In the post hoc comparison, a significant difference was observed between pediatricians and the other 3 groups (P<.001). The GPT-4V correct response rate was 82.1%, surpassing that of pediatricians by 11.5 percentage points and trailing behind the 3 otolaryngologist groups by 13.4 to 16.1 percentage points (Figure 4).

Across all physicians, the accuracy rates for specific diseases were as follows: 92.3% for AOM (pediatricians 80.4%, otolaryngology residents 94.9%, certificated otolaryngologists 97%, and experts in otolaryngology 98.2%), 95.9% for COM (pediatricians 89.5%, otolaryngology residents 96.6%, certificated otolaryngologists 99.8%, and experts in otolaryngology 98.4%), 81.8% for chole (pediatricians 46%, otolaryngology residents 93.2%, certificated otolaryngologists 93.6%, and experts in otolaryngology 98.4%), and 93.7% for OME (pediatricians 81.6%, otolaryngology residents 97.2%, certificated otolaryngologists 99%, and experts in otolaryngology 98%).
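The leading figure for each disease (92.3%, 95.9%, 81.8%, and 93.7%) is consistent with a mean of the 4 group values weighted by group size (8, 8, 8, and 6 physicians). The sketch below checks this arithmetic; that the pooled values were obtained this way is an inference from the numbers, not something the paper states, and the names used are illustrative.

```python
from typing import Dict

# Number of physicians per group, as reported.
GROUP_SIZES: Dict[str, int] = {
    "pediatricians": 8,
    "residents": 8,
    "certificated": 8,
    "experts": 6,
}

# Per-group accuracy (%) for each disease, as reported.
GROUP_ACCURACY: Dict[str, Dict[str, float]] = {
    "AOM": {"pediatricians": 80.4, "residents": 94.9, "certificated": 97.0, "experts": 98.2},
    "COM": {"pediatricians": 89.5, "residents": 96.6, "certificated": 99.8, "experts": 98.4},
    "chole": {"pediatricians": 46.0, "residents": 93.2, "certificated": 93.6, "experts": 98.4},
    "OME": {"pediatricians": 81.6, "residents": 97.2, "certificated": 99.0, "experts": 98.0},
}


def pooled_accuracy(accuracy_by_group: Dict[str, float]) -> float:
    """Mean accuracy over all physicians, weighting each group by its size."""
    total_physicians = sum(GROUP_SIZES.values())
    weighted_sum = sum(GROUP_SIZES[group] * acc for group, acc in accuracy_by_group.items())
    return weighted_sum / total_physicians


for disease, by_group in GROUP_ACCURACY.items():
    print(f"{disease}: {pooled_accuracy(by_group):.1f}%")
```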

In the confusion matrix of all doctors, there was a notable tendency to misclassify chole as OME and AOM as OME. Among pediatricians, there were more errors in classifying chole as AOM or COM (Figure 5).

Figure 4. Result of human validations with doctors of TM images and patients’ data. The graph shows the average correct rate for doctors (pediatricians, otolaryngology residents, certificated otolaryngologists, and experts in otolaryngology), and the dotted line shows the correct answer rate of GPT-4V. GPT-4V: GPT-4 Vision; TM: tympanic membrane. **P value <.01.
Figure 5. Confusion matrix of doctors (pediatricians, otolaryngology residents, certificated otolaryngologists, and experts in otolaryngology) for classifying 4 middle ear diseases. (A) Confusion matrix of all doctors (N=30). The average (percentage of total responses) is shown. (B) Confusion matrix of doctors in each group: pediatricians (n=8), otolaryngology residents (n=8), certificated otolaryngologists (n=8), and experts in otolaryngology (n=6). The averages of each group (percentage of total responses) are shown. AOM: acute otitis media; chole: cholesteatoma; COM: chronic otitis media; OME: otitis media with effusion.

Regarding the trend in the percentage of correct answers by question difficulty, the percentage of correct answers for GPT-4V also tended to decrease as difficulty increased: 85.7% for easy, 84% for normal, and 71.1% for hard questions (Table 1).

Furthermore, compared with otolaryngologists, GPT-4V had a significantly lower percentage of correct answers at every difficulty level (otolaryngologists: 99.7% for easy, 97.1% for normal, and 90.8% for hard questions; all P<.001). In contrast, compared with all doctors, GPT-4V was significantly lower only for easy questions (P<.001), and the results for normal and hard questions were similar (P=.13 and P=.44, respectively). Pediatricians outperformed GPT-4V on easy questions (96.6% vs 85.7%; P=.006). However, GPT-4V had a higher percentage of correct answers than pediatricians for normal questions (pediatricians 76.3%; P=.07, not statistically significant) and hard questions (pediatricians 45.4%; P<.001).

Table 1. Comparison of the scores by GPT-4 Vision (GPT-4V) and human validation with physicians across various difficulty levels (N=190).
Difficulty level | Questions, n (%) | GPT-4V, mean % | All doctors, mean % (95% CI) | Difference | P value | Otolaryngologists, mean % (95% CI) | Difference | P value | Pediatricians, mean % (95% CI) | Difference | P value
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Easy (>95%) | 77 (40.5) | 85.7 | 97.8 (97.4-98.2) | 12.1 | <.001 (a) | 99.7 (99.5-99.9) | 14.0 | <.001 (a) | 96.6 (95.3-97.9) | 10.9 | .006 (a)
Normal (>85%, <95%) | 75 (39.5) | 84 | 90.4 (89.7-91.0) | 6.4 | .13 | 97.1 (96.2-98.0) | 13.1 | <.001 (a) | 76.3 (73.6-79.0) | –7.7 | .07
Hard (<85%) | 38 (20) | 71.1 | 76.8 (73.7-79.8) | 5.7 | .44 | 90.8 (87.2-94.3) | 19.7 | <.001 (a) | 45.4 (39.5-51.3) | –25.7 | <.001 (a)

(a) Statistically significant.
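As a consistency check on Table 1, the sketch below recomputes the Differences columns as the physician mean minus the GPT-4V mean (so a negative value means GPT-4V scored higher) and recovers the overall GPT-4V accuracy from the three difficulty strata. The sign convention is inferred from the table values, and the variable names are illustrative.

```python
from typing import Dict, Tuple

# Difficulty level -> (number of questions, GPT-4V mean % correct), from Table 1.
GPT4V_BY_LEVEL: Dict[str, Tuple[int, float]] = {
    "easy": (77, 85.7),
    "normal": (75, 84.0),
    "hard": (38, 71.1),
}

# Physician mean % correct per level, from Table 1.
PHYSICIANS_BY_LEVEL: Dict[str, Dict[str, float]] = {
    "all_doctors": {"easy": 97.8, "normal": 90.4, "hard": 76.8},
    "otolaryngologists": {"easy": 99.7, "normal": 97.1, "hard": 90.8},
    "pediatricians": {"easy": 96.6, "normal": 76.3, "hard": 45.4},
}

# Difference = physician mean - GPT-4V mean, rounded to 1 decimal place.
differences = {
    group: {
        level: round(pct - GPT4V_BY_LEVEL[level][1], 1)
        for level, pct in by_level.items()
    }
    for group, by_level in PHYSICIANS_BY_LEVEL.items()
}
print(differences)

# Overall GPT-4V accuracy as the question-count-weighted mean of the strata.
total_questions = sum(n for n, _ in GPT4V_BY_LEVEL.values())
overall_gpt4v = sum(n * pct for n, pct in GPT4V_BY_LEVEL.values()) / total_questions
print(f"overall GPT-4V accuracy = {overall_gpt4v:.1f}% over {total_questions} questions")
```

The weighted mean of the strata returns 82.1%, matching the overall accuracy reported for the 190 validation images.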

Principal Results

In this study, we assessed the accuracy of the GPT-4V multimodal AI approach in classifying middle ear disorders, yielding the following three key findings. First, GPT-4V, a general-purpose model built on a large language model, achieved approximately 80% accuracy in classifying middle ear disease. The model’s performance, evaluated using images and patient data, was superior to that of nonotolaryngologists, although it was lower than the average accuracy of otolaryngologists. Second, GPT-4V was able to classify diseases when patient information and disease options were input. Further improvements in accuracy could be achieved with more detailed patient information. Third, accuracy varied by disease, suggesting the potential for optimizing AI usage and improving accuracy by understanding the specificity of GPT-4V in classifying particular diseases.

Comparison With Prior Work

The GPT-4V model is pretrained and uses 0-shot learning, which recognizes image features based on natural language to classify diseases based on image information and previously learned disease features [20]. GPT-4V can yield effective results with fewer resources than previous deep learning models, which typically require a large amount of image data, computational resources, time, and parameter adjustments for training. By inputting new information rather than simply classifying image data, it becomes possible to tailor diagnoses and diagnostic aids for each individual. Furthermore, GPT-4V and other large-scale language processing models require prompts developed for the intended use, since the accuracy of such models varies depending on the prompt adjustments.

Compared with physicians’ accuracy, the model’s performance in this study was higher than that of a pediatrician but lower than that of an otolaryngologist. In a previous comparison between deep learning and humans, Crowson et al [21] classified 22 tympanic membrane images and found that the deep learning model achieved an accuracy of 95.5%, compared with an accuracy of 65% for 39 clinicians. Suresh et al [22] also reported that a machine-learning model created from 1000 images was more effective than pediatricians, with an accuracy rate of 90.6%, surpassing the clinicians’ accuracy of 59.4%. Our results indicated that the model did not reach the proficiency level of otolaryngologists; however, it could be valuable for using tympanic membrane images in medical practice outside of otolaryngology. In particular, GPT-4V judgments predominantly exceeded pediatricians' correct response rates for questions with normal to hard difficulty, suggesting that the present model may be useful for nonotolaryngologists who have difficulty in making such judgments.

Moreover, previous reports on deep learning classification models have determined the presence or absence of inflammation and exudates based on photographs alone. Further studies are needed to identify the optimal stage in the examination for implementing the image classification model and the subsequent policy decisions that should follow.

GPT-4V allows for the classification of diseases using patient information. While restrictions on medical advice and potentially harmful content may result in a lower correct response rate, informative or educational responses are still possible when sufficient information is provided. Efforts have been made to improve the accuracy of large language models (LLMs) through prompt development [23,24]. The accuracy of the LLM is therefore expected to improve further with the development of prompts that are specifically tailored for medical imaging and middle ear disease.

For the clinical application of the GPT-4V model, collecting clinical data and adjusting parameters are needed to further improve its diagnostic accuracy for each middle ear disease. Upon reviewing the incorrect responses of GPT-4V for each disease, we found that chole might demonstrate a retraction pocket, which may be mistaken for a perforation. However, images with keratin debris accumulation in the retraction pocket were less prone to misclassification. In cases of COM with calcification, a white lesion was considered to be chole calcification, emphasizing the importance of distinguishing between these 2 diseases. AOM cases without the chief complaint of acute inflammation (fever, ear pain, or ear discharge) were occasionally misclassified, even with characteristic findings such as a bulging tympanic membrane, suggesting that GPT-4V was likely to prioritize patients' information over images. In OME cases, a white lesion was sometimes considered to be a pearly tumor (chole) or tympanic membrane perforation (COM), particularly when it involved a small amount of effusion or air. For physicians, chole and AOM were often misidentified as other diseases and OME, respectively. When comparing the GPT-4V model with the entire group of physicians, the percentage of correct responses was generally higher among the physicians. However, the GPT-4V diagnostic accuracy for chole was higher than that of pediatricians, indicating that GPT-4V could help nonotolaryngologists diagnose chole. In a previous report, a dedicated AI model had a diagnostic accuracy of approximately 90% for chole [25]; therefore, the combination of such a system and GPT-4V would be useful to improve the accuracy of chole detection.

As demonstrated in this study, the application of AI, including LLM, is believed to offer advantages in terms of improving efficiency and providing assistance in clinical work, enabling the delivery of high-quality medical care, and overcoming language barriers in medicine. The use of GPT-4V has already been reported to diagnose complicated cases [26], and its application can be expanded by integrating it with imaging information. In the field of orthopedics, trials are underway to determine treatment methods based on MRI reports [27], showcasing the effectiveness of GPT-4V as an aid in image interpretation. GPT has been shown to return answers and provide details about the disease, including risk factors and treatment methods. This allows for the evaluation of images alone and assists in medical treatment. Such insights are valuable for understanding the practical use and challenges of AI in real-world applications. Unlike the simplistic deep learning models of the past, the LLM can enhance accuracy by presenting evidence for judgments and asking a series of questions. When used by physicians with a certain level of specialized knowledge, the LLM effectively aids judgment, leading to increased efficiency in medical care. GPT-4V provides answers in just a few seconds, which is significantly shorter than the time it takes a physician to provide a diagnosis, thereby confirming its efficiency. GPT-4V can be used on smartphones, potentially making medical treatment more location-independent.

However, there are associated risks, including the reliance on AI for medical care, misdiagnoses due to system malfunctions, and patient information leakage. ChatGPT (OpenAI, Microsoft Corporation) is trained based on information up to a certain period but may respond differently at different times or provide answers using outdated criteria. Furthermore, legal and personal literacy measures must be developed to protect personal information and address ethical concerns. Foreign countries and the United Nations are actively promoting laws and regulations governing the use of AI [28,29].


One limitation of this study is the use of a limited number of images (N=190). Further analysis is required to assess the impact of using a larger data set that encompasses various diseases. Additionally, as there are large variations in the quality of otoscopic images, accurate diagnosis might be challenging in some cases.

The recognition and content of the answers may change depending on the doctor, clinics, and designed prompt; the accuracy may also change due to changes in the image quality used or the method used to capture the image. While this is common to deep learning, the advantage of GPT, which does not require prior training, is that it is not affected by the data to be trained; thus, the possibility of such changes is considered to be small.

For these reasons, further exploration is needed on strategies for handling challenging images and facilitating open-ended responses without giving predefined options. Furthermore, because of the rapid pace of technological evolution, it is essential to regularly fine-tune the model and to develop a standalone model that ensures reliability and consistency over time.


A multimodal AI approach using GPT-4V has revealed a potential new diagnostic approach for classifying middle ear diseases. This confirms the ability of AI to assist in clinical diagnosis and identify disease-specific features. The significant improvement in accuracy compared with conventional deep learning models indicates that even general-purpose AI technology can assist in medical treatment with a certain level of accuracy. It can be applied to highly specialized diagnoses, depending on the method. Further improvements in diagnostic accuracy are expected in future studies by integrating more diverse data types.


We are thankful to our colleagues at Shinshu University: Dr Sota Ichimura, Dr Kota Hirose, Dr Mariko Kasuga, Dr Shu Yokota, Dr Kentaro Hori, Dr Arisa Oguchi, Dr Kenjiro Sugiyama, Dr Jun Shinagawa, Dr Yoichiro Iwasa, Dr Keita Tsukada, Dr Tomohiro Oguchi, and Dr Nobuyoshi Suzuki of the Department of Otolaryngology—Head and Neck, and those at Jichi Medical University: Dr Kota Matsuyama, Dr Akiko Uchida, and Dr Yuki Miura of the Department of Otolaryngology and Dr Shinya Fukuda, Dr Kazuki Okumura, and Dr Keizo Wakae of the Department of Pediatrics, and Dr Hitoshi Irabu, Dr Kazuhiro Noguchi, Dr Ryo Nakagawa, Dr Narutoshi Yamazaki, Dr Yuji Takaso, Dr Keisuke Koyama, Dr Yukari Nakamura, Dr Chia Sasaki, and Dr Keigo Nishida for their invaluable cooperation in this study. We thank Editage [30] for English language editing. The authors declare that no financial support was received for the research, authorship, or publication of this paper.

Authors' Contributions

MN handled the conceptualization, data curation, formal analysis, investigation, methodology, project administration, resources, validation, visualization, and writing of the original draft. HY worked on the conceptualization, data curation, formal analysis, investigation, methodology, project administration, resources, validation, and review and editing of the writing. TO, RK, and YU handled the investigation, project administration, and review and editing of the writing. AN, MI, and YT did the supervision and review and editing of the writing.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Representative image and prompt of this study. (A) Representative image of input and output to GPT-4 Vision. Text and images can be combined in the input to obtain an output. (B) Example of changing the prompt content and an output that asks for patient information. We attempted to develop appropriate prompts by presenting a concept as ORDER and adding conditions as restriction. The output requests patient information such as age, medical history, and chief complaint. (C) An example of an answer with an optimized prompt. The answer presents the diagnosis, the rationale for the diagnosis, and treatment and prevention methods.

PDF File (Adobe PDF File), 483 KB

  1. Kermany DS, Goldbaum M, Cai W, Valentim CCS, Liang H, Baxter SL, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122-1131.e9.
  2. Schaefferkoetter J, Yan J, Moon S, Chan R, Ortega C, Metser U, et al. Deep learning for whole-body medical image generation. Eur J Nucl Med Mol Imaging. 2021;48(12):3817-3826.
  3. Lee CC, Lin CS, Tsai CS, Tsao TP, Cheng CC, Liou JT, et al. A deep learning-based system capable of detecting pneumothorax via electrocardiogram. Eur J Trauma Emerg Surg. 2022;48(4):3317-3326.
  4. Choi Y, Chae J, Park K, Hur J, Kweon J, Ahn JH. Automated multi-class classification for prediction of tympanic membrane changes with deep learning models. PLoS One. 2022;17(10):e0275846.
  5. Park YS, Jeon JH, Kong TH, Chung TY, Seo YJ. Deep learning techniques for ear diseases based on segmentation of the normal tympanic membrane. Clin Exp Otorhinolaryngol. 2023;16(1):28-36.
  6. Alhudhaif A, Cömert Z, Polat K. Otitis media detection using tympanic membrane images with a novel multi-class machine learning algorithm. PeerJ Comput Sci. 2021;7:e405.
  7. Zeng X, Jiang Z, Luo W, Li H, Li H, Li G, et al. Efficient and accurate identification of ear diseases using an ensemble deep learning model. Sci Rep. 2021;11(1):10839.
  8. Chen YC, Chu YC, Huang CY, Lee YT, Lee WY, Hsu CY, et al. Smartphone-based artificial intelligence using a transfer learning algorithm for the detection and diagnosis of middle ear diseases: a retrospective deep learning study. EClinicalMedicine. 2022;51:101543.
  9. Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492.
  10. Jang D, Yun TR, Lee CY, Kwon YK, Kim CE. GPT-4 can pass the Korean national licensing examination for Korean medicine doctors. PLOS Digit Health. 2023;2(12):e0000416.
  11. Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ. 2023;9:e48002.
  12. GPT-4V(ision) system card. OpenAI. 2023. URL: [accessed 2023-09-25]
  13. Wu W, Yao H, Zhang M, Song Y, Ouyang W, Wang J. GPT4Vis: what can GPT-4 do for zero-shot visual recognition? arXiv. Preprint posted online on March 12 2024.
  14. Chen RJ, Lu MY, Williamson DFK, Chen TY, Lipkova J, Noor Z, et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell. 2022;40(8):865-878.e6.
  15. Zhang L, Jiang Y, Jin Z, Jiang W, Zhang B, Wang C, et al. Real-time automatic prediction of treatment response to transcatheter arterial chemoembolization in patients with hepatocellular carcinoma using deep learning based on digital subtraction angiography videos. Cancer Imaging. 2022;22(1):23.
  16. Ming Y, Dong X, Zhao J, Chen Z, Wang H, Wu N. Deep learning-based multimodal image analysis for cervical cancer detection. Methods. 2022;205:46-52.
  17. Tanaka Y, Nakata T, Aiga K, Etani T, Muramatsu R, Katagiri S, et al. Performance of generative pretrained transformer on the national medical licensing examination in Japan. PLoS Digit Health. 2024;3(1):e0000433.
  18. Masao N, Takayoshi U, Ryota K, Mari SD, Ito M, Yamoto N, et al. A study of the performance of the generative pretrained transformer in the Japanese otorhinolaryngology specialty examination. Nippon Jibiinkoka Tokeibugeka Gakkai Kaiho (Tokyo). 2023;126:1217-1223.
  19. Bharat SM, Myrzakhan A, Shen Z. Principled instructions are all you need for questioning LLaMA-1/2, GPT-3.5/4. arXiv. Preprint posted online on January 18 2024.
  20. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. arXiv. Preprint posted online on February 26 2021.
  21. Crowson MG, Bates DW, Suresh K, Cohen MS, Hartnick CJ. "Human vs Machine" validation of a deep learning algorithm for pediatric middle ear infection diagnosis. Otolaryngol Head Neck Surg. 2023;169(1):41-46.
  22. Suresh K, Wu MP, Benboujja F, Christakis B, Newton A, Hartnick CJ, et al. AI model versus clinician otoscopy in the operative setting for otitis media diagnosis. Otolaryngol Head Neck Surg. 2023.
  23. Hu Y, Chen Q, Du J, Peng X, Keloth VK, Zuo X, et al. Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc. 2024:ocad259.
  24. Meskó B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res. 2023;25:e50638.
  25. Tseng CC, Lim V, Jyung RW. Use of artificial intelligence for the diagnosis of cholesteatoma. Laryngoscope Investig Otolaryngol. 2023;8(1):201-211.
  26. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78-80.
  27. Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep. 2023;13(1):20159.
  28. Vlahou A, Hallinan D, Apweiler R, Argiles A, Beige J, Benigni A, et al. Data sharing under the general data protection regulation: time to harmonize law and research ethics? Hypertension. 2021;77(4):1029-1035.
  29. Fournier-Tombs E, McHardy J. A medical ethics framework for conversational artificial intelligence. J Med Internet Res. 2023;25:e43068.
  30. Editage. URL: [accessed 2024-05-11]

AI: artificial intelligence
AOM: acute otitis media
chole: middle ear cholesteatoma
COM: chronic otitis media
FN: false negative
FP: false positive
GPT-4V: GPT-4 Vision
LLM: large language model
OME: otitis media with effusion
TP: true positive

Edited by Y Huo; submitted 13.03.24; peer-reviewed by S Murono, B Li, J Jagtap; comments to author 10.04.24; revised version received 23.04.24; accepted 07.05.24; published 31.05.24.


©Masao Noda, Hidekane Yoshimura, Takuya Okubo, Ryota Koshu, Yuki Uchiyama, Akihiro Nomura, Makoto Ito, Yutaka Takumi. Originally published in JMIR AI, 31.05.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.