TY - JOUR AU - Yan, Chao AU - Zhang, Ziqi AU - Nyemba, Steve AU - Li, Zhuohang PY - 2024/4/22 TI - Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial JO - JMIR AI SP - e52615 VL - 3 KW - synthetic data generation KW - electronic health record KW - generative neural networks KW - tutorial UR - https://ai.jmir.org/2024/1/e52615 UR - http://dx.doi.org/10.2196/52615 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875595 ID - info:doi/10.2196/52615 ER - TY - JOUR AU - Quttainah, Majdi AU - Mishra, Vinaytosh AU - Madakam, Somayya AU - Lurie, Yotam AU - Mark, Shlomo PY - 2024/4/23 TI - Cost, Usability, Credibility, Fairness, Accountability, Transparency, and Explainability Framework for Safe and Effective Large Language Models in Medical Education: Narrative Review and Qualitative Study JO - JMIR AI SP - e51834 VL - 3 KW - large language model KW - LLM KW - ChatGPT KW - CUC-FATE framework KW - cost, usability, credibility, fairness, accountability, transparency, and explainability KW - analytical hierarchy process KW - AHP KW - total interpretive structural modeling KW - TISM KW - medical education KW - adoption KW - guideline KW - development KW - health care KW - chat generative pretrained transformer KW - generative language model tool KW - user KW - innovation KW - data generation KW - narrative review KW - health care professional N2 - Background: The world has witnessed increased adoption of large language models (LLMs) in the last year. Although the products developed using LLMs have the potential to solve accessibility and efficiency problems in health care, there is a lack of available guidelines for developing LLMs for health care, especially for medical education. Objective: The aim of this study was to identify and prioritize the enablers for developing successful LLMs for medical education. We further evaluated the relationships among these identified enablers. Methods: A narrative review of the extant literature was first performed to identify the key enablers for LLM development. We additionally gathered the opinions of LLM users to determine the relative importance of these enablers using an analytical hierarchy process (AHP), which is a multicriteria decision-making method. Further, total interpretive structural modeling (TISM) was used to analyze the perspectives of product developers and ascertain the relationships and hierarchy among these enablers. Finally, the cross-impact matrix-based multiplication applied to a classification (MICMAC) approach was used to determine the relative driving and dependence powers of these enablers. A nonprobabilistic purposive sampling approach was used for recruitment of focus groups. Results: The AHP demonstrated that the most important enabler for LLMs was credibility, with a priority weight of 0.37, followed by accountability (0.27642) and fairness (0.10572). In contrast, usability, with a priority weight of 0.04, showed negligible importance. The results of TISM concurred with the findings of the AHP. The only striking difference between expert perspectives and user preference evaluation was that the product developers indicated that cost has the least importance as a potential enabler. The MICMAC analysis suggested that cost has a strong influence on other enablers. The inputs of the focus group were found to be reliable, with a consistency ratio less than 0.1 (0.084). 
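[Editor's note] For readers unfamiliar with the analytical hierarchy process (AHP) used in the study above, the following minimal Python sketch shows how AHP priority weights and the consistency ratio are conventionally derived from a pairwise comparison matrix via its principal eigenvector. The matrix values here are hypothetical illustrations, not the study's actual expert judgments.

import numpy as np

# Hypothetical pairwise comparison matrix for 4 enablers
# (credibility, accountability, fairness, usability); values are
# illustrative, not the study's actual focus-group judgments.
A = np.array([
    [1.0, 2.0, 4.0, 7.0],
    [1/2, 1.0, 3.0, 6.0],
    [1/4, 1/3, 1.0, 3.0],
    [1/7, 1/6, 1/3, 1.0],
])

n = A.shape[0]
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)              # index of the principal eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                 # normalized AHP priority weights

lambda_max = eigvals[k].real
ci = (lambda_max - n) / (n - 1)          # consistency index
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]      # Saaty's random index for n criteria
cr = ci / ri                             # consistency ratio; <0.1 is acceptable

print("priority weights:", np.round(weights, 3))
print("consistency ratio:", round(cr, 3))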
Conclusions: This study is the first to identify, prioritize, and analyze the relationships of enablers of effective LLMs for medical education. Based on the results of this study, we developed a comprehensible prescriptive framework, named CUC-FATE (Cost, Usability, Credibility, Fairness, Accountability, Transparency, and Explainability), for evaluating the enablers of LLMs in medical education. The study findings are useful for health care professionals, health technology experts, medical technology regulators, and policy makers. UR - https://ai.jmir.org/2024/1/e51834 UR - http://dx.doi.org/10.2196/51834 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875562 ID - info:doi/10.2196/51834 ER - TY - JOUR AU - Lange, Martin AU - Löwe, Alexandra AU - Kayser, Ina AU - Schaller, Andrea PY - 2024/8/20 TI - Approaches for the Use of AI in Workplace Health Promotion and Prevention: Systematic Scoping Review JO - JMIR AI SP - e53506 VL - 3 KW - artificial intelligence KW - AI KW - machine learning KW - deep learning KW - workplace health promotion KW - prevention KW - workplace health promotion and prevention KW - technology KW - technologies KW - well-being KW - behavioral health KW - workplace-related KW - public health KW - biomedicine KW - PRISMA-ScR KW - Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews KW - WHPP KW - risk KW - AI-algorithm KW - control group KW - accuracy KW - health-related KW - prototype KW - systematic review KW - scoping review KW - reviews KW - mobile phone N2 - Background: Artificial intelligence (AI) is an umbrella term for various algorithms and rapidly emerging technologies with huge potential for workplace health promotion and prevention (WHPP). WHPP interventions aim to improve people's health and well-being through behavioral and organizational measures or by minimizing the burden of workplace-related diseases and associated risk factors. While AI has been the focus of research in other health-related fields, such as public health or biomedicine, the transition of AI into WHPP research has yet to be systematically investigated. Objective: This systematic scoping review aims to provide a comprehensive overview of the current use of AI in WHPP. The results will then be used to point to future research directions. The following research questions were derived: (1) What are the study characteristics of studies on AI algorithms and technologies in the context of WHPP? (2) What specific WHPP fields (prevention, behavioral, and organizational approaches) were addressed by the AI algorithms and technologies? (3) What kind of interventions lead to which outcomes? Methods: A systematic scoping literature review (PRISMA-ScR [Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews]) was conducted in the 3 academic databases PubMed, Institute of Electrical and Electronics Engineers, and Association for Computing Machinery in July 2023, searching for papers published between January 2000 and December 2023. Studies needed to be (1) peer-reviewed, (2) written in English, and (3) focused on an AI-based algorithm or technology that (4) was applied in the context of WHPP or (5) an associated field. Information on study design, AI algorithms and technologies, WHPP fields, and the patient or population, intervention, comparison, and outcomes framework were extracted blindly with Rayyan and summarized. Results: A total of 10 studies were included.
Risk prevention and modeling were the most identified WHPP fields (n=6), followed by behavioral health promotion (n=4) and organizational health promotion (n=1). Further, 4 studies focused on mental health. Most AI algorithms were machine learning-based, and 3 studies used combined deep learning algorithms. AI algorithms and technologies were primarily implemented in smartphone apps (eg, in the form of a chatbot) or used the smartphone as a data source (eg, Global Positioning System). Behavioral interventions lasted 8 to 12 weeks and were compared with control groups. Additionally, 3 studies evaluated the robustness and accuracy of an AI model or framework. Conclusions: Although AI has attracted increasing attention in health-related research, this review reveals that AI in WHPP has been only marginally investigated. Our results indicate that AI is promising for individualization and risk prediction in WHPP, but current research does not cover the full scope of WHPP. Beyond that, future research will profit from a broader range of studies across all fields of WHPP, longitudinal data, and adherence to reporting guidelines. Trial Registration: OSF Registries osf.io/bfswp; https://osf.io/bfswp UR - https://ai.jmir.org/2024/1/e53506 UR - http://dx.doi.org/10.2196/53506 UR - http://www.ncbi.nlm.nih.gov/pubmed/38989904 ID - info:doi/10.2196/53506 ER - TY - JOUR AU - Ojha, Tanvi AU - Patel, Atushi AU - Sivapragasam, Krishihan AU - Sharma, Radha AU - Vosoughi, Tina AU - Skidmore, Becky AU - Pinto, D. Andrew AU - Hosseini, Banafshe PY - 2024/8/27 TI - Exploring Machine Learning Applications in Pediatric Asthma Management: Scoping Review JO - JMIR AI SP - e57983 VL - 3 KW - pediatric asthma KW - machine learning KW - predictive modeling KW - asthma management KW - exacerbation KW - artificial intelligence N2 - Background: The integration of machine learning (ML) in predicting asthma-related outcomes in children presents a novel approach in pediatric health care. Objective: This scoping review aims to analyze studies published since 2019, focusing on ML algorithms, their applications, and predictive performances. Methods: We searched Ovid MEDLINE ALL and Embase on Ovid, the Cochrane Library (Wiley), CINAHL (EBSCO), and Web of Science (core collection). The search covered the period from January 1, 2019, to July 18, 2023. Studies applying ML models in predicting asthma-related outcomes in children aged <18 years were included. Covidence was used for citation management, and the risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool. Results: From 1231 initial articles, 15 met our inclusion criteria. The sample size ranged from 74 to 87,413 patients. Most studies used multiple ML techniques, with logistic regression (n=7, 47%) and random forests (n=6, 40%) being the most common. Key outcomes included predicting asthma exacerbations, classifying asthma phenotypes, predicting asthma diagnoses, and identifying potential risk factors. For predicting exacerbations, recurrent neural networks and XGBoost showed high performance, with XGBoost achieving an area under the receiver operating characteristic curve (AUROC) of 0.76. In classifying asthma phenotypes, support vector machines were highly effective, achieving an AUROC of 0.79. For diagnosis prediction, artificial neural networks outperformed logistic regression, with an AUROC of 0.63. For identifying risk factors related to symptom severity and lung function, random forests achieved an AUROC of 0.88.
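[Editor's note] As context for the AUROC figures reported above, the sketch below shows the standard way such scores are computed with scikit-learn. The data are a synthetic stand-in, not the reviewed studies' clinical variables.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tabular pediatric asthma features; the real
# studies used variables such as symptom severity and lung function.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # predicted risk of, eg, exacerbation
print("AUROC:", round(roc_auc_score(y_te, proba), 2))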
Sound-based studies distinguished wheezing from nonwheezing sounds and asthmatic from normal coughs. The risk of bias assessment revealed that most studies (n=8, 53%) exhibited low to moderate risk, ensuring a reasonable level of confidence in the findings. Common limitations across studies included data quality issues, sample size constraints, and interpretability concerns. Conclusions: This review highlights the diverse application of ML in predicting pediatric asthma outcomes, with each model offering unique strengths and challenges. Future research should address data quality, increase sample sizes, and enhance model interpretability to optimize ML utility in clinical settings for pediatric asthma management. UR - https://ai.jmir.org/2024/1/e57983 UR - http://dx.doi.org/10.2196/57983 UR - http://www.ncbi.nlm.nih.gov/pubmed/39190449 ID - info:doi/10.2196/57983 ER - TY - JOUR AU - Rosenbacke, Rikard AU - Melhus, Åsa AU - McKee, Martin AU - Stuckler, David PY - 2024/10/30 TI - How Explainable Artificial Intelligence Can Increase or Decrease Clinicians' Trust in AI Applications in Health Care: Systematic Review JO - JMIR AI SP - e53207 VL - 3 KW - explainable artificial intelligence KW - XAI KW - trustworthy AI KW - clinician trust KW - affect-based measures KW - cognitive measures KW - clinical use KW - clinical decision-making KW - clinical informatics N2 - Background: Artificial intelligence (AI) has significant potential in clinical practice. However, its "black box" nature can lead clinicians to question its value. The challenge is to create sufficient trust for clinicians to feel comfortable using AI, but not so much that they defer to it even when it produces results that conflict with their clinical judgment in ways that lead to incorrect decisions. Explainable AI (XAI) aims to address this by providing explanations of how AI algorithms reach their conclusions. However, it remains unclear whether such explanations foster an appropriate degree of trust to ensure the optimal use of AI in clinical practice. Objective: This study aims to systematically review and synthesize empirical evidence on the impact of XAI on clinicians' trust in AI-driven clinical decision-making. Methods: A systematic review was conducted in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searching PubMed and Web of Science databases. Studies were included if they empirically measured the impact of XAI on clinicians' trust using cognition- or affect-based measures. Out of 778 articles screened, 10 met the inclusion criteria. We assessed the risk of bias using standard tools appropriate to the methodology of each paper. Results: The risk of bias in all papers was moderate or moderate to high. All included studies operationalized trust primarily through cognitive-based definitions, with 2 also incorporating affect-based measures. Out of these, 5 studies reported that XAI increased clinicians' trust compared with standard AI, particularly when the explanations were clear, concise, and relevant to clinical practice. In addition, 3 studies found no significant effect of XAI on trust, indicating that the presence of explanations does not automatically improve it. Notably, 2 studies highlighted that XAI could either enhance or diminish trust, depending on the complexity and coherence of the provided explanations. The majority of studies suggest that XAI has the potential to enhance clinicians' trust in recommendations generated by AI.
However, complex or contradictory explanations can undermine this trust. More critically, trust in AI is not inherently beneficial, as AI recommendations are not infallible. These findings underscore the nuanced role of explanation quality and suggest that trust can be modulated through the careful design of XAI systems. Conclusions: Excessive trust in incorrect advice generated by AI can adversely impact clinical accuracy, just as can happen when correct advice is distrusted. Future research should focus on refining both cognitive and affect-based measures of trust and on developing strategies to achieve an appropriate balance in terms of trust, preventing both blind trust and undue skepticism. Optimizing trust in AI systems is essential for their effective integration into clinical practice. UR - https://ai.jmir.org/2024/1/e53207 UR - http://dx.doi.org/10.2196/53207 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/53207 ER - TY - JOUR AU - Oliveira, Almeida Juliana AU - Eskandar, Karine AU - Kar, Emre AU - de Oliveira, Ribeiro Flávia AU - Filho, Silva Agnaldo Lopes da PY - 2024/10/30 TI - Understanding AI's Role in Endometriosis Patient Education and Evaluating Its Information and Accuracy: Systematic Review JO - JMIR AI SP - e64593 VL - 3 KW - endometriosis KW - gynecology KW - machine learning KW - artificial intelligence KW - large language models KW - natural language processing KW - patient-generated health data KW - health knowledge KW - information seeking KW - patient education N2 - Background: Endometriosis is a chronic gynecological condition that affects a significant portion of women of reproductive age, leading to debilitating symptoms such as chronic pelvic pain and infertility. Despite advancements in diagnosis and management, patient education remains a critical challenge. With the rapid growth of digital platforms, artificial intelligence (AI) has emerged as a potential tool to enhance patient education and access to information. Objective: This systematic review aims to explore the role of AI in facilitating education and improving information accessibility for individuals with endometriosis. Methods: This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to ensure rigorous and transparent reporting. We conducted a comprehensive search of PubMed; Embase; the Regional Online Information System for Scientific Journals of Latin America, the Caribbean, Spain and Portugal (LATINDEX); Latin American and Caribbean Literature in Health Sciences (LILACS); Institute of Electrical and Electronics Engineers (IEEE) Xplore, and the Cochrane Central Register of Controlled Trials using the terms "endometriosis" and "artificial intelligence." Studies were selected based on their focus on AI applications in patient education or information dissemination regarding endometriosis. We included studies that evaluated AI-driven tools for assessing patient knowledge and addressed frequently asked questions related to endometriosis. Data extraction and quality assessment were conducted independently by 2 authors, with discrepancies resolved through consensus. Results: Out of 400 initial search results, 11 studies met the inclusion criteria and were fully reviewed. We ultimately included 3 studies, 1 of which was an abstract.
The studies examined the use of AI models, such as ChatGPT (OpenAI), machine learning, and natural language processing, in providing educational resources and answering common questions about endometriosis. The findings indicated that AI tools, particularly large language models, offer accurate responses to frequently asked questions with varying degrees of sufficiency across different categories. AI's integration with social media platforms also highlights its potential to identify patients' needs and enhance information dissemination. Conclusions: AI holds promise in advancing patient education and information access for endometriosis, providing accurate and comprehensive answers to common queries, and facilitating a better understanding of the condition. However, challenges remain in ensuring ethical use, equitable access, and maintaining accuracy across diverse patient populations. Future research should focus on developing standardized approaches for evaluating AI's impact on patient education and exploring its integration into clinical practice to enhance support for individuals with endometriosis. UR - https://ai.jmir.org/2024/1/e64593 UR - http://dx.doi.org/10.2196/64593 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/64593 ER - TY - JOUR AU - Han, Yu AU - Ceross, Aaron AU - Bergmann, Jeroen PY - 2024/7/29 TI - Regulatory Frameworks for AI-Enabled Medical Device Software in China: Comparative Analysis and Review of Implications for Global Manufacturer JO - JMIR AI SP - e46871 VL - 3 KW - NMPA KW - medical device software KW - device registration KW - registration pathway KW - artificial intelligence KW - machine learning KW - medical device KW - device development KW - China KW - regulations KW - medical software UR - https://ai.jmir.org/2024/1/e46871 UR - http://dx.doi.org/10.2196/46871 UR - http://www.ncbi.nlm.nih.gov/pubmed/39073860 ID - info:doi/10.2196/46871 ER - TY - JOUR AU - Bragazzi, Luigi Nicola AU - Garbarino, Sergio PY - 2024/6/7 TI - Toward Clinical Generative AI: Conceptual Framework JO - JMIR AI SP - e55957 VL - 3 KW - clinical intelligence KW - artificial intelligence KW - iterative process KW - abduction KW - benchmarking KW - verification paradigms UR - https://ai.jmir.org/2024/1/e55957 UR - http://dx.doi.org/10.2196/55957 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875592 ID - info:doi/10.2196/55957 ER - TY - JOUR AU - Seth, Puneet AU - Carretas, Romina AU - Rudzicz, Frank PY - 2024/10/4 TI - The Utility and Implications of Ambient Scribes in Primary Care JO - JMIR AI SP - e57673 VL - 3 KW - artificial intelligence KW - AI KW - large language model KW - LLM KW - digital scribe KW - ambient scribe KW - organizational efficiency KW - electronic health record KW - documentation burden KW - administrative burden UR - https://ai.jmir.org/2024/1/e57673 UR - http://dx.doi.org/10.2196/57673 UR - http://www.ncbi.nlm.nih.gov/pubmed/39365655 ID - info:doi/10.2196/57673 ER - TY - JOUR AU - Germani, Federico AU - Spitale, Giovanni AU - Biller-Andorno, Nikola PY - 2024/10/15 TI - The Dual Nature of AI in Information Dissemination: Ethical Considerations JO - JMIR AI SP - e53505 VL - 3 KW - AI KW - bioethics KW - infodemic management KW - disinformation KW - artificial intelligence KW - ethics KW - ethical KW - infodemic KW - infodemics KW - public health KW - misinformation KW - information dissemination KW - information literacy UR - https://ai.jmir.org/2024/1/e53505 UR - http://dx.doi.org/10.2196/53505 UR - http://www.ncbi.nlm.nih.gov/pubmed/39405099 ID - info:doi/10.2196/53505 ER -
TY - JOUR AU - Gupta, Vikash AU - Erdal, Barbaros AU - Ramirez, Carolina AU - Floca, Ralf AU - Genereaux, Bradley AU - Bryson, Sidney AU - Bridge, Christopher AU - Kleesiek, Jens AU - Nensa, Felix AU - Braren, Rickmer AU - Younis, Khaled AU - Penzkofer, Tobias AU - Bucher, Michael Andreas AU - Qin, Melvin Ming AU - Bae, Gigon AU - Lee, Hyeonhoon AU - Cardoso, Jorge M. AU - Ourselin, Sebastien AU - Kerfoot, Eric AU - Choudhury, Rahul AU - White, D. Richard AU - Cook, Tessa AU - Bericat, David AU - Lungren, Matthew AU - Haukioja, Risto AU - Shuaib, Haris PY - 2024/12/9 TI - Current State of Community-Driven Radiological AI Deployment in Medical Imaging JO - JMIR AI SP - e55833 VL - 3 KW - radiology KW - open-source KW - radiology in practice KW - deep learning KW - artificial intelligence KW - imaging informatics KW - clinical deployment KW - imaging KW - medical informatics KW - workflow KW - operation KW - implementation KW - adoption KW - taxonomy KW - use case KW - model KW - integration KW - machine learning KW - mobile phone UR - https://ai.jmir.org/2024/1/e55833 UR - http://dx.doi.org/10.2196/55833 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/55833 ER - TY - JOUR AU - Späth, Julian AU - Sewald, Zeno AU - Probul, Niklas AU - Berland, Magali AU - Almeida, Mathieu AU - Pons, Nicolas AU - Le Chatelier, Emmanuelle AU - Ginès, Pere AU - Solé, Cristina AU - Juanola, Adrià AU - Pauling, Josch AU - Baumbach, Jan PY - 2024/3/29 TI - Privacy-Preserving Federated Survival Support Vector Machines for Cross-Institutional Time-To-Event Analysis: Algorithm Development and Validation JO - JMIR AI SP - e47652 VL - 3 KW - federated learning KW - survival analysis KW - support vector machine KW - machine learning KW - federated KW - algorithm KW - survival KW - FeatureCloud KW - predict KW - predictive KW - prediction KW - predictions KW - implementation science KW - implementation KW - centralized model KW - privacy regulation N2 - Background: Central collection of distributed medical patient data is problematic due to strict privacy regulations. Especially in clinical environments, such as clinical time-to-event studies, large sample sizes are critical but usually not available at a single institution. It has been shown recently that federated learning, combined with privacy-enhancing technologies, is an excellent and privacy-preserving alternative to data sharing. Objective: This study aims to develop and validate a privacy-preserving, federated survival support vector machine (SVM) and make it accessible for researchers to perform cross-institutional time-to-event analyses. Methods: We extended the survival SVM algorithm to be applicable in federated environments. We further implemented it as a FeatureCloud app, enabling it to run in the federated infrastructure provided by the FeatureCloud platform. Finally, we evaluated our algorithm on 3 benchmark data sets, a synthetic data set with a large sample size, and a real-world microbiome data set and compared the results to the corresponding central method. Results: Our federated survival SVM produces highly similar results to the centralized model on all data sets. The maximal difference between the model weights of the central model and the federated model was only 0.001, and the mean difference over all data sets was 0.0002. We further show that by including more data in the analysis through federated learning, predictions are more accurate even in the presence of site-dependent batch effects.
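[Editor's note] The federated survival SVM itself is implemented as a FeatureCloud app; the short sketch below only illustrates the general idea behind the reported comparison of federated and central model weights, using plain linear SVMs and naive weight averaging on synthetic data rather than the authors' actual survival algorithm or the FeatureCloud protocol.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# One shared data set, split across 3 hypothetical "institutions".
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
sites = np.array_split(np.arange(len(y)), 3)

# Local training: each site fits its own linear SVM and shares only weights.
local_weights = []
for idx in sites:
    clf = LinearSVC(dual=False).fit(X[idx], y[idx])
    local_weights.append(np.hstack([clf.coef_.ravel(), clf.intercept_]))

# Naive federated aggregation: average the local weight vectors.
federated = np.mean(local_weights, axis=0)

# Central reference model trained on the pooled data.
central_clf = LinearSVC(dual=False).fit(X, y)
central = np.hstack([central_clf.coef_.ravel(), central_clf.intercept_])

print("max weight difference:", np.abs(federated - central).max())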
Conclusions: The federated survival SVM extends the palette of federated time-to-event analysis methods with a robust machine learning approach. To our knowledge, the implemented FeatureCloud app is the first publicly available implementation of a federated survival SVM; it is freely accessible to all researchers and can be used directly within the FeatureCloud platform. UR - https://ai.jmir.org/2024/1/e47652 UR - http://dx.doi.org/10.2196/47652 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875678 ID - info:doi/10.2196/47652 ER - TY - JOUR AU - Tan, Kuan Joshua AU - Quan, Le AU - Salim, Mohamed Nur Nasyitah AU - Tan, Hong Jen AU - Goh, Su-Yen AU - Thumboo, Julian AU - Bee, Mong Yong PY - 2024/10/17 TI - Machine Learning-Based Prediction for High Health Care Utilizers by Using a Multi-Institutional Diabetes Registry: Model Training and Evaluation JO - JMIR AI SP - e58463 VL - 3 KW - diabetes mellitus KW - type 2 diabetes KW - health care utilization KW - population health management KW - population health KW - machine learning KW - artificial intelligence KW - predictive model KW - predictive system KW - practical model N2 - Background: The cost of health care in many countries is increasing rapidly. There is a growing interest in using machine learning for predicting high health care utilizers for population health initiatives. Previous studies have focused on individuals who contribute to the highest financial burden. However, this group is small and represents a limited opportunity for long-term cost reduction. Objective: We developed a collection of models that predict future health care utilization at various thresholds. Methods: We utilized data from a multi-institutional diabetes database from the year 2019 to develop binary classification models. These models predict health care utilization in the subsequent year across 6 different outcomes: patients having a length of stay of ≥7, ≥14, and ≥30 days and emergency department attendance of ≥3, ≥5, and ≥10 visits. To address class imbalance, random and synthetic minority oversampling techniques were employed. The models were then applied to unseen data from 2020 and 2021 to predict health care utilization in the following year. A portfolio of performance metrics, with priority on area under the receiver operating characteristic curve, sensitivity, and positive predictive value, was used for comparison. Explainability analyses were conducted on the best performing models. Results: When trained with random oversampling, 4 models, that is, logistic regression, multivariate adaptive regression splines, boosted trees, and multilayer perceptron, consistently achieved high area under the receiver operating characteristic curve (>0.80) and sensitivity (>0.60) across training-validation and test data sets. Correcting for class imbalance proved critical for model performance. Important predictors for all outcomes included age, number of emergency department visits in the present year, chronic kidney disease stage, inpatient bed days in the present year, and mean hemoglobin A1c levels. Explainability analyses using partial dependence plots demonstrated that for the best performing models, the learned patterns were consistent with real-world knowledge, thereby supporting the validity of the models. Conclusions: We successfully developed machine learning models capable of predicting high service level utilization with strong performance and valid explainability.
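[Editor's note] A minimal sketch of the class-imbalance correction described in the diabetes registry study above, assuming the imbalanced-learn package is available; the data and the choice of logistic regression as the downstream model are illustrative only.

from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in for a rare outcome such as "length of stay >= 30 days".
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("random", RandomOverSampler(random_state=0)),
                      ("smote", SMOTE(random_state=0))]:
    # Oversample the training data only; the test set keeps its true prevalence.
    X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    print(name,
          "AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3),
          "sensitivity:", round(recall_score(y_te, clf.predict(X_te)), 3))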
Such models can be integrated into wider diabetes-related population health initiatives. UR - https://ai.jmir.org/2024/1/e58463 UR - http://dx.doi.org/10.2196/58463 UR - http://www.ncbi.nlm.nih.gov/pubmed/39418089 ID - info:doi/10.2196/58463 ER - TY - JOUR AU - Hodson, Nathan AU - Williamson, Simon PY - 2024/7/30 TI - Can Large Language Models Replace Therapists? Evaluating Performance at Simple Cognitive Behavioral Therapy Tasks JO - JMIR AI SP - e52500 VL - 3 KW - mental health KW - psychotherapy KW - digital therapy KW - CBT KW - ChatGPT KW - cognitive behavioral therapy KW - cognitive behavioural therapy KW - LLM KW - LLMs KW - language model KW - language models KW - NLP KW - natural language processing KW - artificial intelligence KW - performance KW - chatbot KW - chatbots KW - conversational agent KW - conversational agents UR - https://ai.jmir.org/2024/1/e52500 UR - http://dx.doi.org/10.2196/52500 UR - http://www.ncbi.nlm.nih.gov/pubmed/39078696 ID - info:doi/10.2196/52500 ER - TY - JOUR AU - Rollwage, Max AU - Habicht, Johanna AU - Juechems, Keno AU - Carrington, Ben AU - Viswanathan, Sruthi AU - Stylianou, Mona AU - Hauser, U. Tobias AU - Harper, Ross PY - 2024/3/12 TI - Correction: Using Conversational AI to Facilitate Mental Health Assessments and Improve Clinical Efficiency Within Psychotherapy Services: Real-World Observational Study JO - JMIR AI SP - e57869 VL - 3 UR - https://ai.jmir.org/2024/1/e57869 UR - http://dx.doi.org/10.2196/57869 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/57869 ER - TY - JOUR AU - Noda, Masao AU - Yoshimura, Hidekane AU - Okubo, Takuya AU - Koshu, Ryota AU - Uchiyama, Yuki AU - Nomura, Akihiro AU - Ito, Makoto AU - Takumi, Yutaka PY - 2024/7/9 TI - Correction: Feasibility of Multimodal Artificial Intelligence Using GPT-4 Vision for the Classification of Middle Ear Disease: Qualitative Study and Validation JO - JMIR AI SP - e62990 VL - 3 UR - https://ai.jmir.org/2024/1/e62990 UR - http://dx.doi.org/10.2196/62990 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/62990 ER - TY - JOUR AU - Kamruzzaman, Methun AU - Heavey, Jack AU - Song, Alexander AU - Bielskas, Matthew AU - Bhattacharya, Parantapa AU - Madden, Gregory AU - Klein, Eili AU - Deng, Xinwei AU - Vullikanti, Anil PY - 2024/5/16 TI - Improving Risk Prediction of Methicillin-Resistant Staphylococcus aureus Using Machine Learning Methods With Network Features: Retrospective Development Study JO - JMIR AI SP - e48067 VL - 3 KW - methicillin-resistant Staphylococcus aureus KW - network KW - machine learning KW - penalized logistic regression KW - ensemble learning KW - gradient-boosted classifier KW - random forest classifier KW - extreme gradient boosted classifier KW - Shapley Additive Explanations KW - SHAP KW - health care-associated infection KW - HAI N2 - Background: Health care-associated infections due to multidrug-resistant organisms (MDROs), such as methicillin-resistant Staphylococcus aureus (MRSA) and Clostridioides difficile (CDI), place a significant burden on our health care infrastructure. Objective: Screening for MDROs is an important mechanism for preventing spread but is resource intensive. The objective of this study was to develop automated tools that can predict colonization or infection risk using electronic health record (EHR) data, provide useful information to aid infection control, and guide empiric antibiotic coverage.
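[Editor's note] The MRSA study above derives predictive features from patient-provider contact networks in EHR data (detailed in the Methods that follow). A toy sketch of how such network features might be computed with networkx; the contact records and identifiers are entirely hypothetical.

import networkx as nx

# Toy patient-provider contacts derived from EHR events; edges connect a
# patient to each provider they encountered (hypothetical identifiers).
contacts = [("pt1", "dr_a"), ("pt1", "dr_b"), ("pt2", "dr_b"),
            ("pt2", "nurse_c"), ("pt3", "dr_a"), ("pt3", "nurse_c")]
G = nx.Graph(contacts)

# Simple per-patient network features: degree (number of distinct providers)
# and exposure to other patients via shared providers (2-hop neighbors).
for pt in ["pt1", "pt2", "pt3"]:
    two_hop = {n for prov in G[pt] for n in G[prov]} - {pt}
    print(pt, "degree:", G.degree(pt),
          "patients sharing providers:", sorted(two_hop))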
Methods: We retrospectively developed a machine learning model to detect MRSA colonization and infection in undifferentiated patients at the time of sample collection from hospitalized patients at the University of Virginia Hospital. We used clinical and nonclinical features derived from on-admission and throughout-stay information from the patient's EHR data to build the model. In addition, we used a class of features derived from contact networks in EHR data; these network features can capture patients' contacts with providers and other patients, improving model interpretability and accuracy for predicting the outcome of surveillance tests for MRSA. Finally, we explored heterogeneous models for different patient subpopulations, for example, those admitted to an intensive care unit or emergency department or those with specific testing histories, which performed better. Results: We found that the penalized logistic regression performs better than other methods, and this model's performance measured in terms of its receiver operating characteristic-area under the curve score improves by nearly 11% when we use polynomial (second-degree) transformation of the features. Some significant features in predicting MDRO risk include antibiotic use, surgery, use of devices, dialysis, patients' comorbid conditions, and network features. Among these, network features add the most value and improve the model's performance by at least 15%. The penalized logistic regression model with the same transformation of features also performs better than other models for specific patient subpopulations. Conclusions: Our study shows that MRSA risk prediction can be conducted quite effectively by machine learning methods using clinical and nonclinical features derived from EHR data. Network features are the most predictive and provide significant improvement over prior methods. Furthermore, heterogeneous prediction models for different patient subpopulations enhance the model's performance. UR - https://ai.jmir.org/2024/1/e48067 UR - http://dx.doi.org/10.2196/48067 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875598 ID - info:doi/10.2196/48067 ER - TY - JOUR AU - Karpathakis, Kassandra AU - Pencheon, Emma AU - Cushnan, Dominic PY - 2024/1/4 TI - Learning From International Comparators of National Medical Imaging Initiatives for AI Development: Multiphase Qualitative Study JO - JMIR AI SP - e51168 VL - 3 KW - digital health KW - mobile health KW - mHealth KW - medical imaging KW - artificial intelligence KW - health policy N2 - Background: The COVID-19 pandemic drove investment and research into medical imaging platforms to provide data to create artificial intelligence (AI) algorithms for the management of patients with COVID-19. Building on the success of England's National COVID-19 Chest Imaging Database, the national digital policy body (NHSX) sought to create a generalized national medical imaging platform for the development, validation, and deployment of algorithms. Objective: This study aims to understand international use cases of medical imaging platforms for the development and implementation of algorithms to inform the creation of England's national imaging platform.
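[Editor's note] Returning to the MRSA study described earlier: its key modeling result is that penalized logistic regression on a degree-2 polynomial transformation of the features outperformed other methods. A hedged sketch of that pattern on synthetic data; the penalty type and strength are illustrative assumptions, not the study's actual settings.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for EHR-derived clinical and network features.
X, y = make_classification(n_samples=4000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Degree-2 polynomial expansion followed by an L1-penalized logistic model.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X_tr, y_tr)
print("AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))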
Methods: The National Health Service (NHS) AI Lab Policy and Strategy Team adopted a multiphased approach: (1) identification and prioritization of national AI imaging platforms; (2) Political, Economic, Social, Technological, Legal, and Environmental (PESTLE) factor analysis deep dive into national AI imaging platforms; (3) semistructured interviews with key stakeholders; (4) workshop on emerging themes and insights with the internal NHSX team; and (5) formulation of policy recommendations. Results: International use cases of national AI imaging platforms (n=7) were prioritized for PESTLE factor analysis. Stakeholders (n=13) from the international use cases were interviewed. Themes (n=8) from the semistructured interviews, including interview quotes, were analyzed with workshop participants (n=5). The outputs of the deep dives, interviews, and workshop were synthesized thematically into 8 categories with 17 subcategories. On the basis of the insights from the international use cases, policy recommendations (n=12) were developed to support the NHS AI Lab in the design and development of the English national medical imaging platform. Conclusions: The creation of AI algorithms supporting technology and infrastructure such as platforms often occurs in isolation within countries, let alone between countries. This novel policy research project sought to bridge the gap by learning from the challenges, successes, and experience of England's international counterparts. Policy recommendations based on international learnings focused on the demonstrable benefits of the platform to secure sustainable funding, validation of algorithms and infrastructure to support in situ deployment, and creating wraparound tools for nontechnical participants such as clinicians to engage with algorithm creation. As health care organizations increasingly adopt technological solutions, policy makers have a responsibility to ensure that initiatives are informed by learnings from both national and international initiatives as well as disseminating the outcomes of their work. UR - https://ai.jmir.org/2024/1/e51168 UR - http://dx.doi.org/10.2196/51168 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/51168 ER - TY - JOUR AU - Brann, Felix AU - Sterling, William Nicholas AU - Frisch, O. Stephanie AU - Schrager, D. Justin PY - 2024/1/25 TI - Sepsis Prediction at Emergency Department Triage Using Natural Language Processing: Retrospective Cohort Study JO - JMIR AI SP - e49784 VL - 3 KW - natural language processing KW - machine learning KW - sepsis KW - emergency department KW - triage N2 - Background: Despite its high lethality, sepsis can be difficult to detect on initial presentation to the emergency department (ED). Machine learning-based tools may provide avenues for earlier detection and lifesaving intervention. Objective: The study aimed to predict sepsis at the time of ED triage using natural language processing of nursing triage notes and available clinical data. Methods: We constructed a retrospective cohort of all 1,234,434 consecutive ED encounters in 2015-2021 from 4 separate clinically heterogeneous academically affiliated EDs. After exclusion criteria were applied, the final cohort included 1,059,386 adult ED encounters. The primary outcome criteria for sepsis were presumed severe infection and acute organ dysfunction.
After vectorization and dimensional reduction of triage notes and clinical data available at triage, a decision tree-based ensemble (time-of-triage) model was trained to predict sepsis using the training subset (n=950,921). A separate (comprehensive) model was trained using these data and laboratory data, as it became available at 1-hour intervals, after triage. Model performances were evaluated using the test (n=108,465) subset. Results: Sepsis occurred in 35,318 encounters (incidence 3.45%). For sepsis prediction at the time of patient triage, using the primary definition, the area under the receiver operating characteristic curve (AUC) and macro F1-score for sepsis were 0.94 and 0.61, respectively. Sensitivity, specificity, and false positive rate were 0.87, 0.85, and 0.15, respectively. The time-of-triage model accurately predicted sepsis in 76% (1635/2150) of sepsis cases where sepsis screening was not initiated at triage and 97.5% (1630/1671) of cases where sepsis screening was initiated at triage. Positive and negative predictive values were 0.18 and 0.99, respectively. For sepsis prediction using laboratory data available each hour after ED arrival, the AUC peaked at 0.97 at 12 hours. Similar results were obtained when stratifying by hospital and when the Centers for Disease Control and Prevention hospital toolkit for adult sepsis surveillance criteria were used to define sepsis. Among septic cases, sepsis was predicted in 36.1% (1375/3814), 49.9% (1902/3814), and 68.3% (2604/3814) of encounters, respectively, at 3, 2, and 1 hours prior to the first intravenous antibiotic order or where antibiotics were not ordered within the first 12 hours. Conclusions: Sepsis can accurately be predicted at ED presentation using nursing triage notes and clinical information available at the time of triage. This indicates that machine learning can facilitate timely and reliable alerting for intervention. Free-text data can improve the performance of predictive modeling at the time of triage and throughout the ED course. UR - https://ai.jmir.org/2024/1/e49784 UR - http://dx.doi.org/10.2196/49784 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875594 ID - info:doi/10.2196/49784 ER - TY - JOUR AU - Xie, Fagen AU - Chang, Jenny AU - Luong, Tiffany AU - Wu, Bechien AU - Lustigova, Eva AU - Shrader, Eva AU - Chen, Wansu PY - 2024/1/15 TI - Identifying Symptoms Prior to Pancreatic Ductal Adenocarcinoma Diagnosis in Real-World Care Settings: Natural Language Processing Approach JO - JMIR AI SP - e51240 VL - 3 KW - cancer KW - pancreatic ductal adenocarcinoma KW - symptom KW - clinical note KW - electronic health record KW - natural language processing KW - computerized algorithm KW - pancreatic cancer KW - cancer death KW - abdominal pain KW - pain KW - validation KW - detection KW - pancreas N2 - Background: Pancreatic cancer is the third leading cause of cancer deaths in the United States. Pancreatic ductal adenocarcinoma (PDAC) is the most common form of pancreatic cancer, accounting for up to 90% of all cases. Patient-reported symptoms are often the triggers of cancer diagnosis and therefore, understanding the PDAC-associated symptoms and the timing of symptom onset could facilitate early detection of PDAC. Objective: This paper aims to develop a natural language processing (NLP) algorithm to capture symptoms associated with PDAC from clinical notes within a large integrated health care system.
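[Editor's note] The sepsis triage study above vectorizes free-text triage notes, reduces their dimensionality, and feeds them to a decision tree-based ensemble. A toy scikit-learn pipeline in that spirit; the notes, labels, and component choices are illustrative, not the study's actual configuration, and the real models were trained on roughly a million encounters.

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy nursing triage notes with hypothetical sepsis labels (1 = septic).
notes = ["fever and confusion, possible infection",
         "ankle sprain after fall",
         "hypotension, tachycardia, suspected urosepsis",
         "medication refill request"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # vectorize the free-text notes
    TruncatedSVD(n_components=2),          # dimensional reduction
    GradientBoostingClassifier(),          # decision tree-based ensemble
)
pipeline.fit(notes, labels)
print(pipeline.predict_proba(["febrile, rigors, low blood pressure"])[:, 1])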
Methods: We used unstructured data within 2 years prior to PDAC diagnosis between 2010 and 2019 and among matched patients without PDAC to identify 17 PDAC-related symptoms. Related terms and phrases were first compiled from publicly available resources and then recursively reviewed and enriched with input from clinicians and chart review. A computerized NLP algorithm was iteratively developed and fine-tuned via multiple rounds of chart review followed by adjudication. Finally, the developed algorithm was applied to the validation data set to assess performance and then to the study implementation notes. Results: A total of 408,147 and 709,789 notes were retrieved from 2611 patients with PDAC and 10,085 matched patients without PDAC, respectively. In descending order, the symptom distribution of the study implementation notes ranged from 4.98% for abdominal or epigastric pain to 0.05% for upper extremity deep vein thrombosis in the PDAC group, and from 1.75% for back pain to 0.01% for pale stool in the non-PDAC group. Validation of the NLP algorithm against adjudicated chart review results of 1000 notes showed that precision ranged from 98.9% (jaundice) to 84% (upper extremity deep vein thrombosis), recall ranged from 98.1% (weight loss) to 82.8% (epigastric bloating), and F1-scores ranged from 0.97 (jaundice) to 0.86 (depression). Conclusions: The developed and validated NLP algorithm could be used for the early detection of PDAC. UR - https://ai.jmir.org/2024/1/e51240 UR - http://dx.doi.org/10.2196/51240 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875566 ID - info:doi/10.2196/51240 ER - TY - JOUR AU - Irie, Fumi AU - Matsumoto, Koutarou AU - Matsuo, Ryu AU - Nohara, Yasunobu AU - Wakisaka, Yoshinobu AU - Ago, Tetsuro AU - Nakashima, Naoki AU - Kitazono, Takanari AU - Kamouchi, Masahiro PY - 2024/1/11 TI - Predictive Performance of Machine Learning-Based Models for Poststroke Clinical Outcomes in Comparison With Conventional Prognostic Scores: Multicenter, Hospital-Based Observational Study JO - JMIR AI SP - e46840 VL - 3 KW - brain infarction KW - outcome KW - prediction KW - machine learning KW - prognostic score N2 - Background: Although machine learning is a promising tool for making prognoses, the performance of machine learning in predicting outcomes after stroke remains to be examined. Objective: This study aims to examine how much data-driven models with machine learning improve predictive performance for poststroke outcomes compared with conventional stroke prognostic scores and to elucidate how explanatory variables in machine learning-based models differ from the items of the stroke prognostic scores. Methods: We used data from 10,513 patients who were registered in a multicenter prospective stroke registry in Japan between 2007 and 2017. The outcomes were poor functional outcome (modified Rankin Scale score >2) and death at 3 months after stroke. Machine learning-based models were developed using all variables with regularization methods, random forests, or boosted trees. We selected 3 stroke prognostic scores, namely, ASTRAL (Acute Stroke Registry and Analysis of Lausanne), PLAN (preadmission comorbidities, level of consciousness, age, neurologic deficit), and iScore (Ischemic Stroke Predictive Risk Score) for comparison. Item-based regression models were developed using the items of these 3 scores. The model performance was assessed in terms of discrimination and calibration.
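[Editor's note] The stroke study above assesses models on both discrimination and calibration. A minimal sketch of how these two properties are commonly quantified (AUC for discrimination; Brier score and a calibration curve for calibration), on synthetic data rather than the registry data.

from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Discrimination: how well predictions rank cases above noncases.
print("discrimination (AUC):", round(roc_auc_score(y_te, proba), 3))
# Calibration: how closely predicted probabilities match observed frequencies.
print("calibration (Brier):", round(brier_score_loss(y_te, proba), 3))
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
print("calibration curve bins:", list(zip(mean_pred.round(2), frac_pos.round(2))))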
To compare the predictive performance of the data-driven model with that of the item-based model, we performed internal validation after random splits of identical populations into 80% of patients as a training set and 20% of patients as a test set; the models were developed in the training set and were validated in the test set. We evaluated the contribution of each variable to the models and compared the predictors used in the machine learning-based models with the items of the stroke prognostic scores. Results: The mean age of the study patients was 73.0 (SD 12.5) years, and 59.1% (6209/10,513) of them were men. The area under the receiver operating characteristic curves and the area under the precision-recall curves for predicting poststroke outcomes were higher for machine learning-based models than for item-based models in identical populations after random splits. Machine learning-based models also performed better than item-based models in terms of the Brier score. Machine learning-based models used explanatory variables different from the items of the conventional stroke prognostic scores, such as laboratory data. Including these data in the machine learning-based models as explanatory variables improved performance in predicting outcomes after stroke, especially poststroke death. Conclusions: Machine learning-based models performed better in predicting poststroke outcomes than regression models using the items of conventional stroke prognostic scores, although they required additional variables, such as laboratory data, to attain improved performance. Further studies are warranted to validate the usefulness of machine learning in clinical settings. UR - https://ai.jmir.org/2024/1/e46840 UR - http://dx.doi.org/10.2196/46840 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875590 ID - info:doi/10.2196/46840 ER - TY - JOUR AU - Iwamoto, Hiroki AU - Nakano, Saki AU - Tajima, Ryotaro AU - Kiguchi, Ryo AU - Yoshida, Yuki AU - Kitanishi, Yoshitake AU - Aoki, Yasunori PY - 2024/8/2 TI - Predicting Workers' Stress: Application of a High-Performance Algorithm Using Working-Style Characteristics JO - JMIR AI SP - e55840 VL - 3 KW - high-performance algorithm KW - Japan KW - questionnaire KW - stress prediction model KW - teleworking KW - wearable device N2 - Background: Work characteristics, such as teleworking rate, have been studied in relation to stress. However, the use of work-related data to improve a high-performance stress prediction model that suits an individual's lifestyle has not been evaluated. Objective: This study aims to develop a novel, high-performance algorithm to predict an employee's stress among a group of employees with similar working characteristics. Methods: This prospective observational study evaluated participants' responses to web-based questionnaires, including attendance records and data collected using a wearable device. Data spanning 12 weeks (between January 17, 2022, and April 10, 2022) were collected from 194 Shionogi Group employees. Participants wore the Fitbit Charge 4 wearable device, which collected data on daily sleep, activity, and heart rate. Daily work shift data included details of working hours. Weekly questionnaire responses included the K6 questionnaire for depression/anxiety, a behavioral questionnaire, and the number of days lunch was missed. The proposed prediction model used a neighborhood cluster (N=20) with working-style characteristics similar to those of the prediction target person.
Data from the previous week predicted stress levels the following week. Three models were compared by selecting appropriate training data: (1) single model, (2) proposed method 1, and (3) proposed method 2. Shapley Additive Explanations (SHAP) were calculated for the top 10 extracted features from the Extreme Gradient Boosting (XGBoost) model to evaluate the amount and contribution direction categorized by teleworking rates (mean): low: <0.2 (more than 4 days/week in office), middle: 0.2 to <0.6 (2 to 4 days/week in office), and high: ≥0.6 (less than 2 days/week in office). Results: Data from 190 participants were used, with a teleworking rate ranging from 0% to 79%. The area under the curve (AUC) of the proposed method 2 was 0.84 (true positive rate vs false positive rate: 0.77 vs 0.26). Among participants with low teleworking rates, most features extracted were related to sleep, followed by activity and work. Among participants with high teleworking rates, most features were related to activity, followed by sleep and work. SHAP analysis showed that for participants with high teleworking rates, skipping lunch, working more/less than scheduled, higher fluctuations in heart rate, and lower mean sleep duration contributed to stress. In participants with low teleworking rates, coming too early or late to work (before/after 9 AM), a higher/lower than mean heart rate, lower fluctuations in heart rate, and burning more/fewer calories than normal contributed to stress. Conclusions: Forming a neighborhood cluster with similar working styles based on teleworking rates and using it as training data improved the prediction performance. The validity of the neighborhood cluster approach is indicated by differences in the contributing features and their contribution directions among teleworking levels. Trial Registration: UMIN UMIN000046394; https://www.umin.ac.jp/ctr/index.htm UR - https://ai.jmir.org/2024/1/e55840 UR - http://dx.doi.org/10.2196/55840 UR - http://www.ncbi.nlm.nih.gov/pubmed/39093604 ID - info:doi/10.2196/55840 ER - TY - JOUR AU - Mullick, Tahsin AU - Shaaban, Sam AU - Radovic, Ana AU - Doryab, Afsaneh PY - 2024/5/20 TI - Framework for Ranking Machine Learning Predictions of Limited, Multimodal, and Longitudinal Behavioral Passive Sensing Data: Combining User-Agnostic and Personalized Modeling JO - JMIR AI SP - e47805 VL - 3 KW - machine learning KW - AI KW - artificial intelligence KW - passive sensing KW - ranking framework KW - small health data set KW - ranking KW - algorithm KW - algorithms KW - sensor KW - multimodal KW - predict KW - prediction KW - agnostic KW - framework KW - validation KW - data set N2 - Background: Passive mobile sensing provides opportunities for measuring and monitoring health status in the wild and outside of clinics. However, longitudinal, multimodal mobile sensor data can be small, noisy, and incomplete. This makes processing, modeling, and prediction of these data challenging. The small size of the data set restricts it from being modeled using complex deep learning networks. The current state of the art (SOTA) tackles small sensor data sets following a singular modeling paradigm based on traditional machine learning (ML) algorithms.
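[Editor's note] A compact sketch of the SHAP-on-XGBoost workflow used in the stress prediction study above, assuming the shap and xgboost packages are installed; the features here are synthetic stand-ins for the study's sleep, activity, heart rate, and work variables.

import numpy as np
import shap
import xgboost
from sklearn.datasets import make_classification

# Synthetic stand-in features and a binary stress label.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgboost.XGBClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one row of contributions per sample

# Rank features by mean absolute SHAP value and keep the top 10;
# the sign of each value gives the contribution direction per sample.
importance = np.abs(shap_values).mean(axis=0)
top10 = np.argsort(importance)[::-1][:10]
print("top features:", top10, importance[top10].round(3))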
These SOTA approaches opt for either a user-agnostic modeling approach, which leaves the model susceptible to a larger degree of noise, or a personalized approach, in which training on individual data entails a more limited data set and gives rise to overfitting; practitioners must therefore seek a trade-off, choosing 1 of the 2 modeling approaches to reach predictions. Objective: The objective of this study was to filter, rank, and output the best predictions for small, multimodal, longitudinal sensor data using a framework that is designed to tackle data sets that are limited in size (particularly targeting health studies that use passive multimodal sensors) and that combines both user-agnostic and personalized approaches, along with a combination of ranking strategies to filter predictions. Methods: In this paper, we introduced a novel ranking framework for longitudinal multimodal sensors (FLMS) to address challenges encountered in health studies involving passive multimodal sensors. Using the FLMS, we (1) built a tensor-based aggregation and ranking strategy for final interpretation, (2) processed various combinations of sensor fusions, and (3) balanced user-agnostic and personalized modeling approaches with appropriate cross-validation strategies. The performance of the FLMS was validated on a real data set of adolescents diagnosed with major depressive disorder, predicting change in depression among the participants. Results: Predictions output by the proposed FLMS achieved a 7% increase in accuracy and a 13% increase in recall for the real data set. Experiments with existing SOTA ML algorithms showed an 11% increase in accuracy for the depression data set and demonstrated how overfitting and sparsity were handled. Conclusions: The FLMS aims to fill the gap that currently exists when modeling passive sensor data with a small number of data points. It achieves this through leveraging both user-agnostic and personalized modeling techniques in tandem with an effective ranking strategy to filter predictions. UR - https://ai.jmir.org/2024/1/e47805 UR - http://dx.doi.org/10.2196/47805 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875667 ID - info:doi/10.2196/47805 ER - TY - JOUR AU - Li, Joe AU - Washington, Peter PY - 2024/5/10 TI - A Comparison of Personalized and Generalized Approaches to Emotion Recognition Using Consumer Wearable Devices: Machine Learning Study JO - JMIR AI SP - e52171 VL - 3 KW - affect detection KW - affective computing KW - deep learning KW - digital health KW - emotion recognition KW - machine learning KW - mental health KW - personalization KW - stress detection KW - wearable technology N2 - Background: There are a wide range of potential adverse health effects, ranging from headaches to cardiovascular disease, associated with long-term negative emotions and chronic stress. Because many indicators of stress are imperceptible to observers, the early detection of stress remains a pressing medical need, as it can enable early intervention. Physiological signals offer a noninvasive method for monitoring affective states and are recorded by a growing number of commercially available wearables. Objective: We aim to study the differences between personalized and generalized machine learning models for 3-class emotion classification (neutral, stress, and amusement) using wearable biosignal data.
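[Editor's note] The distinction between participant-exclusive and participant-inclusive evaluation in the study above can be made concrete with grouped cross-validation: in a participant-exclusive setup, no participant appears in both the training and test folds. A sketch on synthetic data, with a random forest standing in for the study's neural network.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic biosignal windows: 15 hypothetical participants, 3 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 8))
y = rng.integers(0, 3, size=1500)
participants = np.repeat(np.arange(15), 100)

# Participant-exclusive evaluation: GroupKFold keeps every participant's
# windows in a single fold, mimicking the generalized-model setting.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         groups=participants, cv=GroupKFold(n_splits=5))
print("participant-exclusive accuracy per fold:", scores.round(3))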
Methods: We developed a neural network for the 3-class emotion classification problem using data from the Wearable Stress and Affect Detection (WESAD) data set, a multimodal data set with physiological signals from 15 participants. We compared the results between a participant-exclusive generalized, a participant-inclusive generalized, and a personalized deep learning model. Results: For the 3-class classification problem, our personalized model achieved an average accuracy of 95.06% and an F1-score of 91.71%; our participant-inclusive generalized model achieved an average accuracy of 66.95% and an F1-score of 42.50%; and our participant-exclusive generalized model achieved an average accuracy of 67.65% and an F1-score of 43.05%. Conclusions: Our results emphasize the need for increased research in personalized emotion recognition models given that they outperform generalized models in certain contexts. We also demonstrate that personalized machine learning models for emotion classification are viable and can achieve high performance. UR - https://ai.jmir.org/2024/1/e52171 UR - http://dx.doi.org/10.2196/52171 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875573 ID - info:doi/10.2196/52171 ER - TY - JOUR AU - Lee, Kyeryoung AU - Liu, Zongzhi AU - Mai, Yun AU - Jun, Tomi AU - Ma, Meng AU - Wang, Tongyu AU - Ai, Lei AU - Calay, Ediz AU - Oh, William AU - Stolovitzky, Gustavo AU - Schadt, Eric AU - Wang, Xiaoyan PY - 2024/7/29 TI - Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation JO - JMIR AI SP - e50800 VL - 3 KW - natural language processing KW - real-world data KW - clinical trial eligibility criteria KW - eligibility criteria-specific ontology KW - clinical trial protocol optimization KW - data-driven approach N2 - Background: Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocol, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to achieve these objectives. Objective: This study aims to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records using deep learning-based NLP techniques. Methods: We obtained data of 3281 industry-sponsored phase 2 or 3 interventional clinical trials recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn disease from ClinicalTrials.gov, spanning the period between 2013 and 2020. A customized bidirectional long short-term memory- and conditional random field-based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms along with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of patients with non-small cell lung cancer (n=2775), curated from the Mount Sinai Health System, as a pilot study. Results: We manually annotated the clinical trial eligibility corpus (485/3281, 14.78% of trials) and constructed an eligibility criteria-specific ontology.
Our customized NLP pipeline, developed based on the eligibility criteria–specific ontology that we created through manual annotation, achieved high precision (0.91, range 0.67-1.00) and recall (0.79, range 0.50-1.00) scores, as well as a high F1-score (0.83, range 0.67-1.00), enabling the efficient extraction of granular criteria entities and relevant attributes from 3281 clinical trials. A standardized eligibility criteria knowledge base, compatible with electronic health records, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. In addition, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients. Conclusions: Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of patients eligible for the trial. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process, optimizing processes, and improving efficiency in patient identification. UR - https://ai.jmir.org/2024/1/e50800 UR - http://dx.doi.org/10.2196/50800 UR - http://www.ncbi.nlm.nih.gov/pubmed/39073872 ID - info:doi/10.2196/50800 ER - TY - JOUR AU - Majdik, P. Zoltan AU - Graham, Scott S. AU - Shiva Edward, C. Jade AU - Rodriguez, N. Sabrina AU - Karnes, S. Martha AU - Jensen, T. Jared AU - Barbour, B. Joshua AU - Rousseau, F. Justin PY - 2024/5/16 TI - Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study JO - JMIR AI SP - e52095 VL - 3 KW - named-entity recognition KW - large language models KW - fine-tuning KW - transfer learning KW - expert annotation KW - annotation KW - sample size KW - sample KW - language model KW - machine learning KW - natural language processing KW - disclosure KW - disclosures KW - statement KW - statements KW - conflict of interest N2 - Background: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. Objective: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. Methods: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score).
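The sample-size analysis described above has a simple computational core; the sketch below regresses a synthetic F1-score on training-set size (sentences) and entity density (EPS) with scikit-learn, purely to show the two-predictor model form rather than the article's data.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    sentences = rng.uniform(50, 1000, 200)      # training set size per fine-tuning run
    eps = rng.uniform(0.5, 2.0, 200)            # entities per sentence
    # Synthetic response with diminishing returns past ~500 sentences, plus noise.
    f1 = (0.6 + 0.0003 * np.minimum(sentences, 500) + 0.05 * eps
          + rng.normal(0, 0.02, 200))

    X = np.column_stack([sentences, eps])
    reg = LinearRegression().fit(X, f1)
    print("R^2:", round(reg.score(X, f1), 3))
    print("coefficients (sentences, EPS):", reg.coef_)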
Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. Results: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38. Conclusions: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size. UR - https://ai.jmir.org/2024/1/e52095 UR - http://dx.doi.org/10.2196/52095 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875593 ID - info:doi/10.2196/52095 ER - TY - JOUR AU - De Souza, Jessica AU - Viswanath, Kumar Varun AU - Echterhoff, Maria Jessica AU - Chamberlain, Kristina AU - Wang, Jay Edward PY - 2024/6/24 TI - Augmenting Telepostpartum Care With Vision-Based Detection of Breastfeeding-Related Conditions: Algorithm Development and Validation JO - JMIR AI SP - e54798 VL - 3 KW - remote consultations KW - artificial intelligence KW - AI for health care KW - deep learning KW - detection model KW - breastfeeding KW - telehealth KW - perinatal health KW - image analysis KW - women's health KW - mobile phone N2 - Background: Breastfeeding benefits both the mother and infant and is a topic of attention in public health. After childbirth, untreated medical conditions or lack of support lead many mothers to discontinue breastfeeding. For instance, nipple damage and mastitis affect 80% and 20% of US mothers, respectively. Lactation consultants (LCs) help mothers with breastfeeding, providing in-person, remote, and hybrid lactation support. LCs guide, encourage, and find ways for mothers to have a better experience breastfeeding. Current telehealth services help mothers seek LCs for breastfeeding support, where images help them identify and address many issues. Due to the disproportionate ratio of LCs to mothers in need, these professionals are often overloaded and burned out. Objective: This study aims to investigate the effectiveness of 5 distinct convolutional neural networks in detecting healthy lactating breasts and 6 breastfeeding-related issues using only red, green, and blue images. Our goal was to assess the applicability of this algorithm as an auxiliary resource for LCs to identify painful breast conditions quickly, better manage their patients through triage, respond promptly to patient needs, and enhance the overall experience and care for breastfeeding mothers.
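For orientation, here is a minimal sketch of the transfer-learning setup that a study like the one above implies: a torchvision ResNet-50 whose final layer is replaced by a 7-class head. Weights are left randomly initialized here for brevity; a real pipeline would load pretrained ImageNet weights and fine-tune on labeled images.

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 7  # healthy + 6 breastfeeding-related conditions

    model = models.resnet50(weights=None)  # in practice: pretrained ImageNet weights
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace classifier head

    images = torch.randn(4, 3, 224, 224)   # synthetic batch of RGB images
    labels = torch.randint(0, NUM_CLASSES, (4,))
    loss = nn.CrossEntropyLoss()(model(images), labels)
    loss.backward()                         # one illustrative fine-tuning step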
Methods: We evaluated the potential for 5 classification models to detect breastfeeding-related conditions using 1078 breast and nipple images gathered from web-based and physical educational resources. We used the convolutional neural networks ResNet50, Visual Geometry Group model with 16 layers (VGG16), InceptionV3, EfficientNetV2, and DenseNet169 to classify the images across 7 classes: healthy, abscess, mastitis, nipple blebs, dermatosis, engorgement, and nipple damage by improper feeding or misuse of breast pumps. We also evaluated the models' ability to distinguish between healthy and unhealthy images. We present an analysis of the classification challenges, identifying image traits that may confound the detection model. Results: The best model achieves an average area under the receiver operating characteristic curve of 0.93 for all conditions after data augmentation for multiclass classification. For binary classification, the best model achieved an average area under the curve of 0.96 for all conditions after data augmentation. Several factors contributed to the misclassification of images, including similar visual features in the conditions that precede other conditions (such as the mastitis spectrum disorder), partially covered breasts or nipples, and images depicting multiple conditions in the same breast. Conclusions: This vision-based automated detection technique offers an opportunity to enhance postpartum care for mothers and can potentially help alleviate the workload of LCs by expediting decision-making processes. UR - https://ai.jmir.org/2024/1/e54798 UR - http://dx.doi.org/10.2196/54798 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54798 ER - TY - JOUR AU - Pan, Cheng AU - Luo, Hao AU - Cheung, Gary AU - Zhou, Huiquan AU - Cheng, Reynold AU - Cullum, Sarah AU - Wu, Chuan PY - 2024/1/31 TI - Identifying Frailty in Older Adults Receiving Home Care Assessment Using Machine Learning: Longitudinal Observational Study on the Role of Classifier, Feature Selection, and Sample Size JO - JMIR AI SP - e44185 VL - 3 KW - machine learning KW - logistic regression KW - frailty KW - older adults KW - home care KW - sample size KW - features KW - data set KW - model KW - mortality prediction KW - assessment N2 - Background: Machine learning techniques are starting to be used in various health care data sets to identify frail persons who may benefit from interventions. However, evidence about the performance of machine learning techniques compared to conventional regression is mixed. It is also unclear what methodological and database factors are associated with performance. Objective: This study aimed to compare the mortality prediction accuracy of various machine learning classifiers for identifying frail older adults in different scenarios. Methods: We used deidentified data collected from older adults (65 years of age and older) assessed with the interRAI-Home Care instrument in New Zealand between January 1, 2012, and December 31, 2016. A total of 138 interRAI assessment items were used to predict 6-month and 12-month mortality, using 3 machine learning classifiers (random forest [RF], extreme gradient boosting [XGBoost], and multilayer perceptron [MLP]) and regularized logistic regression. We conducted a simulation study comparing the performance of machine learning models with logistic regression and the interRAI Home Care Frailty Scale and examined the effects of sample sizes, the number of features, and train-test split ratios.
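A toy version of the simulation just described, assuming nothing about the interRAI data, might look like the following: regularized logistic regression and a random forest compared by AUC across increasing synthetic sample sizes.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    for n in (1000, 4000, 16000):
        X, y = make_classification(n_samples=n, n_features=80, n_informative=20,
                                   random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        for name, clf in [("regularized LR", LogisticRegression(C=1.0, max_iter=1000)),
                          ("random forest", RandomForestClassifier(random_state=0))]:
            auc = roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
            print(f"n={n:>6} {name}: AUC={auc:.3f}")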
Results: A total of 95,042 older adults (median age 82.66 years, IQR 77.92-88.76; n=37,462, 39.42% male) receiving home care were analyzed. The average area under the curve (AUC) and sensitivities of 6-month mortality prediction showed that machine learning classifiers did not outperform regularized logistic regressions. In terms of AUC, regularized logistic regression had better performance than XGBoost, MLP, and RF when the number of features was ≤80 and the sample size ≤16,000; MLP outperformed regularized logistic regression in terms of sensitivities when the number of features was ≥40 and the sample size ≥4000. Conversely, RF and XGBoost demonstrated higher specificities than regularized logistic regression in all scenarios. Conclusions: The study revealed that machine learning models exhibited significant variation in prediction performance when evaluated using different metrics. Regularized logistic regression was an effective model for identifying frail older adults receiving home care, as indicated by the AUC, particularly when the number of features and sample sizes were not excessively large. Conversely, MLP displayed superior sensitivity, while RF exhibited superior specificity when the number of features and sample sizes were large. UR - https://ai.jmir.org/2024/1/e44185 UR - http://dx.doi.org/10.2196/44185 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/44185 ER - TY - JOUR AU - Kurasawa, Hisashi AU - Waki, Kayo AU - Seki, Tomohisa AU - Chiba, Akihiro AU - Fujino, Akinori AU - Hayashi, Katsuyoshi AU - Nakahara, Eri AU - Haga, Tsuneyuki AU - Noguchi, Takashi AU - Ohe, Kazuhiko PY - 2024/7/18 TI - Enhancing Type 2 Diabetes Treatment Decisions With Interpretable Machine Learning Models for Predicting Hemoglobin A1c Changes: Machine Learning Model Development JO - JMIR AI SP - e56700 VL - 3 KW - AI KW - artificial intelligence KW - attention weight KW - type 2 diabetes KW - blood glucose control KW - machine learning KW - transformer N2 - Background: Type 2 diabetes (T2D) is a significant global health challenge. Physicians need to assess whether future glycemic control will be poor on the current trajectory of usual care and usual-care treatment intensifications so that they can consider taking extra treatment measures to prevent poor outcomes. Predicting poor glycemic control from trends in hemoglobin A1c (HbA1c) levels is difficult due to the influence of seasonal fluctuations and other factors. Objective: We sought to develop a model that accurately predicts poor glycemic control among patients with T2D receiving usual care. Methods: Our machine learning model predicts poor glycemic control (HbA1c ≥8%) using the transformer architecture, incorporating an attention mechanism to process irregularly spaced HbA1c time series and quantify temporal relationships of past HbA1c levels at each time point. We assessed the model using HbA1c levels from 7787 patients with T2D seeing specialist physicians at the University of Tokyo Hospital. The training data include instances of poor glycemic control occurring during usual care with usual-care treatment intensifications. We compared prediction accuracy, assessed with the area under the receiver operating characteristic curve, the area under the precision-recall curve, and the accuracy rate, to that of LightGBM.
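To sketch the core idea of attention over irregularly spaced laboratory values, here is a minimal PyTorch encoder (an illustration, not the authors' model) that embeds each (HbA1c value, time gap) pair and pools self-attention outputs into a single probability of poor control; all dimensions are placeholder assumptions.

    import torch
    import torch.nn as nn

    class HbA1cTransformer(nn.Module):
        def __init__(self, d_model=32, nhead=4, num_layers=2):
            super().__init__()
            self.proj = nn.Linear(2, d_model)  # (HbA1c value, gap in days) -> embedding
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.head = nn.Linear(d_model, 1)

        def forward(self, series):                 # (batch, visits, 2)
            h = self.encoder(self.proj(series))    # self-attention relates visits
            return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)

    model = HbA1cTransformer()
    visits = torch.rand(8, 10, 2)                  # 8 synthetic patients, 10 visits each
    print(model(visits))                           # predicted probability of poor control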
Results: The area under the receiver operating characteristic curve, the area under the precision-recall curve, and the accuracy rate of the proposed model were 0.925 (95% CI 0.923-0.928), 0.864 (95% CI 0.852-0.875), and 0.864 (95% CI 0.860-0.869), respectively. The proposed model achieved high prediction accuracy comparable to or surpassing LightGBM's performance. The model prioritized the most recent HbA1c levels for predictions. Older HbA1c levels in patients with poor glycemic control were slightly more influential in predictions compared to patients with good glycemic control. Conclusions: The proposed model accurately predicts poor glycemic control for patients with T2D receiving usual care, including patients receiving usual-care treatment intensifications, allowing physicians to identify cases warranting extraordinary treatment intensifications. If used by a nonspecialist, the model's indication of likely future poor glycemic control may warrant a referral to a specialist. Future efforts could incorporate diverse and large-scale clinical data for improved accuracy. UR - https://ai.jmir.org/2024/1/e56700 UR - http://dx.doi.org/10.2196/56700 UR - http://www.ncbi.nlm.nih.gov/pubmed/39024008 ID - info:doi/10.2196/56700 ER - TY - JOUR AU - Chen, Anjun AU - Wu, Erman AU - Huang, Ran AU - Shen, Bairong AU - Han, Ruobing AU - Wen, Jian AU - Zhang, Zhiyong AU - Li, Qinghua PY - 2024/9/11 TI - Development of Lung Cancer Risk Prediction Machine Learning Models for Equitable Learning Health System: Retrospective Study JO - JMIR AI SP - e56590 VL - 3 KW - lung cancer KW - risk prediction KW - early detection KW - learning health system KW - LHS KW - machine learning KW - ML KW - artificial intelligence KW - AI KW - predictive model N2 - Background: A significant proportion of young at-risk patients and nonsmokers are excluded by the current guidelines for lung cancer (LC) screening, resulting in low screening adoption. The vision of the US National Academy of Medicine to transform health systems into learning health systems (LHS) holds promise for bringing necessary structural changes to health care, thereby addressing the exclusivity and adoption issues of LC screening. Objective: This study aims to realize the LHS vision by designing an equitable, machine learning (ML)–enabled LHS unit for LC screening. It focuses on developing an inclusive and practical LC risk prediction model, suitable for initializing the ML-enabled LHS (ML-LHS) unit. This model aims to empower primary physicians in a clinical research network, linking central hospitals and rural clinics, to routinely deliver risk-based screening for enhancing LC early detection in broader populations. Methods: We created a standardized data set of health factors from 1397 patients with LC and 1448 control patients, all aged 30 years and older, including both smokers and nonsmokers, from a hospital's electronic medical record system. Initially, a data-centric ML approach was used to create inclusive ML models for risk prediction from all available health factors. Subsequently, a quantitative distribution of LC health factors was used in feature engineering to refine the models into a more practical model with fewer variables. Results: The initial inclusive 250-variable XGBoost model for LC risk prediction achieved performance metrics of 0.86 recall, 0.90 precision, and 0.89 accuracy.
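As a hedged illustration of this kind of EMR-based risk model, with synthetic features standing in for the curated health factors, an XGBoost classifier and the reported metrics can be wired together as follows.

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # 29 synthetic health factors, echoing the refined model's variable count.
    X, y = make_classification(n_samples=2800, n_features=29, n_informative=12,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                        eval_metric="logloss")
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print("recall:", round(recall_score(y_te, pred), 2))
    print("precision:", round(precision_score(y_te, pred), 2))
    print("accuracy:", round(accuracy_score(y_te, pred), 2))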
After feature refinement, a practical 29-variable XGBoost model was developed, displaying performance metrics of 0.80 recall, 0.82 precision, and 0.82 accuracy. This model met the criteria for initializing the ML-LHS unit for risk-based, inclusive LC screening within clinical research networks. Conclusions: This study designed an innovative ML-LHS unit for a clinical research network, aiming to sustainably provide inclusive LC screening to all at-risk populations. It developed an inclusive and practical XGBoost model from hospital electronic medical record data, capable of initializing such an ML-LHS unit for community and rural clinics. The anticipated deployment of this ML-LHS unit is expected to significantly improve LC-screening rates and early detection among broader populations, including those typically overlooked by existing screening guidelines. UR - https://ai.jmir.org/2024/1/e56590 UR - http://dx.doi.org/10.2196/56590 UR - http://www.ncbi.nlm.nih.gov/pubmed/39259582 ID - info:doi/10.2196/56590 ER - TY - JOUR AU - Yaseliani, Mohammad AU - Noor-E-Alam, Md AU - Hasan, Mahmudul Md PY - 2024/8/20 TI - Mitigating Sociodemographic Bias in Opioid Use Disorder Prediction: Fairness-Aware Machine Learning Framework JO - JMIR AI SP - e55820 VL - 3 KW - opioid use disorder KW - fairness and bias KW - bias mitigation KW - machine learning KW - majority voting N2 - Background: Opioid use disorder (OUD) is a critical public health crisis in the United States, affecting >5.5 million Americans in 2021. Machine learning has been used to predict patient risk of incident OUD. However, little is known about the fairness and bias of these predictive models. Objective: The aims of this study are twofold: (1) to develop a machine learning bias mitigation algorithm for sociodemographic features and (2) to develop a fairness-aware weighted majority voting (WMV) classifier for OUD prediction. Methods: We used the 2020 National Survey on Drug Use and Health data to develop a neural network (NN) model using stochastic gradient descent (SGD; NN-SGD) and an NN model using Adam (NN-Adam) optimizers and evaluated sociodemographic bias by comparing the area under the curve values. A bias mitigation algorithm, based on equality of odds, was implemented to minimize disparities in specificity and recall. Finally, a WMV classifier was developed for fairness-aware prediction of OUD. To further analyze bias detection and mitigation, we performed 1-N matching of OUD to non-OUD cases, controlling for socioeconomic variables, and evaluated the performance of the proposed bias mitigation algorithm and WMV classifier. Results: Our bias mitigation algorithm substantially reduced bias with NN-SGD, by 21.66% for sex, 1.48% for race, and 21.04% for income, and with NN-Adam by 16.96% for sex, 8.87% for marital status, 8.45% for working condition, and 41.62% for race. The fairness-aware WMV classifier achieved a recall of 85.37% and 92.68% and an accuracy of 58.85% and 90.21% using NN-SGD and NN-Adam, respectively. The results after matching also indicated remarkable bias reduction with NN-SGD and NN-Adam, respectively, as follows: sex (0.14% vs 0.97%), marital status (12.95% vs 10.33%), working condition (14.79% vs 15.33%), race (60.13% vs 41.71%), and income (0.35% vs 2.21%). Moreover, the fairness-aware WMV classifier achieved high performance with a recall of 100% and 85.37% and an accuracy of 73.20% and 89.38% using NN-SGD and NN-Adam, respectively.
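The fairness-aware weighted majority vote can be pictured as below; this sketch is my construction rather than the authors' algorithm, weighting 2 classifiers' binary votes by hypothetical validation accuracies.

    import numpy as np

    def weighted_majority_vote(preds, weights):
        """Combine binary predictions (n_models, n_samples) by weighted vote."""
        preds = np.asarray(preds, dtype=float)
        weights = np.asarray(weights, dtype=float)[:, None]
        score = (weights * preds).sum(axis=0) / weights.sum()
        return (score >= 0.5).astype(int)

    # Hypothetical predictions from NN-SGD- and NN-Adam-style models.
    p_sgd = np.array([1, 0, 1, 1, 0, 1])
    p_adam = np.array([1, 1, 0, 1, 0, 0])
    votes = weighted_majority_vote([p_sgd, p_adam], weights=[0.59, 0.90])
    print(votes)   # the higher-weighted model dominates disagreements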
Conclusions: The application of the proposed bias mitigation algorithm shows promise in reducing sociodemographic bias, with the WMV classifier confirming bias reduction and high performance in OUD prediction. UR - https://ai.jmir.org/2024/1/e55820 UR - http://dx.doi.org/10.2196/55820 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/55820 ER - TY - JOUR AU - Patel, Dhavalkumar AU - Timsina, Prem AU - Gorenstein, Larisa AU - Glicksberg, S. Benjamin AU - Raut, Ganesh AU - Cheetirala, Narayan Satya AU - Santana, Fabio AU - Tamegue, Jules AU - Kia, Arash AU - Zimlichman, Eyal AU - Levin, A. Matthew AU - Freeman, Robert AU - Klang, Eyal PY - 2024/8/27 TI - Traditional Machine Learning, Deep Learning, and BERT (Large Language Model) Approaches for Predicting Hospitalizations From Nurse Triage Notes: Comparative Evaluation of Resource Management JO - JMIR AI SP - e52190 VL - 3 KW - Bio-Clinical-BERT KW - term frequency–inverse document frequency KW - TF-IDF KW - health informatics KW - patient care KW - hospital resource management KW - care KW - resource management KW - management KW - language model KW - machine learning KW - hospitalization KW - deep learning KW - logistic regression KW - retrospective analysis KW - training KW - large language model N2 - Background: Predicting hospitalization from nurse triage notes has the potential to augment care. However, careful consideration is needed when choosing models for this goal. Specifically, health systems will have varying degrees of computational infrastructure available and budget constraints. Objective: To this end, we compared the performance of a deep learning, Bidirectional Encoder Representations from Transformers (BERT)–based model, Bio-Clinical-BERT, with a bag-of-words (BOW) logistic regression (LR) model incorporating term frequency–inverse document frequency (TF-IDF). These choices represent different levels of computational requirements. Methods: A retrospective analysis was conducted using data from 1,391,988 patients who visited emergency departments in the Mount Sinai Health System spanning from 2017 to 2022. The models were trained on 4 hospitals' data and externally validated on a fifth hospital's data. Results: The Bio-Clinical-BERT model achieved higher areas under the receiver operating characteristic curve (0.82, 0.84, and 0.85) compared to the BOW-LR-TF-IDF model (0.81, 0.83, and 0.84) across training sets of 10,000; 100,000; and ~1,000,000 patients, respectively. Notably, both models proved effective at using triage notes for prediction, despite the modest performance gap. Conclusions: Our findings suggest that simpler machine learning models such as BOW-LR-TF-IDF could serve adequately in resource-limited settings. Given the potential implications for patient care and hospital resource management, further exploration of alternative models and techniques is warranted to enhance predictive performance in this critical domain.
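A minimal scikit-learn counterpart of the lighter-weight BOW-LR-TF-IDF approach, with made-up triage notes rather than Mount Sinai data, is shown below.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    notes = ["chest pain radiating to left arm, diaphoretic",
             "ankle sprain after fall, ambulatory",
             "shortness of breath, history of CHF",
             "minor laceration to finger, no active bleeding"]
    admitted = [1, 0, 1, 0]  # hypothetical hospitalization outcomes

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                          LogisticRegression(max_iter=1000))
    model.fit(notes, admitted)
    print(model.predict_proba(["elderly patient with acute chest pain"])[:, 1])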
International Registered Report Identifier (IRRID): RR2-10.1101/2023.08.07.23293699 UR - https://ai.jmir.org/2024/1/e52190 UR - http://dx.doi.org/10.2196/52190 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/52190 ER - TY - JOUR AU - Kamis, Arnold AU - Gadia, Nidhi AU - Luo, Zilin AU - Ng, Xin Shu AU - Thumbar, Mansi PY - 2024/8/29 TI - Obtaining the Most Accurate, Explainable Model for Predicting Chronic Obstructive Pulmonary Disease: Triangulation of Multiple Linear Regression and Machine Learning Methods JO - JMIR AI SP - e58455 VL - 3 KW - chronic obstructive pulmonary disease KW - COPD KW - cigarette smoking KW - ethnic and racial differences KW - machine learning KW - multiple linear regression KW - household income KW - practical model N2 - Background: Lung disease is a severe problem in the United States. Despite the decreasing rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) continues to be a health burden in the United States. In this paper, we focus on COPD in the United States from 2016 to 2019. Objective: We gathered a diverse set of non–personally identifiable information from public data sources to better understand and predict COPD rates at the core-based statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD. Methods: We integrated non–personally identifiable information from multiple Centers for Disease Control and Prevention sources and used them to analyze COPD with different types of methods. We included cigarette smoking, a well-known contributing factor, and race/ethnicity because health disparities among different races and ethnicities in the United States are also well known. The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods. Results: The most accurate multiple linear regression model has variance explained of 81.1%, mean absolute error of 0.591, and symmetric mean absolute percentage error of 9.666. The most accurate machine learning model has variance explained of 85.7%, mean absolute error of 0.456, and symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income are the strongest predictor variables. Moderately strong predictors include education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level. Conclusions: This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities in a model whose accuracy is superior to the best multiple linear regression. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health.
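Because the comparison above hinges on error metrics, here is a short sketch of MAE and one common definition of symmetric mean absolute percentage error (SMAPE); SMAPE definitions vary, so treat this as a reasonable reading rather than the authors' exact formula.

    import numpy as np

    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))

    def smape(y_true, y_pred):
        # 100 * mean of |error| over the average magnitude of the two values.
        return 100 * np.mean(np.abs(y_true - y_pred) /
                             ((np.abs(y_true) + np.abs(y_pred)) / 2))

    y_true = np.array([6.1, 5.4, 7.9, 4.8])   # hypothetical COPD rates (%)
    y_pred = np.array([5.8, 5.9, 7.4, 5.1])
    print(f"MAE={mae(y_true, y_pred):.3f}, SMAPE={smape(y_true, y_pred):.3f}")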
UR - https://ai.jmir.org/2024/1/e58455 UR - http://dx.doi.org/10.2196/58455 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58455 ER - TY - JOUR AU - Zhang, Boya AU - Naderi, Nona AU - Mishra, Rahul AU - Teodoro, Douglas PY - 2024/5/2 TI - Online Health Search Via Multidimensional Information Quality Assessment Based on Deep Language Models: Algorithm Development and Validation JO - JMIR AI SP - e42630 VL - 3 KW - health misinformation KW - information retrieval KW - deep learning KW - language model KW - transfer learning KW - infodemic N2 - Background: Widespread misinformation in web resources can lead to serious implications for individuals seeking health advice. Despite that, information retrieval models are often focused only on the query-document relevance dimension to rank results. Objective: We investigate a multidimensional information quality retrieval model based on deep learning to enhance the effectiveness of online health care information search results. Methods: In this study, we simulated online health information search scenarios with a topic set of 32 different health-related inquiries and a corpus containing 1 billion web documents from the April 2019 snapshot of Common Crawl. Using state-of-the-art pretrained language models, we assessed the quality of the retrieved documents according to their usefulness, supportiveness, and credibility dimensions for a given search query on 6030 human-annotated, query-document pairs. We evaluated this approach using transfer learning and more specific domain adaptation techniques. Results: In the transfer learning setting, the usefulness model provided the largest distinction between help- and harm-compatible documents, with a difference of +5.6%, leading to a majority of helpful documents in the top 10 retrieved. The supportiveness model achieved the best harm compatibility (+2.4%), while the combination of usefulness, supportiveness, and credibility models achieved the largest distinction between help- and harm-compatibility on helpful topics (+16.9%). In the domain adaptation setting, the linear combination of different models showed robust performance, with help-harm compatibility above +4.4% for all dimensions and going as high as +6.8%. Conclusions: These results suggest that integrating automatic ranking models created for specific information quality dimensions can increase the effectiveness of health-related information retrieval. Thus, our approach could be used to enhance searches made by individuals seeking online health information. UR - https://ai.jmir.org/2024/1/e42630 UR - http://dx.doi.org/10.2196/42630 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875551 ID - info:doi/10.2196/42630 ER - TY - JOUR AU - Baronetto, Annalisa AU - Graf, Luisa AU - Fischer, Sarah AU - Neurath, F. Markus AU - Amft, Oliver PY - 2024/7/10 TI - Multiscale Bowel Sound Event Spotting in Highly Imbalanced Wearable Monitoring Data: Algorithm Development and Validation Study JO - JMIR AI SP - e51118 VL - 3 KW - bowel sound KW - deep learning KW - event spotting KW - wearable sensors N2 - Background: Abdominal auscultation (i.e., listening to bowel sounds [BSs]) can be used to analyze digestion. An automated retrieval of BSs would be beneficial to assess gastrointestinal disorders noninvasively. Objective: This study aims to develop a multiscale spotting model to detect BSs in continuous audio data from a wearable monitoring system.
Methods: We designed a spotting model based on the Efficient-U-Net (EffUNet) architecture to analyze 10-second audio segments at a time and spot BSs with a temporal resolution of 25 ms. Evaluation data were collected across different digestive phases from 18 healthy participants and 9 patients with inflammatory bowel disease (IBD). Audio data were recorded in a daytime setting with a smart T-shirt that embeds digital microphones. The data set was annotated by independent raters with substantial agreement (Cohen κ between 0.70 and 0.75), resulting in 136 hours of labeled data. In total, 11,482 BSs were analyzed, with a BS duration ranging between 18 ms and 6.3 seconds. The share of BSs in the data set (BS ratio) was 0.0089. We analyzed the performance depending on noise level, BS duration, and BS event rate. We also report spotting timing errors. Results: Leave-one-participant-out cross-validation of BS event spotting yielded a median F1-score of 0.73 for both healthy volunteers and patients with IBD. EffUNet detected BSs under different noise conditions with 0.73 recall and 0.72 precision. In particular, for a signal-to-noise ratio over 4 dB, more than 83% of BSs were recognized, with precision of 0.77 or more. EffUNet recall dropped below 0.60 for BS duration of 1.5 seconds or less. At a BS ratio greater than 0.05, the precision of our model was over 0.83. For both healthy participants and patients with IBD, insertion and deletion timing errors were the largest, with a total of 15.54 minutes of insertion errors and 13.08 minutes of deletion errors over the total audio data set. On our data set, EffUNet outperformed existing BS spotting models that provide similar temporal resolution. Conclusions: The EffUNet spotter is robust against background noise and can retrieve BSs with varying duration. EffUNet outperforms previous BS detection approaches in unmodified audio data, containing highly sparse BS events. UR - https://ai.jmir.org/2024/1/e51118 UR - http://dx.doi.org/10.2196/51118 UR - http://www.ncbi.nlm.nih.gov/pubmed/38985504 ID - info:doi/10.2196/51118 ER - TY - JOUR AU - Ravaut, Mathieu AU - Zhao, Ruochen AU - Phung, Duy AU - Qin, Mengqi Vicky AU - Milovanovic, Dusan AU - Pienkowska, Anita AU - Bojic, Iva AU - Car, Josip AU - Joty, Shafiq PY - 2024/10/30 TI - Targeting COVID-19 and Human Resources for Health News Information Extraction: Algorithm Development and Validation JO - JMIR AI SP - e55059 VL - 3 KW - COVID-19 KW - SARS-CoV-2 KW - summary KW - summarize KW - news articles KW - deep learning KW - classification KW - summarization KW - machine learning KW - extract KW - extraction KW - news KW - media KW - NLP KW - natural language processing N2 - Background: Global pandemics like COVID-19 put a high amount of strain on health care systems and health workers worldwide. These crises generate a vast amount of news information published online across the globe. This extensive corpus of articles has the potential to provide valuable insights into the nature of ongoing events and guide interventions and policies. However, the sheer volume of information is beyond the capacity of human experts to process and analyze effectively. Objective: The aim of this study was to explore how natural language processing (NLP) can be leveraged to build a system that allows for quick analysis of a high volume of news articles.
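To make the 25 ms spotting resolution above concrete, the sketch below (an illustration, not the published model) converts ground-truth BS event intervals into the per-frame labels a spotting network is typically trained against.

    import numpy as np

    SEGMENT_S = 10.0      # analysis window length, as in the record above
    FRAME_S = 0.025       # 25 ms temporal resolution

    def intervals_to_frames(events, segment_s=SEGMENT_S, frame_s=FRAME_S):
        """events: list of (start_s, end_s) bowel sound intervals within a segment."""
        n_frames = int(segment_s / frame_s)            # 400 frames per 10-s segment
        labels = np.zeros(n_frames, dtype=int)
        for start, end in events:
            i0 = int(start / frame_s)
            i1 = int(np.ceil(end / frame_s))
            labels[i0:i1] = 1                          # mark frames overlapping a BS
        return labels

    frames = intervals_to_frames([(0.40, 0.46), (3.10, 4.25)])
    print(frames.sum(), "of", frames.size, "frames contain bowel sounds")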
Along with this, the objective was to create a workflow comprising human-computer symbiosis to derive valuable insights to support health workforce strategic policy dialogue, advocacy, and decision-making. Methods: We conducted a review of open-source news coverage from January 2020 to June 2022 on COVID-19 and its impacts on the health workforce from the World Health Organization (WHO) Epidemic Intelligence from Open Sources (EIOS) by synergizing NLP models, including classification and extractive summarization, and human-generated analyses. Our DeepCovid system was trained on 2.8 million news articles in English from more than 3000 internet sources across hundreds of jurisdictions. Results: Rules-based classification with hand-designed rules narrowed the data set to 8508 articles with high relevancy confirmed in the human-led evaluation. DeepCovid's automated information targeting component reached a very strong binary classification performance of 98.98 for the area under the receiver operating characteristic curve (ROC-AUC) and 47.21 for the area under the precision recall curve (PR-AUC). Its information extraction component attained good performance in automatic extractive summarization with a mean Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score of 47.76. DeepCovid's final summaries were used by human experts to write reports on the COVID-19 pandemic. Conclusions: It is feasible to synergize high-performing NLP models and human-generated analyses to benefit open-source health workforce intelligence. The DeepCovid approach can contribute to an agile and timely global view, providing complementary information to scientific literature. UR - https://ai.jmir.org/2024/1/e55059 UR - http://dx.doi.org/10.2196/55059 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/55059 ER - TY - JOUR AU - Rodriguez, V. Danissa AU - Chen, Ji AU - Viswanadham, N. Ratnalekha V. AU - Lawrence, Katharine AU - Mann, Devin PY - 2024/3/1 TI - Leveraging Machine Learning to Develop Digital Engagement Phenotypes of Users in a Digital Diabetes Prevention Program: Evaluation Study JO - JMIR AI SP - e47122 VL - 3 KW - machine learning KW - digital health KW - diabetes KW - mobile health KW - messaging platforms KW - user engagement KW - patient behavior KW - digital diabetes prevention programs KW - digital phenotypes KW - digital prescription KW - users KW - prevention KW - evaluation study KW - communication KW - support KW - engagement KW - phenotypes KW - digital health intervention KW - chronic disease management N2 - Background: Digital diabetes prevention programs (dDPPs) are effective "digital prescriptions" but have high attrition rates and program noncompletion. To address this, we developed a personalized automatic messaging system (PAMS) that leverages SMS text messaging and data integration into clinical workflows to increase dDPP engagement via enhanced patient-provider communication. Preliminary data showed positive results. However, further investigation is needed to determine how to optimize the tailoring of support technology such as PAMS based on a user's preferences to boost their dDPP engagement. Objective: This study evaluates leveraging machine learning (ML) to develop digital engagement phenotypes of dDPP users and assess ML's accuracy in predicting engagement with dDPP activities. This research will be used in a PAMS optimization process to improve PAMS personalization by incorporating engagement prediction and digital phenotyping.
This study aims (1) to prove the feasibility of using dDPP user-collected data to build an ML model that predicts engagement and contributes to identifying digital engagement phenotypes, (2) to describe methods for developing ML models with dDPP data sets and present preliminary results, and (3) to present preliminary data on user profiling based on ML model outputs. Methods: Using the gradient-boosted forest model, we predicted engagement in 4 dDPP individual activities (physical activity, lessons, social activity, and weigh-ins) and general activity (engagement in any activity) based on previous short- and long-term activity in the app. The area under the receiver operating characteristic curve, the area under the precision-recall curve, and the Brier score metrics determined the performance of the model. Shapley values reflected the feature importance of the models and determined what variables informed user profiling through latent profile analysis. Results: We developed 2 models using weekly and daily DPP data sets (328,821 and 704,242 records, respectively), which yielded predictive accuracies above 90%. Although both models were highly accurate, the daily model better fitted our research plan because it predicted daily changes in individual activities, which was crucial for creating the "digital phenotypes." To better understand the variables contributing to the model predictor, we calculated the Shapley values for both models to identify the features with the highest contribution to model fit; engagement with any activity in the dDPP in the last 7 days had the most predictive power. We profiled users with latent profile analysis after 2 weeks of engagement (Bayesian information criterion=−3222.46) with the dDPP and identified 6 profiles of users, including those with high engagement, minimal engagement, and attrition. Conclusions: Preliminary results demonstrate that applying ML methods with predictive power is an acceptable mechanism to tailor and optimize messaging interventions to support patient engagement and adherence to digital prescriptions. The results enable future optimization of our existing messaging platform and expansion of this methodology to other clinical domains. Trial Registration: ClinicalTrials.gov NCT04773834; https://www.clinicaltrials.gov/ct2/show/NCT04773834 International Registered Report Identifier (IRRID): RR2-10.2196/26750 UR - https://ai.jmir.org/2024/1/e47122 UR - http://dx.doi.org/10.2196/47122 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875579 ID - info:doi/10.2196/47122 ER - TY - JOUR AU - Harrison, M. Rachel AU - Lapteva, Ekaterina AU - Bibin, Anton PY - 2024/10/15 TI - Behavioral Nudging With Generative AI for Content Development in SMS Health Care Interventions: Case Study JO - JMIR AI SP - e52974 VL - 3 KW - generative artificial intelligence KW - generative AI KW - prompt engineering KW - large language models KW - GPT KW - content design KW - brief message interventions KW - mHealth KW - behavior change techniques KW - medication adherence KW - type 2 diabetes N2 - Background: Brief message interventions have demonstrated immense promise in health care, yet the development of these messages has suffered from a dearth of transparency and a scarcity of publicly accessible data sets. Moreover, the researcher-driven content creation process has raised resource allocation issues, necessitating a more efficient and transparent approach to content development.
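The feature-attribution step described above can be approximated without a SHAP dependency; below, scikit-learn's permutation importance serves as a simpler stand-in for Shapley values on a gradient-boosted model, with hypothetical feature names echoing the dDPP activities.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                               random_state=0)
    names = ["physical_activity_7d", "lessons_7d", "social_7d", "weigh_ins_7d",
             "any_activity_7d", "any_activity_30d", "msgs_opened", "days_enrolled"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    imp = permutation_importance(gbm, X_te, y_te, n_repeats=10, random_state=0)
    for i in np.argsort(imp.importances_mean)[::-1]:
        print(f"{names[i]:>20}: {imp.importances_mean[i]:.4f}")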
Objective: This research sets out to address the challenges of content development for SMS interventions by showcasing the use of generative artificial intelligence (AI) as a tool for content creation, transparently explaining the prompt design and content generation process, and providing the largest publicly available data set of brief messages and source code for future replication of our process. Methods: Leveraging the pretrained large language model GPT-3.5 (OpenAI), we generate a collection of messages in the context of medication adherence for individuals with type 2 diabetes using evidence-derived behavior change techniques identified in a prior systematic review. We create an attributed prompt designed to adhere to content (readability and tone) and SMS (character count and encoder type) standards while encouraging message variability to reflect differences in behavior change techniques. Results: We deliver the most extensive repository of brief messages for a singular health care intervention and the first library of messages crafted with generative AI. In total, our method yields a data set comprising 1150 messages, with 89.91% (n=1034) meeting character length requirements and 80.7% (n=928) meeting readability requirements. Furthermore, our analysis reveals that all messages exhibit diversity comparable to an existing publicly available data set created under the same theoretical framework for a similar setting. Conclusions: This research provides a novel approach to content creation for health care interventions using state-of-the-art generative AI tools. Future research is needed to assess the generated content for ethical, safety, and research standards, as well as to determine whether the intervention is successful in improving the target behaviors. UR - https://ai.jmir.org/2024/1/e52974 UR - http://dx.doi.org/10.2196/52974 UR - http://www.ncbi.nlm.nih.gov/pubmed/39405108 ID - info:doi/10.2196/52974 ER - TY - JOUR AU - Yan, Runze AU - Liu, Xinwen AU - Dutcher, M. Janine AU - Tumminia, J. Michael AU - Villalba, Daniella AU - Cohen, Sheldon AU - Creswell, D. John AU - Creswell, Kasey AU - Mankoff, Jennifer AU - Dey, K. Anind AU - Doryab, Afsaneh PY - 2024/4/18 TI - Identifying Links Between Productivity and Biobehavioral Rhythms Modeled From Multimodal Sensor Streams: Exploratory Quantitative Study JO - JMIR AI SP - e47194 VL - 3 KW - biobehavioral rhythms KW - productivity KW - computational modeling KW - mobile sensing KW - mobile phone N2 - Background: Biobehavioral rhythms are biological, behavioral, and psychosocial processes with repeating cycles. Abnormal rhythms have been linked to various health issues, such as sleep disorders, obesity, and depression. Objective: This study aims to identify links between productivity and biobehavioral rhythms modeled from passively collected mobile data streams. Methods: In this study, we used a multimodal mobile sensing data set consisting of data collected from smartphones and Fitbits worn by 188 college students over a continuous period of 16 weeks. The participants reported their self-evaluated daily productivity score (ranging from 0 to 4) during weeks 1, 6, and 15. To analyze the data, we modeled cyclic human behavior patterns based on multimodal mobile sensing data gathered during weeks 1, 6, 15, and the adjacent weeks. Our methodology resulted in the creation of a rhythm model for each sensor feature. 
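One standard way to obtain a per-feature rhythm model of the kind just mentioned is a least-squares cosinor fit; the sketch below is a generic method under that assumption, not necessarily the study's exact procedure, recovering the 24-hour mesor, amplitude, and phase of a synthetic hourly feature.

    import numpy as np

    rng = np.random.default_rng(2)
    t = np.arange(0, 24 * 7, 1.0)                       # one week of hourly samples
    y = 10 + 3 * np.cos(2 * np.pi * (t - 8) / 24) + rng.normal(0, 0.5, t.size)

    # Linear least squares for y = M + A*cos(wt) + B*sin(wt), with w = 2*pi/24.
    w = 2 * np.pi / 24
    design = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
    (mesor, a, b), *_ = np.linalg.lstsq(design, y, rcond=None)

    amplitude = np.hypot(a, b)
    phase_h = (np.arctan2(b, a) / w) % 24               # acrophase in hours
    print(f"mesor={mesor:.2f}, amplitude={amplitude:.2f}, phase={phase_h:.1f} h")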
Additionally, we developed a correlation-based approach to identify connections between rhythm stability and high or low productivity levels. Results: Differences exist in the biobehavioral rhythms of high- and low-productivity students, with those demonstrating greater rhythm stability also exhibiting higher productivity levels. Notably, a negative correlation (C=−0.16) was observed between productivity and the SE of the phase for the 24-hour period during week 1, with a higher SE indicative of lower rhythm stability. Conclusions: Modeling biobehavioral rhythms has the potential to quantify and forecast productivity. The findings have implications for building novel cyber-human systems that align with human beings' biobehavioral rhythms to improve health, well-being, and work performance. UR - https://ai.jmir.org/2024/1/e47194 UR - http://dx.doi.org/10.2196/47194 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/47194 ER - TY - JOUR AU - van Buchem, Meija Marieke AU - Kant, J. Ilse M. AU - King, Liza AU - Kazmaier, Jacqueline AU - Steyerberg, W. Ewout AU - Bauer, P. Martijn PY - 2024/9/23 TI - Impact of a Digital Scribe System on Clinical Documentation Time and Quality: Usability Study JO - JMIR AI SP - e60020 VL - 3 KW - large language model KW - large language models KW - LLM KW - LLMs KW - natural language processing KW - NLP KW - deep learning KW - pilot study KW - pilot studies KW - implementation KW - machine learning KW - ML KW - artificial intelligence KW - AI KW - algorithm KW - algorithms KW - model KW - models KW - analytics KW - practical model KW - practical models KW - automation KW - automate KW - documentation KW - documentation time KW - documentation quality KW - clinical documentation N2 - Background: Physicians spend approximately half of their time on administrative tasks, which is one of the leading causes of physician burnout and decreased work satisfaction. The implementation of natural language processing–assisted clinical documentation tools may provide a solution. Objective: This study investigates the impact of a commercially available Dutch digital scribe system on clinical documentation efficiency and quality. Methods: Medical students with experience in clinical practice and documentation (n=22) created a total of 430 summaries of mock consultations and recorded the time they spent on this task. The consultations were summarized using 3 methods: manual summaries, fully automated summaries, and automated summaries with manual editing. We then randomly reassigned the summaries and evaluated their quality using a modified version of the Physician Documentation Quality Instrument (PDQI-9). We compared the differences between the 3 methods in descriptive statistics, quantitative text metrics (word count and lexical diversity), the PDQI-9, Recall-Oriented Understudy for Gisting Evaluation scores, and BERTScore. Results: The median time for manual summarization was 202 seconds compared with 186 seconds for editing an automatic summary. Without editing, the automatic summaries attained a poorer PDQI-9 score than manual summaries (median PDQI-9 score 25 vs 31, P<.001, ANOVA test). Automatic summaries were found to have higher word counts but lower lexical diversity than manual summaries (P<.001, independent t test). The study revealed variable impacts on PDQI-9 scores and summarization time across individuals.
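The quantitative text metrics above (word count and lexical diversity) reduce to a few lines; here lexical diversity is computed as a simple type-token ratio, which is one common operationalization and an assumption on my part.

    def text_metrics(summary: str):
        words = summary.lower().split()
        word_count = len(words)
        lexical_diversity = len(set(words)) / word_count if word_count else 0.0
        return word_count, lexical_diversity

    manual = "patient reports mild dyspnea on exertion; no chest pain or fever"
    automatic = ("the patient reports dyspnea. the patient denies chest pain. "
                 "the patient denies fever. the patient will return.")
    for label, s in [("manual", manual), ("automatic", automatic)]:
        n, ttr = text_metrics(s)
        print(f"{label}: {n} words, type-token ratio {ttr:.2f}")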
Generally, students viewed the digital scribe system as a potentially useful tool, noting its ease of use and time-saving potential, though some criticized the summaries for their greater length and rigid structure. Conclusions: This study highlights the potential of digital scribes in improving clinical documentation processes by offering a first summary draft for physicians to edit, thereby reducing documentation time without compromising the quality of patient records. Furthermore, digital scribes may be more beneficial to some physicians than to others and could play a role in improving the reusability of clinical documentation. Future studies should focus on the impact and quality of such a system when used by physicians in clinical practice. UR - https://ai.jmir.org/2024/1/e60020 UR - http://dx.doi.org/10.2196/60020 UR - http://www.ncbi.nlm.nih.gov/pubmed/39312397 ID - info:doi/10.2196/60020 ER - TY - JOUR AU - Sebo, Paul PY - 2024/3/13 TI - What Is the Performance of ChatGPT in Determining the Gender of Individuals Based on Their First and Last Names? JO - JMIR AI SP - e53656 VL - 3 KW - accuracy KW - artificial intelligence KW - AI KW - ChatGPT KW - gender KW - gender detection tool KW - misclassification KW - name KW - performance KW - gender detection KW - gender detection tools KW - inequalities KW - language model KW - NamSor KW - Gender API KW - Switzerland KW - physicians KW - gender bias KW - disparities KW - gender disparities KW - gender gap UR - https://ai.jmir.org/2024/1/e53656 UR - http://dx.doi.org/10.2196/53656 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/53656 ER - TY - JOUR AU - Young, A. Joshua AU - Chang, Chin-Wen AU - Scales, W. Charles AU - Menon, V. Saurabh AU - Holy, E. Chantal AU - Blackie, Adrienne Caroline PY - 2024/3/12 TI - Machine Learning Methods Using Artificial Intelligence Deployed on Electronic Health Record Data for Identification and Referral of At-Risk Patients From Primary Care Physicians to Eye Care Specialists: Retrospective, Case-Controlled Study JO - JMIR AI SP - e48295 VL - 3 KW - decision support for health professionals KW - tools, programs and algorithms KW - electronic health record KW - primary care KW - artificial intelligence KW - AI KW - prediction accuracy KW - triaging KW - AI model KW - eye care KW - ophthalmic N2 - Background: Identification and referral of at-risk patients from primary care practitioners (PCPs) to eye care professionals remain a challenge. Approximately 1.9 million Americans suffer from vision loss as a result of undiagnosed or untreated ophthalmic conditions. In ophthalmology, artificial intelligence (AI) is used to predict glaucoma progression, recognize diabetic retinopathy (DR), and classify ocular tumors; however, AI has not yet been used to triage primary care patients for ophthalmology referral. Objective: This study aimed to build and compare machine learning (ML) methods, applicable to electronic health records (EHRs) of PCPs, capable of triaging patients for referral to eye care specialists. Methods: Using the Optum deidentified EHR data set, we exact-matched 743,039 patients with 5 leading vision conditions (age-related macular degeneration [AMD], visually significant cataract, DR, glaucoma, or ocular surface disease [OSD]) on age and gender to 743,039 controls without eye conditions.
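Exact matching of the kind just described is commonly a grouped join; the sketch below uses generic pandas on toy records to pair each case with one control sharing age and gender, under the simplifying assumption of 1:1 matching with possible reuse of controls across cases.

    import pandas as pd

    cases = pd.DataFrame({"id": [1, 2, 3],
                          "age": [67, 72, 67],
                          "gender": ["F", "M", "F"]})
    controls = pd.DataFrame({"id": [10, 11, 12, 13],
                             "age": [67, 67, 72, 80],
                             "gender": ["F", "F", "M", "M"]})

    # Shuffle controls, then keep one matching control per case; in this
    # simplified sketch a control may be reused by several cases.
    pool = controls.sample(frac=1, random_state=0)
    matched = (cases.merge(pool, on=["age", "gender"], suffixes=("_case", "_ctrl"))
                    .groupby("id_case", as_index=False).first())
    print(matched[["id_case", "id_ctrl", "age", "gender"]])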
Between 142 and 182 non-ophthalmic parameters per patient were input into 5 ML methods: generalized linear model, L1-regularized logistic regression, random forest, Extreme Gradient Boosting (XGBoost), and J48 decision tree. Model performance was compared for each pathology to select the most predictive algorithm. The area under the curve (AUC) was assessed for all algorithms for each outcome. Results: XGBoost demonstrated the best performance, showing, respectively, a prediction accuracy and an AUC of 78.6% (95% CI 78.3%-78.9%) and 0.878 for visually significant cataract, 77.4% (95% CI 76.7%-78.1%) and 0.858 for exudative AMD, 79.2% (95% CI 78.8%-79.6%) and 0.879 for nonexudative AMD, 72.2% (95% CI 69.9%-74.5%) and 0.803 for OSD requiring medication, 70.8% (95% CI 70.5%-71.1%) and 0.785 for glaucoma, 85.0% (95% CI 84.2%-85.8%) and 0.924 for type 1 nonproliferative diabetic retinopathy (NPDR), 82.2% (95% CI 80.4%-84.0%) and 0.911 for type 1 proliferative diabetic retinopathy (PDR), 81.3% (95% CI 81.0%-81.6%) and 0.891 for type 2 NPDR, and 82.1% (95% CI 81.3%-82.9%) and 0.900 for type 2 PDR. Conclusions: The 5 ML methods deployed successfully identified patients with elevated odds ratios (ORs) for ocular pathology, ranging from 2.4 (95% CI 2.4-2.5) for glaucoma to 5.7 (95% CI 5.0-6.4) for type 1 NPDR, with an average OR of 3.9, and were thus capable of patient triage. The application of these models could enable PCPs to better identify and triage patients at risk for treatable ophthalmic pathology. Early identification of patients with unrecognized sight-threatening conditions may lead to earlier treatment and a reduced economic burden. More importantly, such triage may improve patients' lives. UR - https://ai.jmir.org/2024/1/e48295 UR - http://dx.doi.org/10.2196/48295 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875582 ID - info:doi/10.2196/48295 ER - TY - JOUR AU - Goh, WB Wilson AU - Chia, YA Kendrick AU - Cheung, FK Max AU - Kee, M. Kalya AU - Lwin, O. May AU - Schulz, J. Peter AU - Chen, Minhu AU - Wu, Kaichun AU - Ng, SM Simon AU - Lui, Rashid AU - Ang, Leong Tiing AU - Yeoh, Guan Khay AU - Chiu, Han-mo AU - Wu, Deng-chyang AU - Sung, JY Joseph PY - 2024/3/7 TI - Risk Perception, Acceptance, and Trust of Using AI in Gastroenterology Practice in the Asia-Pacific Region: Web-Based Survey Study JO - JMIR AI SP - e50525 VL - 3 KW - artificial intelligence KW - delivery of health care KW - gastroenterology KW - acceptance KW - trust KW - adoption KW - survey KW - surveys KW - questionnaire KW - questionnaires KW - detect KW - detection KW - colonoscopy KW - gastroenterologist KW - gastroenterologists KW - internal medicine KW - polyp KW - polyps KW - surgeon KW - surgeons KW - surgery KW - surgical KW - colorectal N2 - Background: The use of artificial intelligence (AI) can revolutionize health care, but this raises risk concerns. It is therefore crucial to understand how clinicians trust and accept AI technology. Gastroenterology, by its nature of being an image-based and intervention-heavy specialty, is an area where AI-assisted diagnosis and management can be applied extensively. Objective: This study aimed to study how gastroenterologists or gastrointestinal surgeons accept and trust the use of AI in computer-aided detection (CADe), computer-aided characterization (CADx), and computer-aided intervention (CADi) of colorectal polyps in colonoscopy. Methods: We conducted a web-based questionnaire from November 2022 to January 2023, involving 5 countries or areas in the Asia-Pacific region.
The questionnaire included variables such as background and demography of users; intention to use AI; perceived risk; acceptance; and trust in AI-assisted detection, characterization, and intervention. We presented participants with 3 AI scenarios related to colonoscopy and the management of colorectal polyps. These scenarios reflect existing AI applications in colonoscopy, namely the detection of polyps (CADe), characterization of polyps (CADx), and AI-assisted polypectomy (CADi). Results: In total, 165 gastroenterologists and gastrointestinal surgeons responded to a web-based survey using the structured questionnaire designed by experts in medical communications. Participants had a mean age of 44 (SD 9.65) years, were mostly male (n=116, 70.3%), and mostly worked in publicly funded hospitals (n=110, 66.67%). Participants reported relatively high exposure to AI, with 111 (67.27%) reporting having used AI for clinical diagnosis or treatment of digestive diseases. Gastroenterologists are highly interested in using AI in diagnosis but show different levels of reservation about risk prediction and acceptance of AI. Most participants (n=112, 72.72%) also expressed interest in using AI in their future practice. CADe was accepted by 83.03% (n=137) of respondents, CADx was accepted by 78.79% (n=130), and CADi was accepted by 72.12% (n=119). CADe and CADx were trusted by 85.45% (n=141) of respondents and CADi was trusted by 72.12% (n=119). There were no application-specific differences in risk perceptions, but more experienced clinicians gave lower risk ratings. Conclusions: Gastroenterologists reported overall high acceptance and trust levels of using AI-assisted colonoscopy in the management of colorectal polyps. However, this level of trust depends on the application scenario. Moreover, the relationship among risk perception, acceptance, and trust in using AI in gastroenterology practice is not straightforward. UR - https://ai.jmir.org/2024/1/e50525 UR - http://dx.doi.org/10.2196/50525 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875591 ID - info:doi/10.2196/50525 ER - TY - JOUR AU - Odabashian, Roupen AU - Bastin, Donald AU - Jones, Georden AU - Manzoor, Maria AU - Tangestaniapour, Sina AU - Assad, Malke AU - Lakhani, Sunita AU - Odabashian, Maritsa AU - McGee, Sharon PY - 2024/1/12 TI - Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks JO - JMIR AI SP - e50442 VL - 3 KW - artificial intelligence KW - ChatGPT-3.5 KW - language model KW - medical oncology N2 - Background: ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain if the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow. Objective: This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making.
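The subcategory comparison this record goes on to report reduces to a contingency-table test; as a hedged preview, the sketch below runs a chi-square test on made-up (correct, incorrect) counts with SciPy.

    from scipy.stats import chi2_contingency

    # Hypothetical (correct, incorrect) counts per question subcategory.
    table = [[180, 140],   # diagnosis
             [210, 190],   # treatment
             [120, 100]]   # other
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, dof={dof}, P={p:.3f}")  # large P: no detectable difference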
Methods: We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer as defined by ASCO-SEP. Results: Overall, ChatGPT-3.5 achieved a score of 56.1% (583/1040) for the correct answers provided. The program demonstrated varying levels of accuracy across cancer types or disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program's performance across the predefined subcategories of diagnosis, treatment, and other (P=.16). Conclusions: This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance in ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings. UR - https://ai.jmir.org/2024/1/e50442 UR - http://dx.doi.org/10.2196/50442 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/50442 ER - TY - JOUR AU - Ewals, S. Lotte J. AU - Heesterbeek, J. Lynn J. AU - Yu, Bin AU - van der Wulp, Kasper AU - Mavroeidis, Dimitrios AU - Funk, Mathias AU - Snijders, P. Chris C. AU - Jacobs, Igor AU - Nederend, Joost AU - Pluyter, R. Jon PY - 2024/3/13 TI - The Impact of Expectation Management and Model Transparency on Radiologists' Trust and Utilization of AI Recommendations for Lung Nodule Assessment on Computed Tomography: Simulated Use Study JO - JMIR AI SP - e52211 VL - 3 KW - application KW - artificial intelligence KW - AI KW - computer-aided detection or diagnosis KW - CAD KW - design KW - human centered KW - human computer interaction KW - HCI KW - interaction KW - mental model KW - radiologists KW - trust N2 - Background: Many promising artificial intelligence (AI) and computer-aided detection and diagnosis systems have been developed, but few have been successfully integrated into clinical practice. This is partially owing to a lack of user-centered design of AI-based computer-aided detection or diagnosis (AI-CAD) systems. Objective: We aimed to assess the impact of different onboarding tutorials and levels of AI model explainability on radiologists' trust in AI and the use of AI recommendations in lung nodule assessment on computed tomography (CT) scans. Methods: In total, 20 radiologists from 7 Dutch medical centers performed lung nodule assessment on CT scans under different conditions in a simulated use study as part of a 2×2 repeated-measures quasi-experimental design. Two types of AI onboarding tutorials (reflective vs informative) and 2 levels of AI output (black box vs explainable) were designed.
The radiologists first received an onboarding tutorial that was either informative or reflective. Subsequently, each radiologist assessed 7 CT scans, first without AI recommendations. AI recommendations were then shown to the radiologist, who could adjust their initial assessment. Half of the participants received the recommendations via black box AI output and half received explainable AI output. Mental model and psychological trust were measured before onboarding, after onboarding, and after assessing the 7 CT scans. We recorded whether radiologists changed their assessment of found nodules, malignancy prediction, and follow-up advice for each CT assessment. In addition, we analyzed whether radiologists' trust in their assessments had changed based on the AI recommendations. Results: Both variations of onboarding tutorials resulted in a significantly improved mental model of the AI-CAD system (informative P=.01 and reflective P=.01). After using AI-CAD, psychological trust significantly decreased for the group with explainable AI output (P=.02). On the basis of the AI recommendations, radiologists changed the number of reported nodules in 27 of 140 assessments, malignancy prediction in 32 of 140 assessments, and follow-up advice in 12 of 140 assessments. The changes were mostly an increased number of reported nodules, a higher estimated probability of malignancy, and earlier follow-up. The radiologists' confidence in the nodules they found changed in 82 of 140 assessments, in their estimated probability of malignancy in 50 of 140 assessments, and in their follow-up advice in 28 of 140 assessments. These changes were predominantly increases in confidence. The number of changed assessments and radiologists' confidence did not significantly differ between the groups that received different onboarding tutorials and AI outputs. Conclusions: Onboarding tutorials help radiologists gain a better understanding of AI-CAD and facilitate the formation of a correct mental model. If AI explanations do not consistently substantiate the probability of malignancy across patient cases, radiologists' trust in the AI-CAD system can be impaired. Radiologists' confidence in their assessments was improved by using the AI recommendations. UR - https://ai.jmir.org/2024/1/e52211 UR - http://dx.doi.org/10.2196/52211 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875574 ID - info:doi/10.2196/52211 ER - TY - JOUR AU - Spina, Aidin AU - Andalib, Saman AU - Flores, Daniel AU - Vermani, Rishi AU - Halaseh, F. Faris AU - Nelson, M. Ariana PY - 2024/8/13 TI - Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study JO - JMIR AI SP - e54371 VL - 3 KW - generative language model KW - GLM KW - artificial intelligence KW - AI KW - low health literacy KW - LHL KW - readability KW - GLMs KW - language model KW - language models KW - health literacy KW - understandable KW - understandability KW - knowledge translation KW - comprehension KW - generative KW - NLP KW - natural language processing KW - reading level KW - reading levels KW - education KW - medical text KW - medical texts KW - medical information KW - health information N2 - Background: Although uncertainties exist regarding implementation, artificial intelligence-driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and improve low health literacy.
Objective: The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to a patient-specified input education level, which is crucial if they are to serve as tools for addressing low health literacy. Methods: Input templates related to 2 prevalent chronic diseases, type II diabetes and hypertension, were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess the success of a GLM (GPT-3.5 and GPT-4) in tailoring output writing, the readability of pre- and posttransformation outputs was quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL). Results: Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor's, respectively; FKGL mean scores were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 aligned with the prespecified education levels only at the bachelor's degree level. Conversely, GPT-4's FKRE mean scores were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL mean scores of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced outputs with statistically significant differences (FKRE: 6th grade P<.001; 8th grade P<.001; high school P<.001; bachelor's P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelor's P<.001) between mean FKRE and FKGL across input education levels. Conclusions: GLMs can change the structure and readability of medical text outputs according to input-specified education. However, GLMs categorize input education designation into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor's degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy. UR - https://ai.jmir.org/2024/1/e54371 UR - http://dx.doi.org/10.2196/54371 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54371 ER - TY - JOUR AU - Waheed, Atif Muhammad AU - Liu, Lu PY - 2024/4/17 TI - Perceptions of Family Physicians About Applying AI in Primary Health Care: Case Study From a Premier Health Care Organization JO - JMIR AI SP - e40781 VL - 3 KW - AI KW - artificial intelligence KW - perception KW - attitude KW - opinion KW - surveys and questionnaires KW - family physician KW - primary care KW - health care service provider KW - health care professional KW - ethical KW - AI decision-making KW - AI challenges N2 - Background: The COVID-19 pandemic has led to a rapid and previously unanticipated proliferation of artificial intelligence (AI). The use of AI in health care settings is increasing, as it proves to be a promising tool for transforming health care systems, improving operational and business processes, and efficiently simplifying health care tasks for family physicians and health care administrators.
Therefore, it is necessary to assess the perspective of family physicians on AI and its impact on their job roles. Objective: This study aims to determine the impact of AI on the management and practices of Qatar's Primary Health Care Corporation (PHCC) in improving health care tasks and service delivery. Furthermore, it seeks to evaluate the impact of AI on family physicians' job roles, including associated risks and ethical ramifications from their perspective. Methods: We conducted a cross-sectional survey and sent a web-based questionnaire survey link to 724 practicing family physicians at the PHCC. In total, we received 102 eligible responses. Results: Of the 102 respondents, 72 (70.6%) were men and 94 (92.2%) were aged between 35 and 54 years. In addition, 58 (56.9%) of the 102 respondents were consultants. Overall, 80 (78.4%) of the 102 respondents were aware of AI, with no difference between genders (P=.06) or age groups (P=.12). AI is perceived to play a positive role in improving health care practices at the PHCC (P<.001), managing health care tasks (P<.001), and positively impacting health care service delivery (P<.001). Family physicians also perceived that their clinical, administrative, and opportunistic health care management roles were positively influenced by AI (P<.001). Furthermore, perceptions of family physicians indicate that AI improves operational and human resource management (P<.001), does not undermine patient-physician relationships (P<.001), and is not considered superior to human physicians in the clinical judgment process (P<.001). However, its inclusion is believed to decrease patient satisfaction (P<.001). AI decision-making and accountability were recognized as ethical risks, along with data protection and confidentiality. Optimism regarding the use of AI for future medical decisions was low among family physicians. Conclusions: This study indicated a positive perception among family physicians regarding AI integration into primary care settings. AI demonstrates significant potential for enhancing health care task management and overall service delivery at the PHCC. It augments family physicians' roles without replacing them and proves beneficial for operational efficiency, human resource management, and public health during pandemics. While the implementation of AI is anticipated to bring benefits, the careful consideration of ethical, privacy, confidentiality, and patient-centric concerns is essential. These insights provide valuable guidance for the strategic integration of AI into health care systems, with a focus on maintaining high-quality patient care and addressing the multifaceted challenges that arise during this transformative process.
UR - https://ai.jmir.org/2024/1/e40781 UR - http://dx.doi.org/10.2196/40781 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875531 ID - info:doi/10.2196/40781 ER - TY - JOUR AU - Noda, Masao AU - Yoshimura, Hidekane AU - Okubo, Takuya AU - Koshu, Ryota AU - Uchiyama, Yuki AU - Nomura, Akihiro AU - Ito, Makoto AU - Takumi, Yutaka PY - 2024/5/31 TI - Feasibility of Multimodal Artificial Intelligence Using GPT-4 Vision for the Classification of Middle Ear Disease: Qualitative Study and Validation JO - JMIR AI SP - e58342 VL - 3 KW - artificial intelligence KW - deep learning KW - machine learning KW - generative AI KW - generative KW - tympanic membrane KW - middle ear disease KW - GPT4-Vision KW - otolaryngology KW - ears KW - ear KW - tympanic KW - vision KW - GPT KW - GPT4V KW - otoscopic KW - image KW - images KW - imaging KW - diagnosis KW - diagnoses KW - diagnostic KW - diagnostics KW - otitis KW - mobile phone N2 - Background: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis. Objective: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane. Methods: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients' data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images. Results: The multimodal AI approach achieved an accuracy of 82.1%, which was superior to that of certified pediatricians (70.6%) but trailed behind that of otolaryngologists (more than 95%). The model's disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making. Conclusions: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.
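The overall and disease-specific accuracy rates reported above are straightforward to reproduce from paired ground-truth and predicted labels. A minimal Python sketch, assuming hypothetical label lists and abbreviated disease codes (AOM, COM, cholesteatoma, OME) rather than the study's actual data:

from collections import defaultdict

def accuracy_by_class(y_true, y_pred):
    # Overall accuracy: fraction of predictions matching the ground truth
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Disease-specific accuracy: within each true diagnosis,
    # the fraction of cases the model labeled correctly
    totals, hits = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return overall, {c: hits[c] / totals[c] for c in totals}

# Hypothetical example with the study's 4 diagnostic categories
y_true = ["AOM", "AOM", "COM", "cholesteatoma", "OME", "OME"]
y_pred = ["AOM", "COM", "COM", "cholesteatoma", "OME", "AOM"]
print(accuracy_by_class(y_true, y_pred))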
UR - https://ai.jmir.org/2024/1/e58342 UR - http://dx.doi.org/10.2196/58342 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875669 ID - info:doi/10.2196/58342 ER - TY - JOUR AU - Hammoud, Mohammad AU - Douglas, Shahd AU - Darmach, Mohamad AU - Alawneh, Sara AU - Sanyal, Swapnendu AU - Kanbour, Youssef PY - 2024/4/29 TI - Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study JO - JMIR AI SP - e46875 VL - 3 KW - digital health KW - symptom checker KW - artificial intelligence KW - AI KW - patient-centered care KW - eHealth apps KW - eHealth N2 - Background: Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches. Objective: This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics. Methods: We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others. Results: The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences in the M1, F1-score, and NDCG results between the best-performing and worst-performing symptom checkers (ie, the ranges) were 65.3%, 39.2%, and 74.2%, respectively. The same was observed among the participating human physicians, whereby the M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively. Conclusions: The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially.
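Two of the metrics named above have compact definitions that a short sketch can make concrete. Below is one standard way to compute M1 (whether the main diagnosis tops the differential list) and NDCG with binary relevance in Python; this scoring convention is an illustrative assumption, not necessarily the paper's exact protocol:

import math

def m1(main_dx, differential):
    # M1: 1 if the vignette's main diagnosis is ranked first, else 0
    return int(bool(differential) and differential[0] == main_dx)

def ndcg(main_dx, differential):
    # Binary relevance: 1 where a list item matches the main diagnosis
    rels = [int(dx == main_dx) for dx in differential]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    idcg = 1.0  # ideal case: the single relevant item ranked first
    return dcg / idcg

differential = ["migraine", "tension headache", "sinusitis"]
print(m1("tension headache", differential))    # 0
print(ndcg("tension headache", differential))  # ~0.63, ie, 1/log2(3)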
UR - https://ai.jmir.org/2024/1/e46875 UR - http://dx.doi.org/10.2196/46875 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875676 ID - info:doi/10.2196/46875 ER - TY - JOUR AU - Khademi, Sedigh AU - Palmer, Christopher AU - Javed, Muhammad AU - Dimaguila, Luis Gerardo AU - Clothier, Hazel AU - Buttery, Jim AU - Black, Jim PY - 2024/8/30 TI - Near Real-Time Syndromic Surveillance of Emergency Department Triage Texts Using Natural Language Processing: Case Study in Febrile Convulsion Detection JO - JMIR AI SP - e54449 VL - 3 KW - vaccine safety KW - immunization KW - febrile convulsion KW - syndromic surveillance KW - emergency department KW - natural language processing N2 - Background: Collecting information on adverse events following immunization from as many sources as possible is critical for promptly identifying potential safety concerns and taking appropriate actions. Febrile convulsions are recognized as an important potential reaction to vaccination in children aged <6 years. Objective: The primary aim of this study was to evaluate the performance of natural language processing techniques and machine learning (ML) models for the rapid detection of febrile convulsion presentations in emergency departments (EDs), especially with respect to the minimum training data requirements to obtain optimum model performance. In addition, we examined the deployment requirements for an ML model to perform real-time monitoring of ED triage notes. Methods: We developed a pattern matching approach as a baseline and evaluated ML models for the classification of febrile convulsions in ED triage notes to determine both their training requirements and their effectiveness in detecting febrile convulsions. We measured their performance during training and then compared the deployed models' results on new incoming ED data. Results: Although the best standard neural networks had acceptable performance and were low-resource models, transformer-based models outperformed them substantially, justifying their ongoing deployment. Conclusions: Using natural language processing, particularly with the use of large language models, offers significant advantages in syndromic surveillance. Large language models make highly effective classifiers, and their text generation capacity can be used to enhance the quality and diversity of training data. UR - https://ai.jmir.org/2024/1/e54449 UR - http://dx.doi.org/10.2196/54449 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54449 ER - TY - JOUR AU - Lu, Qiuhao AU - Wen, Andrew AU - Nguyen, Thien AU - Liu, Hongfang PY - 2024/8/6 TI - Enhancing Clinical Relevance of Pretrained Language Models Through Integration of External Knowledge: Case Study on Cardiovascular Diagnosis From Electronic Health Records JO - JMIR AI SP - e56932 VL - 3 KW - knowledge integration KW - pre-trained language models KW - physician reasoning KW - adapters KW - physician KW - physicians KW - electronic health record KW - electronic health records KW - EHR KW - healthcare KW - heterogeneous KW - healthcare institution KW - healthcare institutions KW - proprietary information KW - healthcare data KW - methodology KW - text classification KW - data privacy KW - medical knowledge N2 - Background: Despite their growing use in health care, pretrained language models (PLMs) often lack clinical relevance due to insufficient domain expertise and poor interpretability.
A key strategy to overcome these challenges is integrating external knowledge into PLMs, enhancing their adaptability and clinical usefulness. Current biomedical knowledge graphs like UMLS (Unified Medical Language System), SNOMED CT (Systematized Medical Nomenclature for Medicine-Clinical Terminology), and HPO (Human Phenotype Ontology), while comprehensive, fail to effectively connect general biomedical knowledge with physician insights. There is an equally important need for a model that integrates diverse knowledge in a way that is both unified and compartmentalized. This approach not only addresses the heterogeneous nature of domain knowledge but also recognizes the unique data and knowledge repositories of individual health care institutions, necessitating careful and respectful management of proprietary information. Objective: This study aimed to enhance the clinical relevance and interpretability of PLMs by integrating external knowledge in a manner that respects the diversity and proprietary nature of health care data. We hypothesize that domain knowledge, when captured and distributed as stand-alone modules, can be effectively reintegrated into PLMs to significantly improve their adaptability and utility in clinical settings. Methods: We demonstrate that through adapters, small and lightweight neural networks that enable the integration of extra information without full model fine-tuning, we can inject diverse sources of external domain knowledge into language models and improve the overall performance with an increased level of interpretability. As a practical application of this methodology, we introduce a novel task, structured as a case study, that endeavors to capture physician knowledge in assigning cardiovascular diagnoses from clinical narratives, where we extract diagnosis-comment pairs from electronic health records (EHRs) and cast the problem as text classification. Results: The study demonstrates that integrating domain knowledge into PLMs significantly improves their performance. While improvements with ClinicalBERT are more modest, likely due to its pretraining on clinical texts, BERT (bidirectional encoder representations from transformers) equipped with knowledge adapters surprisingly matches or exceeds ClinicalBERT in several metrics. This underscores the effectiveness of knowledge adapters and highlights their potential in settings with strict data privacy constraints. This approach also increases the level of interpretability of these models in a clinical context, which enhances our ability to precisely identify and apply the most relevant domain knowledge for specific tasks, thereby optimizing the model's performance and tailoring it to meet specific clinical needs. Conclusions: This research provides a basis for creating health knowledge graphs infused with physician knowledge, marking a significant step forward for PLMs in health care. Notably, the model balances integrating knowledge both comprehensively and selectively, addressing the heterogeneous nature of medical knowledge and the privacy needs of health care institutions.
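The adapter approach described above is commonly realized as a small bottleneck network inserted into the layers of a frozen language model, so that only the adapter weights are trained. A minimal PyTorch sketch of one widely used design (down-projection, nonlinearity, up-projection, residual connection); the dimensions are illustrative assumptions, and the authors' exact architecture may differ:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Lightweight module injected into a frozen PLM layer; only these
    # parameters are updated, leaving the base model untouched.
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection preserves the PLM's original representation
        return hidden_states + self.up(self.act(self.down(hidden_states)))

x = torch.randn(2, 16, 768)  # (batch, tokens, hidden)
print(Adapter()(x).shape)    # torch.Size([2, 16, 768])

Because each knowledge source can be captured in its own adapter, modules can be swapped in or out per task, which is what makes the compartmentalized distribution of domain knowledge described above possible.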
UR - https://ai.jmir.org/2024/1/e56932 UR - http://dx.doi.org/10.2196/56932 UR - http://www.ncbi.nlm.nih.gov/pubmed/39106099 ID - info:doi/10.2196/56932 ER - TY - JOUR AU - Racine, Nicole AU - Chow, Cheryl AU - Hamwi, Lojain AU - Bucsea, Oana AU - Cheng, Carol AU - Du, Hang AU - Fabrizi, Lorenzo AU - Jasim, Sara AU - Johannsson, Lesley AU - Jones, Laura AU - Laudiano-Dray, Pureza Maria AU - Meek, Judith AU - Mistry, Neelum AU - Shah, Vibhuti AU - Stedman, Ian AU - Wang, Xiaogang AU - Riddell, Pillai Rebecca PY - 2024/2/9 TI - Health Care Professionals' and Parents' Perspectives on the Use of AI for Pain Monitoring in the Neonatal Intensive Care Unit: Multisite Qualitative Study JO - JMIR AI SP - e51535 VL - 3 KW - pain monitoring KW - pain management KW - preterm infant KW - neonate KW - pain KW - infant KW - infants KW - neonates KW - newborn KW - newborns KW - neonatal KW - baby KW - babies KW - pediatric KW - pediatrics KW - preterm KW - premature KW - assessment KW - intensive care KW - NICU KW - neonatal intensive care unit KW - HCP KW - health care professional KW - health care professionals KW - experience KW - experiences KW - attitude KW - attitudes KW - opinion KW - perception KW - perceptions KW - perspective KW - perspectives KW - acceptance KW - adoption KW - willingness KW - artificial intelligence KW - AI KW - digital health KW - health technology KW - health technologies KW - interview KW - interviews KW - parent KW - parents N2 - Background: The use of artificial intelligence (AI) for pain assessment has the potential to address historical challenges in infant pain assessment. There is a dearth of information on the perceived benefits and barriers to the implementation of AI for neonatal pain monitoring in the neonatal intensive care unit (NICU) from the perspective of health care professionals (HCPs) and parents. This qualitative analysis provides novel data obtained from 2 large tertiary care hospitals in Canada and the United Kingdom. Objective: The aim of the study is to explore the perspectives of HCPs and parents regarding the use of AI for pain assessment in the NICU. Methods: In total, 20 HCPs and 20 parents of preterm infants were recruited between February 2020 and October 2022 and consented to participate in interviews asking about AI use for pain assessment in the NICU, the potential benefits of the technology, and potential barriers to use. Results: The 40 participants included 20 HCPs (17 women and 3 men) with an average of 19.4 (SD 10.69) years of experience in the NICU and 20 parents (mean age 34.4, SD 5.42 years) of preterm infants who were on average 43 (SD 30.34) days old. Six themes from the perspective of HCPs were identified: regular use of technology in the NICU, concerns with regard to AI integration, the potential to improve patient care, requirements for implementation, AI as a tool for pain assessment, and ethical considerations. Seven parent themes included the potential for improved care, increased parental distress, support for parents regarding AI, the impact on parent engagement, the importance of human care, requirements for integration, and the desire for choice in its use. A consistent theme was the importance of AI as a tool to inform clinical decision-making and not replace it. Conclusions: HCPs and parents expressed generally positive sentiments about the potential use of AI for pain assessment in the NICU, with HCPs highlighting important ethical considerations.
This study identifies critical methodological and ethical perspectives from key stakeholders that should be noted by any team considering the creation and implementation of AI for pain monitoring in the NICU. UR - https://ai.jmir.org/2024/1/e51535 UR - http://dx.doi.org/10.2196/51535 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875686 ID - info:doi/10.2196/51535 ER - TY - JOUR AU - Prescott, R. Maximo AU - Yeager, Samantha AU - Ham, Lillian AU - Rivera Saldana, D. Carlos AU - Serrano, Vanessa AU - Narez, Joey AU - Paltin, Dafna AU - Delgado, Jorge AU - Moore, J. David AU - Montoya, Jessica PY - 2024/8/2 TI - Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses JO - JMIR AI SP - e54482 VL - 3 KW - GenAI KW - generative artificial intelligence KW - ChatGPT KW - Bard KW - qualitative research KW - thematic analysis KW - digital health N2 - Background: Qualitative methods are incredibly beneficial to the dissemination and implementation of new digital health interventions; however, these methods can be time intensive and slow down dissemination when timely knowledge from the data sources is needed in ever-changing health systems. Recent advancements in generative artificial intelligence (GenAI) and their underlying large language models (LLMs) may provide a promising opportunity to expedite the qualitative analysis of textual data, but their efficacy and reliability remain unknown. Objective: The primary objectives of our study were to evaluate the consistency in themes, reliability of coding, and time needed for inductive and deductive thematic analyses between GenAI (ie, ChatGPT and Bard) and human coders. Methods: The qualitative data for this study consisted of 40 brief SMS text message reminder prompts used in a digital health intervention for promoting antiretroviral medication adherence among people with HIV who use methamphetamine. Inductive and deductive thematic analyses of these SMS text messages were conducted by 2 independent teams of human coders. An independent human analyst conducted analyses following both approaches using ChatGPT and Bard. The consistency in themes (or the extent to which the themes were the same) and reliability (or agreement in coding of themes) between methods were compared. Results: The themes generated by GenAI (both ChatGPT and Bard) were consistent with 71% (5/7) of the themes identified by human analysts following inductive thematic analysis. The consistency in themes was lower between humans and GenAI following a deductive thematic analysis procedure (ChatGPT: 6/12, 50%; Bard: 7/12, 58%). The percentage agreement (or intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive: 31/66, 47%; ChatGPT, deductive: 22/59, 37%; Bard, inductive: 20/54, 37%; Bard, deductive: 21/58, 36%). In general, ChatGPT and Bard performed similarly to each other across both types of qualitative analyses in terms of consistency of themes (inductive: 6/6, 100%; deductive: 5/6, 83%) and reliability of coding (inductive: 23/62, 37%; deductive: 22/47, 47%). On average, GenAI required significantly less overall time than human coders when conducting qualitative analysis (mean 20, SD 3.5 min vs mean 567, SD 106.5 min).
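The percentage agreement figures above follow directly from two coders' theme assignments. A minimal Python sketch with hypothetical codings (the theme labels are invented for illustration):

def percent_agreement(coder_a, coder_b):
    # Fraction of units assigned the same theme by both coders
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

human = ["support", "reminder", "encouragement", "reminder"]
genai = ["support", "reminder", "support", "reminder"]
print(percent_agreement(human, genai))  # 0.75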
Conclusions: The promising consistency in the themes generated by human coders and GenAI suggests that these technologies hold promise in reducing the resource intensiveness of qualitative thematic analysis; however, the relatively lower reliability in coding between them suggests that hybrid approaches are necessary. Human coders appeared to be better than GenAI at identifying nuanced and interpretative themes. Future studies should consider how these powerful technologies can be best used in collaboration with human coders to improve the efficiency of qualitative research in hybrid approaches while also mitigating potential ethical risks that they may pose. UR - https://ai.jmir.org/2024/1/e54482 UR - http://dx.doi.org/10.2196/54482 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54482 ER - TY - JOUR AU - Wiepert, Daniela AU - Malin, A. Bradley AU - Duffy, R. Joseph AU - Utianski, L. Rene AU - Stricker, L. John AU - Jones, T. David AU - Botha, Hugo PY - 2024/3/15 TI - Reidentification of Participants in Shared Clinical Data Sets: Experimental Study JO - JMIR AI SP - e52054 VL - 3 KW - reidentification KW - privacy KW - adversarial attack KW - health care KW - speech disorders KW - voiceprint N2 - Background: Large curated data sets are required to leverage speech-based tools in health care. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (ie, voiceprints), sharing recordings raises privacy concerns. This is especially relevant when working with patient data protected under the Health Insurance Portability and Accountability Act. Objective: We aimed to determine the reidentification risk for speech recordings, without reference to demographics or metadata, in clinical data sets considering both the size of the search space (ie, the number of comparisons that must be considered when reidentifying) and the nature of the speech recording (ie, the type of speech task). Methods: Using a state-of-the-art speaker identification model, we modeled an adversarial attack scenario in which an adversary uses a large data set of identified speech (hereafter, the known set) to reidentify as many unknown speakers in a shared data set (hereafter, the unknown set) as possible. We first considered the effect of search space size by attempting reidentification with various sizes of known and unknown sets using VoxCeleb, a data set with recordings of natural, connected speech from >7000 healthy speakers. We then repeated these tests with different types of recordings in each set to examine whether the nature of a speech recording influences reidentification risk. For these tests, we used our clinical data set composed of recordings of elicited speech tasks from 941 speakers. Results: We found that the risk was inversely related to the number of comparisons an adversary must consider (ie, the search space), with a positive linear correlation between the number of false acceptances (FAs) and the number of comparisons (r=0.69; P<.001). The true acceptances (TAs) stayed relatively stable, and the ratio between FAs and TAs rose from 0.02 at 1 × 10^5 comparisons to 1.41 at 6 × 10^6 comparisons, with a near 1:1 ratio at the midpoint of 3 × 10^6 comparisons. In effect, risk was high for a small search space but dropped as the search space grew.
We also found that the nature of a speech recording influenced reidentification risk, with nonconnected speech (eg, vowel prolongation: FA/TA=98.5; alternating motion rate: FA/TA=8) being harder to identify than connected speech (eg, sentence repetition: FA/TA=0.54) in cross-task conditions. The inverse was mostly true in within-task conditions, with the FA/TA ratio for vowel prolongation and alternating motion rate dropping to 0.39 and 1.17, respectively. Conclusions: Our findings suggest that speaker identification models can be used to reidentify participants in specific circumstances, but in practice, the reidentification risk appears small. The variation in risk due to search space size and type of speech task provides actionable recommendations to further increase participant privacy and considerations for policy regarding public release of speech recordings. UR - https://ai.jmir.org/2024/1/e52054 UR - http://dx.doi.org/10.2196/52054 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875581 ID - info:doi/10.2196/52054 ER - TY - JOUR AU - Tao, Jinxin AU - Larson, G. Ramsey AU - Mintz, Yonatan AU - Alagoz, Oguzhan AU - Hoppe, K. Kara PY - 2024/9/13 TI - Predictive Modeling of Hypertension-Related Postpartum Readmission: Retrospective Cohort Analysis JO - JMIR AI SP - e48588 VL - 3 KW - pregnancy KW - postpartum KW - hypertension KW - preeclampsia KW - blood pressure KW - hospital readmission KW - clinical calculator KW - healthcare cost KW - cost KW - cohort analysis KW - utilization KW - resources KW - labor KW - women KW - risk KW - readmission KW - cohort KW - hospital KW - statistical model KW - retrospective cohort study KW - predict N2 - Background: Hypertension is the most common reason for postpartum hospital readmission. Better prediction of postpartum readmission will improve the health care of patients. These models will allow better use of resources and decrease health care costs. Objective: This study aimed to evaluate clinical predictors of postpartum readmission for hypertension using a novel machine learning (ML) model that can effectively predict readmissions and balance treatment costs. We examined whether blood pressure and other measures during labor, not just postpartum measures, would be important predictors of readmission. Methods: We conducted a retrospective cohort study from the PeriData website data set from a single midwestern academic center of all women who delivered from 2009 to 2018. This study consists of 2 data sets: 1 spanning the years 2009-2015 and the other spanning the years 2016-2018. A total of 47 clinical and demographic variables were collected, including blood pressure measurements during labor and post partum, laboratory values, and medication administration. Hospital readmissions were verified by patient chart review. In total, 32,645 patients were considered in the study. For our analysis, we trained several cost-sensitive ML models to predict the primary outcome of hypertension-related postpartum readmission within 42 days post partum. Models were evaluated using cross-validation and on independent data sets (models trained on data from 2009 to 2015 were validated on the data from 2016 to 2018). To assess clinical viability, a cost analysis of the models was performed to see how their recommendations could affect treatment costs. Results: Of the 32,645 patients included in the study, 170 were readmitted due to a hypertension-related diagnosis.
A cost-sensitive random forest method was found to be the most effective, with a balanced accuracy of 76.61% for predicting readmission. Using a feature importance and area under the curve analysis, the most important variables for predicting readmission were blood pressures in labor and 24-48 hours post partum, which increased the area under the curve of the model from 0.69 (SD 0.06) to 0.81 (SD 0.06; P=.05). Cost analysis showed that the resulting model could have reduced associated readmission costs by US $6000 against comparable models with similar F1-score and balanced accuracy. The most effective model was then implemented as a risk calculator that is publicly available. The code for this calculator and the model is also publicly available in a GitHub repository. Conclusions: Blood pressure measurements during labor through 48 hours post partum can be combined with other variables to predict women at risk for postpartum readmission. Using ML techniques in conjunction with these data has the potential to improve health outcomes and reduce associated costs. The use of the calculator can greatly assist clinicians in providing care to patients and improve medical decision-making. UR - https://ai.jmir.org/2024/1/e48588 UR - http://dx.doi.org/10.2196/48588 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/48588 ER - TY - JOUR AU - Siegel, Nicole Leeann AU - Wiseman, P. Kara AU - Budenz, Alex AU - Prutzman, Yvonne PY - 2024/5/22 TI - Identifying Patterns of Smoking Cessation App Feature Use That Predict Successful Quitting: Secondary Analysis of Experimental Data Leveraging Machine Learning JO - JMIR AI SP - e51756 VL - 3 KW - smartphone apps KW - machine learning KW - artificial intelligence KW - smoking cessation KW - mHealth KW - mobile health KW - app KW - apps KW - applications KW - application feature KW - features KW - smoking KW - smoke KW - smoker KW - smokers KW - cessation KW - quit KW - quitting KW - algorithm KW - algorithms KW - mobile phone N2 - Background: Leveraging free smartphone apps can help expand the availability and use of evidence-based smoking cessation interventions. However, there is a need for additional research investigating how the use of different features within such apps impacts their effectiveness. Objective: We used observational data collected from an experiment of a publicly available smoking cessation app to develop supervised machine learning (SML) algorithms intended to distinguish the app features that promote successful smoking cessation. We then assessed the extent to which patterns of app feature use accounted for variance in cessation that could not be explained by other known predictors of cessation (eg, tobacco use behaviors). Methods: Data came from an experiment (ClinicalTrials.gov NCT04623736) testing the impacts of incentivizing ecological momentary assessments within the National Cancer Institute's quitSTART app. Participants' (N=133) app activity, including every action they took within the app and its corresponding time stamp, was recorded. Demographic and baseline tobacco use characteristics were measured at the start of the experiment, and short-term smoking cessation (7-day point prevalence abstinence) was measured at 4 weeks after baseline. Logistic regression SML modeling was used to estimate participants' probability of cessation from 28 variables reflecting participants' use of different app features, assigned experimental conditions, and phone type (iPhone [Apple Inc] or Android [Google]).
The SML model was first fit in a training set (n=100), and then its accuracy was assessed in a held-aside test set (n=33). Within the test set, a likelihood ratio test (n=30) assessed whether adding individuals' SML-predicted probabilities of cessation to a logistic regression model that included demographic and tobacco use (eg, polyuse) variables explained additional variance in 4-week cessation. Results: The SML model's sensitivity (0.67) and specificity (0.67) in the held-aside test set indicated that individuals' patterns of using different app features predicted cessation with reasonable accuracy. The likelihood ratio test showed that the logistic regression, which included the SML model-predicted probabilities, was statistically equivalent to the model that only included the demographic and tobacco use variables (P=.16). Conclusions: Harnessing user data through SML could help determine the features of smoking cessation apps that are most useful. This methodological approach could be applied in future research focusing on smoking cessation app features to inform the development and improvement of smoking cessation apps. Trial Registration: ClinicalTrials.gov NCT04623736; https://clinicaltrials.gov/study/NCT04623736 UR - https://ai.jmir.org/2024/1/e51756 UR - http://dx.doi.org/10.2196/51756 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875564 ID - info:doi/10.2196/51756 ER - TY - JOUR AU - Huang, Jingyi AU - Guo, Peiqi AU - Zhang, Sheng AU - Ji, Mengmeng AU - An, Ruopeng PY - 2024/7/25 TI - Use of Deep Neural Networks to Predict Obesity With Short Audio Recordings: Development and Usability Study JO - JMIR AI SP - e54885 VL - 3 KW - obesity KW - obese KW - overweight KW - voice KW - vocal KW - vocal cord KW - vocal cords KW - voice-based KW - machine learning KW - ML KW - artificial intelligence KW - AI KW - algorithm KW - algorithms KW - predictive model KW - predictive models KW - predictive analytics KW - predictive system KW - practical model KW - practical models KW - early warning KW - early detection KW - deep neural network KW - deep neural networks KW - DNN KW - artificial neural network KW - artificial neural networks KW - deep learning N2 - Background: The escalating global prevalence of obesity has necessitated the exploration of novel diagnostic approaches. Recent scientific inquiries have indicated potential alterations in voice characteristics associated with obesity, suggesting the feasibility of using voice as a noninvasive biomarker for obesity detection. Objective: This study aims to use deep neural networks to predict obesity status through the analysis of short audio recordings, investigating the relationship between vocal characteristics and obesity. Methods: A pilot study was conducted with 696 participants, using self-reported BMI to classify individuals into obesity and nonobesity groups. Audio recordings of participants reading a short script were transformed into spectrograms and analyzed using an adapted YOLOv8 model (Ultralytics). The model performance was evaluated using accuracy, recall, precision, and F1-scores. Results: The adapted YOLOv8 model demonstrated a global accuracy of 0.70 and a macro F1-score of 0.65. It was more effective in identifying nonobesity (F1-score of 0.77) than obesity (F1-score of 0.53). This moderate level of accuracy highlights the potential and challenges in using vocal biomarkers for obesity detection.
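The spectrogram preprocessing described in the methods above can be approximated with librosa; the file name, sample rate, and mel-band count here are assumptions, and the study's exact pipeline may differ:

import numpy as np
import librosa

# Load a short recording and convert it to a log-mel spectrogram,
# the image-like input a vision model such as YOLOv8 can consume.
y, sr = librosa.load("speech_sample.wav", sr=16000)  # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128, number_of_time_frames)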
Conclusions: While the study shows promise in the field of voice-based medical diagnostics for obesity, it faces limitations such as reliance on self-reported BMI data and a small, homogeneous sample size. These factors, coupled with variability in recording quality, necessitate further research with more robust methodologies and diverse samples to enhance the validity of this novel approach. The findings lay a foundational step for future investigations in using voice as a noninvasive biomarker for obesity detection. UR - https://ai.jmir.org/2024/1/e54885 UR - http://dx.doi.org/10.2196/54885 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/54885 ER - TY - JOUR AU - Mostafapour, Mehrnaz AU - Fortier, H. Jacqueline AU - Pacheco, Karen AU - Murray, Heather AU - Garber, Gary PY - 2024/8/19 TI - Evaluating Literature Reviews Conducted by Humans Versus ChatGPT: Comparative Study JO - JMIR AI SP - e56537 VL - 3 KW - OpenAIs KW - chatGPT KW - AI vs. human KW - literature search KW - Chat GPT performance evaluation KW - large language models KW - artificial intelligence KW - AI KW - algorithm KW - algorithms KW - predictive model KW - predictive models KW - literature review KW - literature reviews N2 - Background: With the rapid evolution of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT-4 (OpenAI), there is an increasing interest in their potential to assist in scholarly tasks, including conducting literature reviews. However, the efficacy of AI-generated reviews compared with traditional human-led approaches remains underexplored. Objective: This study aims to compare the quality of literature reviews conducted by the ChatGPT-4 model with those conducted by human researchers, focusing on the relational dynamics between physicians and patients. Methods: We included 2 literature reviews in the study on the same topic, namely, exploring factors affecting relational dynamics between physicians and patients in medicolegal contexts. One review used GPT-4, last updated in September 2021, and the other was conducted by human researchers. The human review involved a comprehensive literature search using medical subject headings and keywords in Ovid MEDLINE, followed by a thematic analysis of the literature to synthesize information from selected articles. The AI-generated review used a new prompt engineering approach, with iterative and sequential prompts to generate results. Comparative analysis was based on qualitative measures such as accuracy, response time, consistency, breadth and depth of knowledge, contextual understanding, and transparency. Results: GPT-4 produced an extensive list of relational factors rapidly. The AI model demonstrated an impressive breadth of knowledge but exhibited limitations in in-depth and contextual understanding, occasionally producing irrelevant or incorrect information. In comparison, human researchers provided a more nuanced and contextually relevant review. While GPT-4 showed advantages in response time and breadth of knowledge, human-led reviews excelled in accuracy, depth of knowledge, and contextual understanding. Conclusions: The study suggests that GPT-4, with structured prompt engineering, can be a valuable tool for conducting preliminary literature reviews by providing a broad overview of topics quickly.
However, its limitations necessitate careful expert evaluation and refinement, making it an assistant rather than a substitute for human expertise in comprehensive literature reviews. Moreover, this research highlights the potential and limitations of using AI tools like GPT-4 in academic research, particularly in the fields of health services and medical research. It underscores the necessity of combining AI's rapid information retrieval capabilities with human expertise for more accurate and contextually rich scholarly outputs. UR - https://ai.jmir.org/2024/1/e56537 UR - http://dx.doi.org/10.2196/56537 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/56537 ER - TY - JOUR AU - Agmon, Shunit AU - Singer, Uriel AU - Radinsky, Kira PY - 2024/10/2 TI - Leveraging Temporal Trends for Training Contextual Word Embeddings to Address Bias in Biomedical Applications: Development Study JO - JMIR AI SP - e49546 VL - 3 KW - natural language processing KW - NLP KW - BERT KW - word embeddings KW - statistical models KW - bias KW - algorithms KW - gender N2 - Background: Women have been underrepresented in clinical trials for many years. Machine-learning models trained on clinical trial abstracts may capture and amplify biases in the data. Specifically, word embeddings are models that enable representing words as vectors and are the building block of most natural language processing systems. If word embeddings are trained on clinical trial abstracts, predictive models that use the embeddings will exhibit gender performance gaps. Objective: We aim to capture temporal trends in clinical trials through temporal distribution matching on contextual word embeddings (specifically, BERT) and explore its effect on the bias manifested in downstream tasks. Methods: We present TeDi-BERT, a method to harness the temporal trend of increasing women's inclusion in clinical trials to train contextual word embeddings. We implement temporal distribution matching through an adversarial classifier, trying to distinguish old from new clinical trial abstracts based on their embeddings. The temporal distribution matching acts as a form of domain adaptation from older to more recent clinical trials. We evaluate our model on 2 clinical tasks: prediction of unplanned readmission to the intensive care unit and hospital length of stay prediction. We also conduct an algorithmic analysis of the proposed method. Results: In readmission prediction, TeDi-BERT achieved an area under the receiver operating characteristic curve of 0.64 for female patients versus the baseline of 0.62 (P<.001), and 0.66 for male patients versus the baseline of 0.64 (P<.001). In the length of stay regression, TeDi-BERT achieved a mean absolute error of 4.56 (95% CI 4.44-4.68) for female patients versus 4.62 (95% CI 4.50-4.74, P<.001) and 4.54 (95% CI 4.44-4.65) for male patients versus 4.6 (95% CI 4.50-4.71, P<.001). Conclusions: In both clinical tasks, TeDi-BERT improved performance for female patients, as expected; but it also improved performance for male patients. Our results show that accuracy for one gender does not need to be exchanged for bias reduction, but rather that good science improves clinical results for all. Contextual word embedding models trained to capture temporal trends can help mitigate the effects of bias that changes over time in the training data.
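Adversarial distribution matching of the kind described above is often implemented with a gradient reversal layer: the period classifier trains normally, while reversed gradients push the encoder toward period-invariant embeddings. A minimal PyTorch sketch of this generic domain-adversarial pattern (not the authors' code; dimensions and the lambda weight are illustrative assumptions):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the encoder
        return -ctx.lambd * grad_output, None

# Adversary tries to tell old from new abstracts by their embeddings
adversary = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))
embeddings = torch.randn(8, 768, requires_grad=True)  # stand-in for BERT output
logits = adversary(GradReverse.apply(embeddings, 1.0))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()  # encoder now receives gradients that discourage separability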
UR - https://ai.jmir.org/2024/1/e49546 UR - http://dx.doi.org/10.2196/49546 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/49546 ER - TY - JOUR AU - Deng, Tianjie AU - Urbaczewski, Andrew AU - Lee, Jin Young AU - Barman-Adhikari, Anamika AU - Dewri, Rinku PY - 2024/10/17 TI - Identifying Marijuana Use Behaviors Among Youth Experiencing Homelessness Using a Machine Learning-Based Framework: Development and Evaluation Study JO - JMIR AI SP - e53488 VL - 3 KW - machine learning KW - youth experiencing homelessness KW - natural language processing KW - infodemiology KW - social good KW - digital intervention N2 - Background: Youth experiencing homelessness face substance use problems disproportionately compared to other youth. A study found that 69% of youth experiencing homelessness meet the criteria for dependence on at least 1 substance, compared to 1.8% for all US adolescents. In addition, they experience major structural and social inequalities, which further undermine their ability to receive the care they need. Objective: The goal of this study was to develop a machine learning-based framework that uses the social media content (posts and interactions) of youth experiencing homelessness to predict their substance use behaviors (ie, the probability of using marijuana). With this framework, social workers and care providers can identify and reach out to youth experiencing homelessness who are at a higher risk of substance use. Methods: We recruited 133 young people experiencing homelessness at a nonprofit organization located in a city in the western United States. After obtaining their consent, we collected the participants' social media conversations for the past year before they were recruited, and we asked the participants to complete a survey on their demographic information, health conditions, sexual behaviors, and substance use behaviors. Building on the social sharing of emotions theory and social support theory, we identified important features that can potentially predict substance use. Then, we used natural language processing techniques to extract such features from social media conversations and reactions and built a series of machine learning models to predict participants' marijuana use. Results: We evaluated our models based on their predictive performance as well as their conformity with measures of fairness. Without predictive features from survey information, which may introduce sex and racial biases, our machine learning models can reach an area under the curve of 0.72 and an accuracy of 0.81 using only social media data when predicting marijuana use. We also evaluated the false-positive rate for each sex and age segment. Conclusions: We showed that textual interactions among youth experiencing homelessness and their friends on social media can serve as a powerful resource to predict their substance use. The framework we developed allows care providers to allocate resources efficiently to youth experiencing homelessness in the greatest need while incurring minimal overhead. It can be extended to analyze and predict other health-related behaviors and conditions observed in this vulnerable community.
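Per-segment false-positive rates like those evaluated above take only a few lines to compute; the group labels and predictions below are hypothetical stand-ins for the study's data:

def false_positive_rate(y_true, y_pred):
    # FPR = FP / (FP + TN), computed over actual negatives only
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else float("nan")

def fpr_by_group(y_true, y_pred, groups):
    out = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        out[g] = false_positive_rate([y_true[i] for i in idx],
                                     [y_pred[i] for i in idx])
    return out

y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1]
groups = ["F", "F", "F", "M", "M", "M"]
print(fpr_by_group(y_true, y_pred, groups))  # {'F': 0.5, 'M': 0.5}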
TY - JOUR AU - O'Malley, Andrew AU - Veenhuizen, Miriam AU - Ahmed, Ayla PY - 2024/11/27 TI - Ensuring Appropriate Representation in Artificial Intelligence–Generated Medical Imagery: Protocol for a Methodological Approach to Address Skin Tone Bias JO - JMIR AI SP - e58275 VL - 3 KW - artificial intelligence KW - generative AI KW - AI images KW - dermatology KW - anatomy KW - medical education KW - medical imaging KW - skin KW - skin tone KW - United States KW - educational material KW - psoriasis KW - digital imagery N2 - Background: In medical education, particularly in anatomy and dermatology, generative artificial intelligence (AI) can be used to create customized illustrations. However, the underrepresentation of darker skin tones in medical textbooks and elsewhere, which serve as training data for AI, poses a significant challenge in ensuring diverse and inclusive educational materials. Objective: This study aims to evaluate the extent of skin tone diversity in AI-generated medical images and to test whether the representation of skin tones can be improved by modifying AI prompts to better reflect the demographic makeup of the US population. Methods: In total, 2 standard AI models (Dall-E [OpenAI] and Midjourney [Midjourney Inc]) each generated 100 images of people with psoriasis. In addition, a custom model was developed that incorporated a prompt injection aimed at "forcing" the AI (Dall-E 3) to reflect the skin tone distribution of the US population according to the 2012 American National Election Survey. This custom model generated another set of 100 images. The skin tones in these images were assessed by 3 researchers using the New Immigrant Survey skin tone scale, with the median value representing each image. A chi-square goodness-of-fit analysis compared the skin tone distributions from each set of images to that of the US population. Results: The standard AI models (Dall-E and Midjourney) demonstrated a significant difference between the expected skin tones of the US population and the observed tones in the generated images (P<.001). Both standard AI models overrepresented lighter skin. Conversely, the custom model with the modified prompt yielded a distribution of skin tones that closely matched the expected demographic representation, showing no significant difference (P=.04). Conclusions: This study reveals a notable bias in AI-generated medical images, predominantly underrepresenting darker skin tones. This bias can be effectively addressed by modifying AI prompts to incorporate real-life demographic distributions. The findings emphasize the need for conscious efforts in AI development to ensure diverse and representative outputs, particularly in educational and medical contexts. Users of generative AI tools should be aware that these biases exist and that similar tendencies may also exist in other types of generative AI (eg, large language models) and in other characteristics (eg, sex, gender, culture, and ethnicity). Injecting demographic data into AI prompts may effectively counteract these biases, ensuring a more accurate representation of the general population. UR - https://ai.jmir.org/2024/1/e58275 UR - http://dx.doi.org/10.2196/58275 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/58275 ER -
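The goodness-of-fit step in the record above takes only a few lines to reproduce. This sketch assumes invented skin tone bin counts and population shares; the study's actual figures are in the paper, not this abstract.

```python
# Chi-square goodness of fit: observed skin tone counts in 100 generated
# images versus an expected US population distribution. All bin proportions
# below are invented placeholders, not the survey's actual figures.
from scipy.stats import chisquare

observed = [62, 20, 10, 5, 3]                      # counts per skin tone bin (hypothetical)
population_share = [0.55, 0.15, 0.12, 0.10, 0.08]  # hypothetical US shares per bin
expected = [p * sum(observed) for p in population_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, P = {p_value:.4f}")  # small P: distribution differs from population
```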
TY - JOUR AU - Dahu, M. Butros AU - Khan, Solaiman AU - Toubal, Eddine Imad AU - Alshehri, Mariam AU - Martinez-Villar, I. Carlos AU - Ogundele, B. Olabode AU - Sheets, R. Lincoln AU - Scott, J. Grant PY - 2024/12/17 TI - Geospatial Modeling of Deep Neural Visual Features for Predicting Obesity Prevalence in Missouri: Quantitative Study JO - JMIR AI SP - e64362 VL - 3 KW - geospatial modeling KW - deep convolutional neural network KW - DCNN KW - Residual Network-50 KW - ResNet-50 KW - satellite imagery KW - Moran I KW - local indicators of spatial association KW - LISA KW - spatial lag model KW - obesity rate KW - artificial intelligence KW - AI N2 - Background: The global obesity epidemic demands innovative approaches to understand its complex environmental and social determinants. Spatial technologies, such as geographic information systems, remote sensing, and spatial machine learning, offer new insights into this health issue. This study uses deep learning and spatial modeling to predict obesity rates for census tracts in Missouri. Objective: This study aims to develop a scalable method for predicting obesity prevalence using deep convolutional neural networks applied to satellite imagery and geospatial analysis, focusing on 1052 census tracts in Missouri. Methods: Our analysis followed 3 steps. First, Sentinel-2 satellite images were processed using the Residual Network-50 model to extract environmental features from 63,592 image chips (224×224 pixels). Second, these features were merged with obesity rate data from the Centers for Disease Control and Prevention for Missouri census tracts. Third, a spatial lag model was used to predict obesity rates and analyze the association between deep neural visual features and obesity prevalence. Spatial autocorrelation was used to identify clusters of obesity rates. Results: Substantial spatial clustering of obesity rates was found across Missouri, with a Moran I value of 0.68, indicating similar obesity rates among neighboring census tracts. The spatial lag model demonstrated strong predictive performance, with an R2 of 0.93 and a spatial pseudo R2 of 0.92, explaining 93% of the variation in obesity rates. A local indicators of spatial association (LISA) analysis revealed regions with distinct high and low clusters of obesity, which were visualized through choropleth maps. Conclusions: This study highlights the effectiveness of integrating deep convolutional neural networks and spatial modeling to predict obesity prevalence based on environmental features from satellite imagery. The model's high accuracy and ability to capture spatial patterns offer valuable insights for public health interventions. Future work should expand the geographical scope and include socioeconomic data to further refine the model for broader applications in obesity research. UR - https://ai.jmir.org/2024/1/e64362 UR - http://dx.doi.org/10.2196/64362 UR - http://www.ncbi.nlm.nih.gov/pubmed/39688897 ID - info:doi/10.2196/64362 ER -
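For readers unfamiliar with the spatial methods named in the preceding record, this PySAL-based sketch shows the usual shape of the workflow: Moran I for tract-level autocorrelation, then a maximum likelihood spatial lag model. The file name and column names are hypothetical, and this is a generic pattern rather than the authors' pipeline.

```python
# Sketch of a spatial workflow: Moran I for spatial autocorrelation of
# tract-level obesity rates, then a spatial lag model regressing rates on
# image-derived features. Assumes a GeoDataFrame with hypothetical columns.
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran
from spreg import ML_Lag

tracts = gpd.read_file("mo_tracts_with_features.gpkg")  # hypothetical file
w = Queen.from_dataframe(tracts)                        # neighbors share a border
w.transform = "r"                                       # row-standardize weights

y = tracts[["obesity_rate"]].values                     # hypothetical column
moran = Moran(y.flatten(), w)
print("Moran I:", moran.I, "p =", moran.p_sim)          # clustering if I >> 0

# Deep neural visual features, one column per feature (hypothetical names).
X = tracts[[c for c in tracts.columns if c.startswith("feat_")]].values
model = ML_Lag(y, X, w=w, name_y="obesity_rate")        # maximum likelihood spatial lag
print(model.summary)
```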
TY - JOUR AU - Jordan, Alexis AU - Park, Albert PY - 2024/6/3 TI - Understanding the Long Haulers of COVID-19: Mixed Methods Analysis of YouTube Content JO - JMIR AI SP - e54501 VL - 3 KW - long haulers KW - post–COVID-19 condition KW - COVID-19 KW - YouTube KW - topic modeling KW - natural language processing N2 - Background: The COVID-19 pandemic had a devastating global impact. In the United States, there were >98 million COVID-19 cases and >1 million resulting deaths. One consequence of COVID-19 infection has been post–COVID-19 condition (PCC). People with this syndrome, colloquially called long haulers, experience symptoms that impact their quality of life. The root cause of PCC and effective treatments remain unknown. Many long haulers have turned to social media for support and guidance. Objective: In this study, we sought to gain a better understanding of the long hauler experience by investigating what has been discussed and how information about long haulers is perceived on social media. We specifically investigated the following: (1) the range of symptoms that are discussed, (2) the ways in which information about long haulers is perceived, (3) informational and emotional support that is available to long haulers, and (4) discourse between viewers and creators. We selected YouTube as our data source due to its popularity and broad audience. Methods: We systematically gathered data from 3 different types of content creators: medical sources, news sources, and long haulers. To computationally understand the video content and viewers' reactions, we used Biterm, a topic modeling algorithm created specifically for short texts, to analyze snippets of video transcripts and all top-level comments from the comment section. To triangulate our findings about viewers' reactions, we used the Valence Aware Dictionary and Sentiment Reasoner to conduct sentiment analysis on comments from each type of content creator. We grouped the comments into positive and negative categories and generated topics for these groups using Biterm. We then manually grouped resulting topics into broader themes for the purpose of analysis. Results: We organized the resulting topics into 28 themes across all sources. Examples of medical source transcript themes were Explanations in layman's terms and Biological explanations. Examples of news source transcript themes were Negative experiences and Handling the long haul. The 2 long hauler transcript themes were Taking treatments into own hands and Changes to daily life. News sources received a greater share of negative comments. A few themes of these negative comments included Misinformation and disinformation and Issues with the health care system. Similarly, negative long hauler comments were organized into several themes, including Disillusionment with the health care system and Requiring more visibility. In contrast, positive medical source comments captured themes such as Appreciation of helpful content and Exchange of helpful information. In addition to this theme, one positive theme found in long hauler comments was Community building. Conclusions: The results of this study could help public health agencies, policy makers, organizations, and health researchers understand symptomatology and experiences related to PCC. They could also help these agencies develop their communication strategy concerning PCC. UR - https://ai.jmir.org/2024/1/e54501 UR - http://dx.doi.org/10.2196/54501 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875666 ID - info:doi/10.2196/54501 ER -
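As a small illustration of the sentiment step described above, the following uses the VADER analyzer to split comments into positive and negative groups before topic modeling. The example comments are invented, and the ±0.05 cutoffs are VADER's conventional defaults rather than thresholds stated in the abstract.

```python
# VADER-based grouping of YouTube comments into positive and negative sets,
# which the abstract describes topic-modeling separately with Biterm.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
comments = [  # invented examples
    "Thank you, this explained my symptoms better than my doctor did.",
    "This is misinformation and it is hurting people.",
]

positive, negative = [], []
for text in comments:
    compound = analyzer.polarity_scores(text)["compound"]  # -1 (negative) to +1 (positive)
    if compound >= 0.05:
        positive.append(text)
    elif compound <= -0.05:
        negative.append(text)

print(len(positive), "positive,", len(negative), "negative")
```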
TY - JOUR AU - Weidener, Lukas AU - Fischer, Michael PY - 2024/1/12 TI - Role of Ethics in Developing AI-Based Applications in Medicine: Insights From Expert Interviews and Discussion of Implications JO - JMIR AI SP - e51204 VL - 3 KW - artificial intelligence KW - AI KW - medicine KW - ethics KW - expert interviews KW - AI development KW - AI ethics N2 - Background: The integration of artificial intelligence (AI)–based applications in the medical field has increased significantly, offering potential improvements in patient care and diagnostics. However, alongside these advancements, there is growing concern about ethical considerations, such as bias, informed consent, and trust in the development of these technologies. Objective: This study aims to assess the role of ethics in the development of AI-based applications in medicine. Furthermore, this study focuses on the potential consequences of neglecting ethical considerations in AI development, particularly their impact on patients and physicians. Methods: Qualitative content analysis was used to analyze the responses from expert interviews. Experts were selected based on at least 5 years of involvement in the research or practical development of AI-based applications in medicine, leading to the inclusion of 7 experts in the study. Results: The analysis revealed 3 main categories and 7 subcategories reflecting a wide range of views on the role of ethics in AI development. This variance underscores the subjectivity and complexity of integrating ethics into the development of AI in medicine. Although some experts view ethics as fundamental, others prioritize performance and efficiency, with some perceiving ethics as potential obstacles to technological progress. This dichotomy of perspectives reflects the inherently multifaceted nature of the issue. Conclusions: Despite the methodological limitations impacting the generalizability of the results, this study underscores the critical importance of consistent and integrated ethical considerations in AI development for medical applications. It advocates further research into effective strategies for ethical AI development, emphasizing the need for transparent and responsible practices, consideration of diverse data sources, physician training, and the establishment of comprehensive ethical and legal frameworks. UR - https://ai.jmir.org/2024/1/e51204 UR - http://dx.doi.org/10.2196/51204 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875585 ID - info:doi/10.2196/51204 ER -
TY - JOUR AU - Hansen, Steffan AU - Brandt, Joakim Carl AU - Søndergaard, Jens PY - 2024/1/22 TI - Beyond the Hype—The Actual Role and Risks of AI in Today's Medical Practice: Comparative-Approach Study JO - JMIR AI SP - e49082 VL - 3 KW - AI KW - artificial intelligence KW - ChatGPT-4 KW - Microsoft Bing KW - general practice KW - ChatGPT KW - chatbot KW - chatbots KW - writing KW - academic KW - academia KW - Bing N2 - Background: The evolution of artificial intelligence (AI) has significantly impacted various sectors, with health care witnessing some of its most groundbreaking contributions. Contemporary models, such as ChatGPT-4 and Microsoft Bing, have showcased capabilities beyond just generating text, aiding in complex tasks like literature searches and refining web-based queries. Objective: This study explores a compelling query: can AI author an academic paper independently? Our assessment focuses on 4 core dimensions: relevance (to ensure that AI's response directly addresses the prompt), accuracy (to ascertain that AI's information is both factually correct and current), clarity (to examine AI's ability to present coherent and logical ideas), and tone and style (to evaluate whether AI can align with the formality expected in academic writings). Additionally, we consider the ethical implications and practicality of integrating AI into academic writing. Methods: To assess the capabilities of ChatGPT-4 and Microsoft Bing in the context of academic paper assistance in general practice, we used a systematic approach. ChatGPT-4, an advanced AI language model by OpenAI, excels in generating human-like text and adapting responses based on user interactions, though it has a knowledge cutoff in September 2021. Microsoft Bing's AI chatbot facilitates user navigation on the Bing search engine, offering tailored search results. Results: In terms of relevance, ChatGPT-4 delved deeply into AI's health care role, citing academic sources and discussing diverse applications and concerns, while Microsoft Bing provided a concise, less detailed overview. In terms of accuracy, ChatGPT-4 correctly cited 72% (23/32) of its peer-reviewed articles but included some nonexistent references. Microsoft Bing's accuracy stood at 46% (6/13), supplemented by relevant non–peer-reviewed articles. In terms of clarity, both models conveyed clear, coherent text. ChatGPT-4 was particularly adept at detailing technical concepts, while Microsoft Bing was more general. In terms of tone, both models maintained an academic tone, but ChatGPT-4 exhibited superior depth and breadth in content delivery. Conclusions: Comparing ChatGPT-4 and Microsoft Bing for academic assistance revealed strengths and limitations. ChatGPT-4 excels in depth and relevance but falters in citation accuracy. Microsoft Bing is concise but lacks robust detail. Though both models have potential, neither can independently handle comprehensive academic tasks. As AI evolves, combining ChatGPT-4's depth with Microsoft Bing's up-to-date referencing could optimize academic support. Researchers should critically assess AI outputs to maintain academic credibility. UR - https://ai.jmir.org/2024/1/e49082 UR - http://dx.doi.org/10.2196/49082 UR - http://www.ncbi.nlm.nih.gov/pubmed/ ID - info:doi/10.2196/49082 ER -
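The citation accuracy figures in the record above (23/32 and 6/13) come from small samples, so it is worth seeing how wide the corresponding uncertainty is. This back-of-envelope sketch computes Wilson 95% CIs; the intervals are our illustration and are not reported in the abstract.

```python
# Wilson score intervals for the reported citation accuracy proportions:
# 23/32 verifiable references for ChatGPT-4 and 6/13 for Microsoft Bing.
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for name, k, n in [("ChatGPT-4", 23, 32), ("Microsoft Bing", 6, 13)]:
    lo, hi = wilson_ci(k, n)
    print(f"{name}: {k}/{n} = {k/n:.0%}, 95% CI {lo:.0%}-{hi:.0%}")
```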
TY - JOUR AU - Lu, Jiahui AU - Zhang, Huibin AU - Xiao, Yi AU - Wang, Yingyu PY - 2024/1/29 TI - An Environmental Uncertainty Perception Framework for Misinformation Detection and Spread Prediction in the COVID-19 Pandemic: Artificial Intelligence Approach JO - JMIR AI SP - e47240 VL - 3 KW - misinformation detection KW - misinformation spread prediction KW - uncertainty KW - COVID-19 KW - information environment N2 - Background: Amidst the COVID-19 pandemic, misinformation on social media has posed significant threats to public health. Detecting and predicting the spread of misinformation are crucial for mitigating its adverse effects. However, prevailing frameworks for these tasks have predominantly focused on post-level signals of misinformation, neglecting features of the broader information environment where misinformation originates and proliferates. Objective: This study aims to create a novel framework that integrates the uncertainty of the information environment into misinformation features, with the goal of enhancing the model's accuracy in tasks such as misinformation detection and predicting the scale of dissemination. The objective is to provide better support for online governance efforts during health crises. Methods: In this study, we embraced uncertainty features within the information environment and introduced a novel Environmental Uncertainty Perception (EUP) framework for the detection of misinformation and the prediction of its spread on social media. The framework encompasses uncertainty at 4 scales of the information environment: physical environment, macro-media environment, micro-communicative environment, and message framing. We assessed the effectiveness of the EUP using real-world COVID-19 misinformation data sets. Results: The experimental results demonstrated that the EUP alone achieved notably good performance, with detection accuracy at 0.753 and prediction accuracy at 0.71. These results were comparable to state-of-the-art baseline models such as bidirectional long short-term memory (BiLSTM; detection accuracy 0.733 and prediction accuracy 0.707) and bidirectional encoder representations from transformers (BERT; detection accuracy 0.755 and prediction accuracy 0.728). Additionally, when the baseline models collaborated with the EUP, they exhibited improved accuracy by an average of 1.98% for the misinformation detection task and 2.4% for the spread prediction task. On unbalanced data sets, the EUP yielded relative improvements of 21.5% and 5.7% in macro-F1 score and area under the curve, respectively. Conclusions: This study makes a significant contribution to the literature by recognizing uncertainty features within information environments as a crucial factor for improving misinformation detection and spread-prediction algorithms during the pandemic. The research elaborates on the complexities of uncertain information environments for misinformation across 4 distinct scales, including the physical environment, macro-media environment, micro-communicative environment, and message framing. The findings underscore the effectiveness of incorporating uncertainty into misinformation detection and spread prediction, providing an interdisciplinary and easily implementable framework for the field. UR - https://ai.jmir.org/2024/1/e47240 UR - http://dx.doi.org/10.2196/47240 UR - http://www.ncbi.nlm.nih.gov/pubmed/38875583 ID - info:doi/10.2196/47240 ER -
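The core move of the EUP framework in the record above, scoring uncertainty at 4 scales of a post's information environment and feeding those scores to a classifier (alone or alongside text embeddings), can be sketched as follows. The feature names, scores, and gradient boosting model are illustrative assumptions, not the paper's implementation.

```python
# Sketch of environment-level uncertainty features in the EUP style: each
# post gets a 4-scale uncertainty vector that a classifier consumes, alone
# or concatenated with text embeddings. All values below are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def environment_features(post):
    """4-scale uncertainty vector for one post (all fields hypothetical)."""
    return [
        post["case_volatility"],      # physical environment: local epidemic flux
        post["media_topic_entropy"],  # macro-media environment: news agenda dispersion
        post["thread_disagreement"],  # micro-communicative environment: reply stance mix
        post["hedge_term_ratio"],     # message framing: hedging language in the post
    ]

# Toy posts with invented uncertainty scores and misinformation labels.
posts = [
    {"case_volatility": 0.9, "media_topic_entropy": 0.8, "thread_disagreement": 0.7, "hedge_term_ratio": 0.6},
    {"case_volatility": 0.2, "media_topic_entropy": 0.3, "thread_disagreement": 0.1, "hedge_term_ratio": 0.2},
    {"case_volatility": 0.8, "media_topic_entropy": 0.7, "thread_disagreement": 0.9, "hedge_term_ratio": 0.5},
    {"case_volatility": 0.1, "media_topic_entropy": 0.2, "thread_disagreement": 0.2, "hedge_term_ratio": 0.1},
]
labels = [1, 0, 1, 0]  # 1 = misinformation

X = np.array([environment_features(p) for p in posts])
clf = GradientBoostingClassifier().fit(X, labels)
print(clf.predict(X))
# To mirror the reported collaboration gains, np.hstack these features with
# BiLSTM or BERT embeddings before fitting.
```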
TY - JOUR AU - Lorenzini, Giorgia AU - Arbelaez Ossa, Laura AU - Milford, Stephen AU - Elger, Simone Bernice AU - Shaw, Martin David AU - De Clercq, Eva PY - 2024/8/19 TI - The "Magical Theory" of AI in Medicine: Thematic Narrative Analysis JO - JMIR AI SP - e49795 VL - 3 KW - artificial intelligence KW - medicine KW - physicians KW - hype KW - narratives KW - qualitative research N2 - Background: The discourse surrounding medical artificial intelligence (AI) often focuses on narratives that either hype the technology's potential or predict dystopian futures. AI narratives have a significant influence on the direction of research, funding, and public opinion and thus shape the future of medicine. Objective: The paper aims to offer critical reflections on AI narratives, with a specific focus on medical AI, and to raise awareness of how people working with medical AI talk about AI and discharge their "narrative responsibility." Methods: Qualitative semistructured interviews were conducted with 41 participants from different disciplines who were exposed to medical AI in their profession. The research represents a secondary analysis of data using a thematic narrative approach. The analysis resulted in 2 main themes, each with 2 subthemes. Results: Stories about the AI-physician interaction depicted either a competitive or collaborative relationship. Some participants argued that AI might replace physicians, as it performs better than physicians. However, others believed that physicians should not be replaced and that AI should rather assist and support physicians. The idea of excessive technological deferral and automation bias was discussed, highlighting the risk of "losing" decisional power. The possibility that AI could relieve physicians from burnout and allow them to spend more time with patients was also considered. Finally, a few participants reported an extremely optimistic account of medical AI, while the majority criticized this type of story. The latter lamented the existence of a "magical theory" of medical AI, identified with techno-solutionist positions. Conclusions: Most of the participants reported a nuanced view of technology, recognizing both its benefits and challenges and avoiding polarized narratives. However, some participants did contribute to the hype surrounding medical AI, comparing it to human capabilities and depicting it as superior. Overall, the majority agreed that medical AI should assist rather than replace clinicians. The study concludes that a balanced narrative (that focuses on the technology's present capabilities and limitations) is necessary to fully realize the potential of medical AI while avoiding unrealistic expectations and hype. UR - https://ai.jmir.org/2024/1/e49795 UR - http://dx.doi.org/10.2196/49795 UR - http://www.ncbi.nlm.nih.gov/pubmed/39158953 ID - info:doi/10.2196/49795 ER -