@Article{info:doi/10.2196/62985, author="Loh, Rong De and Hill, D. Elliot and Liu, Nan and Dawson, Geraldine and Engelhard, M. Matthew", title="Limitations of Binary Classification for Long-Horizon Diagnosis Prediction and Advantages of a Discrete-Time Time-to-Event Approach: Empirical Analysis", journal="JMIR AI", year="2025", month="Mar", day="27", volume="4", pages="e62985", keywords="machine learning", keywords="artificial intelligence", keywords="deep learning", keywords="predictive models", keywords="practical models", keywords="early detection", keywords="electronic health records", keywords="right-censoring", keywords="survival analysis", keywords="distributional shifts", abstract="Background: A major challenge in using electronic health records (EHR) is the inconsistency of patient follow-up, resulting in right-censored outcomes. This becomes particularly problematic in long-horizon event predictions, such as autism and attention-deficit/hyperactivity disorder (ADHD) diagnoses, where a significant number of patients are lost to follow-up before the outcome can be observed. Consequently, fully supervised methods such as binary classification (BC), which are trained to predict observed diagnoses, are substantially affected by the probability of sufficient follow-up, leading to biased results. Objective: This empirical analysis aims to characterize BC's inherent limitations for long-horizon diagnosis prediction from EHR; and quantify the benefits of a specific time-to-event (TTE) approach, the discrete-time neural network (DTNN). Methods: Records within the Duke University Health System EHR were analyzed, extracting features such as ICD-10 (International Classification of Diseases, Tenth Revision) diagnosis codes, medications, laboratories, and procedures. We compared a DTNN to 3 BC approaches and a deep Cox proportional hazards model across 4 clinical conditions to examine distributional patterns across various subgroups. 
Time-varying area under the receiver operating characteristic curve (AUCt) and time-varying average precision (APt) were our primary evaluation metrics. Results: TTE models consistently had comparable or higher AUCt and APt than BC for all conditions. At clinically relevant operating time points, the area under the receiver operating characteristic curve (AUC) values for DTNNYOB$\geq$2020 (year-of-birth) and DCPHYOB$\geq$2020 (deep Cox proportional hazards) were 0.70 (95\% CI 0.66-0.77) and 0.72 (95\% CI 0.66-0.78) at t=5 for autism, 0.72 (95\% CI 0.65-0.76) and 0.68 (95\% CI 0.62-0.74) at t=7 for ADHD, 0.72 (95\% CI 0.70-0.75) and 0.71 (95\% CI 0.69-0.74) at t=1 for recurrent otitis media, and 0.74 (95\% CI 0.68-0.82) and 0.71 (95\% CI 0.63-0.77) at t=1 for food allergy, compared to 0.60 (95\% CI 0.55-0.66), 0.47 (95\% CI 0.40-0.54), 0.73 (95\% CI 0.70-0.75), and 0.77 (95\% CI 0.71-0.82) for BCYOB$\geq$2020, respectively. The probabilities predicted by BC models were positively correlated with censoring times, particularly for autism and ADHD prediction. Filtering strategies based on YOB or length of follow-up only partially corrected these biases. In subgroup analyses, only the DTNN predicted diagnosis probabilities that accurately reflected actual clinical prevalence and temporal trends. Conclusions: BC models substantially underpredicted diagnosis likelihood and inappropriately assigned lower probability scores to individuals with earlier censoring. Common filtering strategies did not adequately address this limitation. TTE approaches, particularly DTNN, effectively mitigated bias from the censoring distribution, resulting in superior discrimination and calibration performance and more accurate prediction of clinical prevalence. Machine learning practitioners should recognize the limitations of BC for long-horizon diagnosis prediction and adopt TTE approaches. 
The DTNN in particular is well-suited to mitigate the effects of right-censoring and maximize prediction performance in this setting. ", doi="10.2196/62985", url="https://ai.jmir.org/2025/1/e62985" } @Article{info:doi/10.2196/65456, author="Helgeson, A. Scott and Quicksall, S. Zachary and Johnson, W. Patrick and Lim, G. Kaiser and Carter, E. Rickey and Lee, S. Augustine", title="Estimation of Static Lung Volumes and Capacities From Spirometry Using Machine Learning: Algorithm Development and Validation", journal="JMIR AI", year="2025", month="Mar", day="24", volume="4", pages="e65456", keywords="artificial intelligence", keywords="machine learning", keywords="pulmonary function test", keywords="spirometry", keywords="total lung capacity", keywords="AI", keywords="ML", keywords="lung", keywords="lung volume", keywords="lung capacity", keywords="spirometer", keywords="lung disease", keywords="database", keywords="respiratory", keywords="pulmonary", abstract="Background: Spirometry can be performed in an office setting or remotely using portable spirometers. Although basic spirometry is used for diagnosis of obstructive lung disease, clinically relevant information such as restriction, hyperinflation, and air trapping require additional testing, such as body plethysmography, which is not as readily available. We hypothesize that spirometry data contains information that can allow estimation of static lung volumes in certain circumstances by leveraging machine learning techniques. Objective: The aim of the study was to develop artificial intelligence-based algorithms for estimating lung volumes and capacities using spirometry measures. Methods: This study obtained spirometry and lung volume measurements from the Mayo Clinic pulmonary function test database for patient visits between February 19, 2001, and December 16, 2022. 
Preprocessing was performed, and various machine learning algorithms were applied, including a generalized linear model with regularization, random forests, extremely randomized trees, gradient-boosted trees, and XGBoost for both classification and regression cohorts. Results: A total of 121,498 pulmonary function tests were used in this study, with 85,017 allotted for exploratory data analysis and model development (ie, training dataset) and 36,481 tests reserved for model evaluation (ie, testing dataset). The median age of the cohort was 64.7 years (IQR 18-119.6), with a balanced distribution between genders, consisting of 48.2\% (n=58,607) female and 51.8\% (n=62,889) male patients. The regression models showed a robust performance overall, with relatively low root mean square error and mean absolute error values observed across all predicted lung volumes. Across all lung volume categories, the models demonstrated strong discriminatory capacity, as indicated by the high area under the receiver operating characteristic curve values ranging from 0.85 to 0.99 in the training set and 0.81 to 0.98 in the testing set. Conclusions: Overall, the models demonstrate robust performance across lung volume measurements, underscoring their potential utility in clinical practice for accurate diagnosis and prognosis of respiratory conditions, particularly in settings where access to body plethysmography or other lung volume measurement modalities is limited. ", doi="10.2196/65456", url="https://ai.jmir.org/2025/1/e65456" } @Article{info:doi/10.2196/70222, author="Andalib, Saman and Spina, Aidin and Picton, Bryce and Solomon, S. Sean and Scolaro, A. John and Nelson, M. 
Ariana", title="Using AI to Translate and Simplify Spanish Orthopedic Medical Text: Instrument Validation Study", journal="JMIR AI", year="2025", month="Mar", day="21", volume="4", pages="e70222", keywords="large language models", keywords="LLM", keywords="patient education", keywords="translation", keywords="bilingual evaluation understudy", keywords="GPT-4", keywords="Google Translate", abstract="Background: Language barriers contribute significantly to health care disparities in the United States, where a sizable proportion of patients are exclusively Spanish speakers. In orthopedic surgery, such barriers impact both patients' comprehension of and engagement with available resources. Studies have explored the utility of large language models (LLMs) for medical translation but have yet to robustly evaluate artificial intelligence (AI)--driven translation and simplification of orthopedic materials for Spanish speakers. Objective: This study used the bilingual evaluation understudy (BLEU) method to assess translation quality and investigated the ability of AI to simplify patient education materials (PEMs) in Spanish. Methods: PEMs (n=78) from the American Academy of Orthopaedic Surgeons were translated from English to Spanish, using 2 LLMs (GPT-4 and Google Translate). The BLEU methodology was applied to compare AI translations with professionally human-translated PEMs. The Friedman test and Dunn multiple comparisons test were used to statistically quantify differences in translation quality. A readability analysis and feature analysis were subsequently performed to evaluate text simplification success and the impact of English text features on BLEU scores. The capability of an LLM to simplify medical language written in Spanish was also assessed. Results: As measured by BLEU scores, GPT-4 showed moderate success in translating PEMs into Spanish but was less successful than Google Translate. 
Simplified PEMs demonstrated improved readability when compared to original versions (P<.001) but were unable to reach the targeted grade level for simplification. The feature analysis revealed that the total number of syllables and average number of syllables per sentence had the highest impact on BLEU scores. GPT-4 was able to significantly reduce the complexity of medical text written in Spanish (P<.001). Conclusions: Although Google Translate outperformed GPT-4 in translation accuracy, LLMs, such as GPT-4, may provide significant utility in translating medical texts into Spanish and simplifying such texts. We recommend considering a dual approach---using Google Translate for translation and GPT-4 for simplification---to improve medical information accessibility and orthopedic surgery education among Spanish-speaking patients. ", doi="10.2196/70222", url="https://ai.jmir.org/2025/1/e70222" } @Article{info:doi/10.2196/67239, author="Tzeng, Jing-Tong and Li, Jeng-Lin and Chen, Huan-Yu and Huang, Chu-Hsiang and Chen, Chi-Hsin and Fan, Cheng-Yi and Huang, Pei-Chuan Edward and Lee, Chi-Chun", title="Improving the Robustness and Clinical Applicability of Automatic Respiratory Sound Classification Using Deep Learning--Based Audio Enhancement: Algorithm Development and Validation", journal="JMIR AI", year="2025", month="Mar", day="13", volume="4", pages="e67239", keywords="respiratory sound", keywords="lung sound", keywords="audio enhancement", keywords="noise robustness", keywords="clinical applicability", keywords="artificial intelligence", keywords="AI", abstract="Background: Deep learning techniques have shown promising results in the automatic classification of respiratory sounds. However, accurately distinguishing these sounds in real-world noisy conditions poses challenges for clinical deployment. In addition, predicting signals with only background noise could undermine user trust in the system. 
Objective: This study aimed to investigate the feasibility and effectiveness of incorporating a deep learning--based audio enhancement preprocessing step into automatic respiratory sound classification systems to improve robustness and clinical applicability. Methods: We conducted extensive experiments using various audio enhancement model architectures, including time-domain and time-frequency--domain approaches, in combination with multiple classification models to evaluate the effectiveness of the audio enhancement module in an automatic respiratory sound classification system. The classification performance was compared against the baseline noise injection data augmentation method. These experiments were carried out on 2 datasets: the International Conference in Biomedical and Health Informatics (ICBHI) respiratory sound dataset, which contains 5.5 hours of recordings, and the Formosa Archive of Breath Sound dataset, which comprises 14.6 hours of recordings. Furthermore, a physician validation study involving 7 senior physicians was conducted to assess the clinical utility of the system. Results: The integration of the audio enhancement module resulted in a 21.88\% increase (P<.001) in the ICBHI classification score on the ICBHI dataset and a 4.1\% improvement (P<.001) on the Formosa Archive of Breath Sound dataset in multi-class noisy scenarios. Quantitative analysis from the physician validation study revealed improvements in efficiency, diagnostic confidence, and trust during model-assisted diagnosis, with workflows that integrated enhanced audio leading to an 11.61\% increase in diagnostic sensitivity and facilitating high-confidence diagnoses. Conclusions: Incorporating an audio enhancement algorithm significantly enhances the robustness and clinical utility of automatic respiratory sound classification systems, improving performance in noisy environments and fostering greater trust among medical professionals. 
", doi="10.2196/67239", url="https://ai.jmir.org/2025/1/e67239" } @Article{info:doi/10.2196/64279, author="Rajaram, Akshay and Judd, Michael and Barber, David", title="Deep Learning Models to Predict Diagnostic and Billing Codes Following Visits to a Family Medicine Practice: Development and Validation Study", journal="JMIR AI", year="2025", month="Mar", day="7", volume="4", pages="e64279", keywords="machine learning", keywords="ML", keywords="artificial intelligence", keywords="algorithm", keywords="predictive model", keywords="predictive analytics", keywords="predictive system", keywords="family medicine", keywords="primary care", keywords="family doctor", keywords="family physician", keywords="income", keywords="billing code", keywords="electronic notes", keywords="electronic health record", keywords="electronic medical record", keywords="EMR", keywords="patient record", keywords="health record", keywords="personal health record", abstract="Background: Despite significant time spent on billing, family physicians routinely make errors and miss billing opportunities. In other disciplines, machine learning models have predicted Current Procedural Terminology codes with high accuracy. Objective: Our objective was to derive machine learning models capable of predicting diagnostic and billing codes from notes recorded in the electronic medical record. Methods: We conducted a retrospective algorithm development and validation study involving an academic family medicine practice. Visits between July 1, 2015, and June 30, 2020, containing a physician-authored note and an invoice in the electronic medical record were eligible for inclusion. We trained 2 deep learning models and compared their predictions to codes submitted for reimbursement. We calculated accuracy, recall, precision, F1-score, and area under the receiver operating characteristic curve. Results: Of the 245,045 visits eligible for inclusion, 198,802 (81\%) were included in model development. 
Accuracy was 99.8\% and 99.5\% for the diagnostic and billing code models, respectively. Recall was 49.4\% and 70.3\% for the diagnostic and billing code models, respectively. Precision was 55.3\% and 76.7\% for the diagnostic and billing code models, respectively. The area under the receiver operating characteristic curve was 0.983 for the diagnostic code model and 0.993 for the billing code model. Conclusions: We developed models capable of predicting diagnostic and billing codes from electronic notes following visits to a family medicine practice. The billing code model outperformed the diagnostic code model in terms of recall and precision, likely due to fewer codes being predicted. Work is underway to further enhance model performance and assess the generalizability of these models to other family medicine practices. ", doi="10.2196/64279", url="https://ai.jmir.org/2025/1/e64279" } @Article{info:doi/10.2196/55059, author="Ravaut, Mathieu and Zhao, Ruochen and Phung, Duy and Qin, Mengqi Vicky and Milovanovic, Dusan and Pienkowska, Anita and Bojic, Iva and Car, Josip and Joty, Shafiq", title="Targeting COVID-19 and Human Resources for Health News Information Extraction: Algorithm Development and Validation", journal="JMIR AI", year="2024", month="Oct", day="30", volume="3", pages="e55059", keywords="COVID-19", keywords="SARS-CoV-2", keywords="summary", keywords="summarize", keywords="news articles", keywords="deep learning", keywords="classification", keywords="summarization", keywords="machine learning", keywords="extract", keywords="extraction", keywords="news", keywords="media", keywords="NLP", keywords="natural language processing", abstract="Background: Global pandemics like COVID-19 put a high amount of strain on health care systems and health workers worldwide. These crises generate a vast amount of news information published online across the globe. 
This extensive corpus of articles has the potential to provide valuable insights into the nature of ongoing events and guide interventions and policies. However, the sheer volume of information is beyond the capacity of human experts to process and analyze effectively. Objective: The aim of this study was to explore how natural language processing (NLP) can be leveraged to build a system that allows for quick analysis of a high volume of news articles. Along with this, the objective was to create a workflow comprising human-computer symbiosis to derive valuable insights to support health workforce strategic policy dialogue, advocacy, and decision-making. Methods: We conducted a review of open-source news coverage from January 2020 to June 2022 on COVID-19 and its impacts on the health workforce from the World Health Organization (WHO) Epidemic Intelligence from Open Sources (EIOS) by synergizing NLP models, including classification and extractive summarization, and human-generated analyses. Our DeepCovid system was trained on 2.8 million news articles in English from more than 3000 internet sources across hundreds of jurisdictions. Results: Rules-based classification with hand-designed rules narrowed the data set to 8508 articles with high relevancy confirmed in the human-led evaluation. DeepCovid's automated information targeting component reached a very strong binary classification performance of 98.98 for the area under the receiver operating characteristic curve (ROC-AUC) and 47.21 for the area under the precision recall curve (PR-AUC). Its information extraction component attained good performance in automatic extractive summarization with a mean Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score of 47.76. DeepCovid's final summaries were used by human experts to write reports on the COVID-19 pandemic. 
Conclusions: It is feasible to synergize high-performing NLP models and human-generated analyses to benefit open-source health workforce intelligence. The DeepCovid approach can contribute to an agile and timely global view, providing complementary information to scientific literature. ", doi="10.2196/55059", url="https://ai.jmir.org/2024/1/e55059" } @Article{info:doi/10.2196/56590, author="Chen, Anjun and Wu, Erman and Huang, Ran and Shen, Bairong and Han, Ruobing and Wen, Jian and Zhang, Zhiyong and Li, Qinghua", title="Development of Lung Cancer Risk Prediction Machine Learning Models for Equitable Learning Health System: Retrospective Study", journal="JMIR AI", year="2024", month="Sep", day="11", volume="3", pages="e56590", keywords="lung cancer", keywords="risk prediction", keywords="early detection", keywords="learning health system", keywords="LHS", keywords="machine learning", keywords="ML", keywords="artificial intelligence", keywords="AI", keywords="predictive model", abstract="Background: A significant proportion of young at-risk patients and nonsmokers are excluded by the current guidelines for lung cancer (LC) screening, resulting in low-screening adoption. The vision of the US National Academy of Medicine to transform health systems into learning health systems (LHS) holds promise for bringing necessary structural changes to health care, thereby addressing the exclusivity and adoption issues of LC screening. Objective: This study aims to realize the LHS vision by designing an equitable, machine learning (ML)--enabled LHS unit for LC screening. It focuses on developing an inclusive and practical LC risk prediction model, suitable for initializing the ML-enabled LHS (ML-LHS) unit. This model aims to empower primary physicians in a clinical research network, linking central hospitals and rural clinics, to routinely deliver risk-based screening for enhancing LC early detection in broader populations. 
Methods: We created a standardized data set of health factors from 1397 patients with LC and 1448 control patients, all aged 30 years and older, including both smokers and nonsmokers, from a hospital's electronic medical record system. Initially, a data-centric ML approach was used to create inclusive ML models for risk prediction from all available health factors. Subsequently, a quantitative distribution of LC health factors was used in feature engineering to refine the models into a more practical model with fewer variables. Results: The initial inclusive 250-variable XGBoost model for LC risk prediction achieved performance metrics of 0.86 recall, 0.90 precision, and 0.89 accuracy. Post feature refinement, a practical 29-variable XGBoost model was developed, displaying performance metrics of 0.80 recall, 0.82 precision, and 0.82 accuracy. This model met the criteria for initializing the ML-LHS unit for risk-based, inclusive LC screening within clinical research networks. Conclusions: This study designed an innovative ML-LHS unit for a clinical research network, aiming to sustainably provide inclusive LC screening to all at-risk populations. It developed an inclusive and practical XGBoost model from hospital electronic medical record data, capable of initializing such an ML-LHS unit for community and rural clinics. The anticipated deployment of this ML-LHS unit is expected to significantly improve LC-screening rates and early detection among broader populations, including those typically overlooked by existing screening guidelines. 
", doi="10.2196/56590", url="https://ai.jmir.org/2024/1/e56590", url="http://www.ncbi.nlm.nih.gov/pubmed/39259582" } @Article{info:doi/10.2196/54449, author="Khademi, Sedigh and Palmer, Christopher and Javed, Muhammad and Dimaguila, Luis Gerardo and Clothier, Hazel and Buttery, Jim and Black, Jim", title="Near Real-Time Syndromic Surveillance of Emergency Department Triage Texts Using Natural Language Processing: Case Study in Febrile Convulsion Detection", journal="JMIR AI", year="2024", month="Aug", day="30", volume="3", pages="e54449", keywords="vaccine safety", keywords="immunization", keywords="febrile convulsion", keywords="syndromic surveillance", keywords="emergency department", keywords="natural language processing", abstract="Background: Collecting information on adverse events following immunization from as many sources as possible is critical for promptly identifying potential safety concerns and taking appropriate actions. Febrile convulsions are recognized as an important potential reaction to vaccination in children aged <6 years. Objective: The primary aim of this study was to evaluate the performance of natural language processing techniques and machine learning (ML) models for the rapid detection of febrile convulsion presentations in emergency departments (EDs), especially with respect to the minimum training data requirements to obtain optimum model performance. In addition, we examined the deployment requirements for an ML model to perform real-time monitoring of ED triage notes. Methods: We developed a pattern matching approach as a baseline and evaluated ML models for the classification of febrile convulsions in ED triage notes to determine both their training requirements and their effectiveness in detecting febrile convulsions. We measured their performance during training and then compared the deployed models' results on new incoming ED data. 
Results: Although the best standard neural networks had acceptable performance and were low-resource models, transformer-based models outperformed them substantially, justifying their ongoing deployment. Conclusions: Using natural language processing, particularly with the use of large language models, offers significant advantages in syndromic surveillance. Large language models make highly effective classifiers, and their text generation capacity can be used to enhance the quality and diversity of training data. ", doi="10.2196/54449", url="https://ai.jmir.org/2024/1/e54449" } @Article{info:doi/10.2196/58455, author="Kamis, Arnold and Gadia, Nidhi and Luo, Zilin and Ng, Xin Shu and Thumbar, Mansi", title="Obtaining the Most Accurate, Explainable Model for Predicting Chronic Obstructive Pulmonary Disease: Triangulation of Multiple Linear Regression and Machine Learning Methods", journal="JMIR AI", year="2024", month="Aug", day="29", volume="3", pages="e58455", keywords="chronic obstructive pulmonary disease", keywords="COPD", keywords="cigarette smoking", keywords="ethnic and racial differences", keywords="machine learning", keywords="multiple linear regression", keywords="household income", keywords="practical model", abstract="Background: Lung disease is a severe problem in the United States. Despite the decreasing rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) continues to be a health burden in the United States. In this paper, we focus on COPD in the United States from 2016 to 2019. Objective: We gathered a diverse set of non--personally identifiable information from public data sources to better understand and predict COPD rates at the core-based statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD. 
Methods: We integrated non--personally identifiable information from multiple Centers for Disease Control and Prevention sources and used them to analyze COPD with different types of methods. We included cigarette smoking, a well-known contributing factor, and race/ethnicity because health disparities among different races and ethnicities in the United States are also well known. The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods. Results: The most accurate multiple linear regression model has variance explained of 81.1\%, mean absolute error of 0.591, and symmetric mean absolute percentage error of 9.666. The most accurate machine learning model has variance explained of 85.7\%, mean absolute error of 0.456, and symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income are the strongest predictor variables. Moderately strong predictors include education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level. Conclusions: This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities in a model whose accuracy is superior to the best multiple linear regression. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health. 
", doi="10.2196/58455", url="https://ai.jmir.org/2024/1/e58455" } @Article{info:doi/10.2196/52190, author="Patel, Dhavalkumar and Timsina, Prem and Gorenstein, Larisa and Glicksberg, S. Benjamin and Raut, Ganesh and Cheetirala, Narayan Satya and Santana, Fabio and Tamegue, Jules and Kia, Arash and Zimlichman, Eyal and Levin, A. Matthew and Freeman, Robert and Klang, Eyal", title="Traditional Machine Learning, Deep Learning, and BERT (Large Language Model) Approaches for Predicting Hospitalizations From Nurse Triage Notes: Comparative Evaluation of Resource Management", journal="JMIR AI", year="2024", month="Aug", day="27", volume="3", pages="e52190", keywords="Bio-Clinical-BERT", keywords="term frequency--inverse document frequency", keywords="TF-IDF", keywords="health informatics", keywords="patient care", keywords="hospital resource management", keywords="care", keywords="resource management", keywords="management", keywords="language model", keywords="machine learning", keywords="hospitalization", keywords="deep learning", keywords="logistic regression", keywords="retrospective analysis", keywords="training", keywords="large language model", abstract="Background: Predicting hospitalization from nurse triage notes has the potential to augment care. However, there needs to be careful considerations for which models to choose for this goal. Specifically, health systems will have varying degrees of computational infrastructure available and budget constraints. Objective: To this end, we compared the performance of the deep learning, Bidirectional Encoder Representations from Transformers (BERT)--based model, Bio-Clinical-BERT, with a bag-of-words (BOW) logistic regression (LR) model incorporating term frequency--inverse document frequency (TF-IDF). These choices represent different levels of computational requirements. 
Methods: A retrospective analysis was conducted using data from 1,391,988 patients who visited emergency departments in the Mount Sinai Health System spanning from 2017 to 2022. The models were trained on 4 hospitals' data and externally validated on a fifth hospital's data. Results: The Bio-Clinical-BERT model achieved higher areas under the receiver operating characteristic curve (0.82, 0.84, and 0.85) compared to the BOW-LR-TF-IDF model (0.81, 0.83, and 0.84) across training sets of 10,000; 100,000; and {\textasciitilde}1,000,000 patients, respectively. Notably, both models proved effective at using triage notes for prediction, despite the modest performance gap. Conclusions: Our findings suggest that simpler machine learning models such as BOW-LR-TF-IDF could serve adequately in resource-limited settings. Given the potential implications for patient care and hospital resource management, further exploration of alternative models and techniques is warranted to enhance predictive performance in this critical domain. International Registered Report Identifier (IRRID): RR2-10.1101/2023.08.07.23293699 ", doi="10.2196/52190", url="https://ai.jmir.org/2024/1/e52190" } @Article{info:doi/10.2196/55820, author="Yaseliani, Mohammad and Noor-E-Alam, Md and Hasan, Mahmudul Md", title="Mitigating Sociodemographic Bias in Opioid Use Disorder Prediction: Fairness-Aware Machine Learning Framework", journal="JMIR AI", year="2024", month="Aug", day="20", volume="3", pages="e55820", keywords="opioid use disorder", keywords="fairness and bias", keywords="bias mitigation", keywords="machine learning", keywords="majority voting", abstract="Background: Opioid use disorder (OUD) is a critical public health crisis in the United States, affecting >5.5 million Americans in 2021. Machine learning has been used to predict patient risk of incident OUD. However, little is known about the fairness and bias of these predictive models. 
Objective: The aims of this study are two-fold: (1) to develop a machine learning bias mitigation algorithm for sociodemographic features and (2) to develop a fairness-aware weighted majority voting (WMV) classifier for OUD prediction. Methods: We used the 2020 National Survey on Drug and Health data to develop a neural network (NN) model using stochastic gradient descent (SGD; NN-SGD) and an NN model using Adam (NN-Adam) optimizers and evaluated sociodemographic bias by comparing the area under the curve values. A bias mitigation algorithm, based on equality of odds, was implemented to minimize disparities in specificity and recall. Finally, a WMV classifier was developed for fairness-aware prediction of OUD. To further analyze bias detection and mitigation, we did a 1-N matching of OUD to non-OUD cases, controlling for socioeconomic variables, and evaluated the performance of the proposed bias mitigation algorithm and WMV classifier. Results: Our bias mitigation algorithm substantially reduced bias with NN-SGD, by 21.66\% for sex, 1.48\% for race, and 21.04\% for income, and with NN-Adam by 16.96\% for sex, 8.87\% for marital status, 8.45\% for working condition, and 41.62\% for race. The fairness-aware WMV classifier achieved a recall of 85.37\% and 92.68\% and an accuracy of 58.85\% and 90.21\% using NN-SGD and NN-Adam, respectively. The results after matching also indicated remarkable bias reduction with NN-SGD and NN-Adam, respectively, as follows: sex (0.14\% vs 0.97\%), marital status (12.95\% vs 10.33\%), working condition (14.79\% vs 15.33\%), race (60.13\% vs 41.71\%), and income (0.35\% vs 2.21\%). Moreover, the fairness-aware WMV classifier achieved high performance with a recall of 100\% and 85.37\% and an accuracy of 73.20\% and 89.38\% using NN-SGD and NN-Adam, respectively. 
Conclusions: The application of the proposed bias mitigation algorithm shows promise in reducing sociodemographic bias, with the WMV classifier confirming bias reduction and high performance in OUD prediction. ", doi="10.2196/55820", url="https://ai.jmir.org/2024/1/e55820" } @Article{info:doi/10.2196/55840, author="Iwamoto, Hiroki and Nakano, Saki and Tajima, Ryotaro and Kiguchi, Ryo and Yoshida, Yuki and Kitanishi, Yoshitake and Aoki, Yasunori", title="Predicting Workers' Stress: Application of a High-Performance Algorithm Using Working-Style Characteristics", journal="JMIR AI", year="2024", month="Aug", day="2", volume="3", pages="e55840", keywords="high-performance algorithm", keywords="Japan", keywords="questionnaire", keywords="stress prediction model", keywords="teleworking", keywords="wearable device", abstract="Background: Work characteristics, such as teleworking rate, have been studied in relation to stress. However, the use of work-related data to improve a high-performance stress prediction model that suits an individual's lifestyle has not been evaluated. Objective: This study aims to develop a novel, high-performance algorithm to predict an employee's stress among a group of employees with similar working characteristics. Methods: This prospective observational study evaluated participants' responses to web-based questionnaires, including attendance records and data collected using a wearable device. Data spanning 12 weeks (between January 17, 2022, and April 10, 2022) were collected from 194 Shionogi Group employees. Participants wore the Fitbit Charge 4 wearable device, which collected data on daily sleep, activity, and heart rate. Daily work shift data included details of working hours. Weekly questionnaire responses included the K6 questionnaire for depression/anxiety, a behavioral questionnaire, and the number of days lunch was missed. 
The proposed prediction model used a neighborhood cluster (N=20) with working-style characteristics similar to those of the prediction target person. Data from the previous week predicted stress levels the following week. Three models were compared by selecting appropriate training data: (1) single model, (2) proposed method 1, and (3) proposed method 2. Shapley Additive Explanations (SHAP) were calculated for the top 10 extracted features from the Extreme Gradient Boosting (XGBoost) model to evaluate the amount and contribution direction categorized by teleworking rates (mean): low: <0.2 (more than 4 days/week in office), middle: 0.2 to <0.6 (2 to 4 days/week in office), and high: $\geq$0.6 (less than 2 days/week in office). Results: Data from 190 participants were used, with a teleworking rate ranging from 0\% to 79\%. The area under the curve (AUC) of the proposed method 2 was 0.84 (true positive vs false positive: 0.77 vs 0.26). Among participants with low teleworking rates, most features extracted were related to sleep, followed by activity and work. Among participants with high teleworking rates, most features were related to activity, followed by sleep and work. SHAP analysis showed that for participants with high teleworking rates, skipping lunch, working more/less than scheduled, higher fluctuations in heart rate, and lower mean sleep duration contributed to stress. In participants with low teleworking rates, coming too early or late to work (before/after 9 AM), a higher/lower than mean heart rate, lower fluctuations in heart rate, and burning more/fewer calories than normal contributed to stress. Conclusions: Forming a neighborhood cluster with similar working styles based on teleworking rates and using it as training data improved the prediction performance. The validity of the neighborhood cluster approach is indicated by differences in the contributing features and their contribution directions among teleworking levels. 
Trial Registration: UMIN UMIN000046394; https://www.umin.ac.jp/ctr/index.htm ", doi="10.2196/55840", url="https://ai.jmir.org/2024/1/e55840", url="http://www.ncbi.nlm.nih.gov/pubmed/39093604" } @Article{info:doi/10.2196/50800, author="Lee, Kyeryoung and Liu, Zongzhi and Mai, Yun and Jun, Tomi and Ma, Meng and Wang, Tongyu and Ai, Lei and Calay, Ediz and Oh, William and Stolovitzky, Gustavo and Schadt, Eric and Wang, Xiaoyan", title="Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation", journal="JMIR AI", year="2024", month="Jul", day="29", volume="3", pages="e50800", keywords="natural language processing", keywords="real-world data", keywords="clinical trial eligibility criteria", keywords="eligibility criteria--specific ontology", keywords="clinical trial protocol optimization", keywords="data-driven approach", abstract="Background: Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocol, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to achieve these objectives. Objective: This study aims to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records using deep learning--based NLP techniques. Methods: We obtained data of 3281 industry-sponsored phase 2 or 3 interventional clinical trials recruiting patients with non--small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn disease from ClinicalTrials.gov, spanning the period between 2013 and 2020. 
A customized bidirectional long short-term memory-- and conditional random field--based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms along with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of patients with non--small cell lung cancer (n=2775), curated from the Mount Sinai Health System, as a pilot study. Results: We manually annotated the clinical trial eligibility corpus (485/3281, 14.78\% trials) and constructed an eligibility criteria--specific ontology. Our customized NLP pipeline, developed based on the eligibility criteria--specific ontology that we created through manual annotation, achieved high precision (0.91, range 0.67-1.00) and recall (0.79, range 0.50-1.00) scores, as well as a high F1-score (0.83, range 0.67-1.00), enabling the efficient extraction of granular criteria entities and relevant attributes from 3281 clinical trials. A standardized eligibility criteria knowledge base, compatible with electronic health records, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. In addition, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients. Conclusions: Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of patients eligible for the trial. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining and optimizing the overall clinical trial process and improving efficiency in patient identification. 
", doi="10.2196/50800", url="https://ai.jmir.org/2024/1/e50800", url="http://www.ncbi.nlm.nih.gov/pubmed/39073872" } @Article{info:doi/10.2196/56700, author="Kurasawa, Hisashi and Waki, Kayo and Seki, Tomohisa and Chiba, Akihiro and Fujino, Akinori and Hayashi, Katsuyoshi and Nakahara, Eri and Haga, Tsuneyuki and Noguchi, Takashi and Ohe, Kazuhiko", title="Enhancing Type 2 Diabetes Treatment Decisions With Interpretable Machine Learning Models for Predicting Hemoglobin A1c Changes: Machine Learning Model Development", journal="JMIR AI", year="2024", month="Jul", day="18", volume="3", pages="e56700", keywords="AI", keywords="artificial intelligence", keywords="attention weight", keywords="type 2 diabetes", keywords="blood glucose control", keywords="machine learning", keywords="transformer", abstract="Background: Type 2 diabetes (T2D) is a significant global health challenge. Physicians need to assess whether future glycemic control will be poor on the current trajectory of usual care and usual-care treatment intensifications so that they can consider taking extra treatment measures to prevent poor outcomes. Predicting poor glycemic control from trends in hemoglobin A1c (HbA1c) levels is difficult due to the influence of seasonal fluctuations and other factors. Objective: We sought to develop a model that accurately predicts poor glycemic control among patients with T2D receiving usual care. Methods: Our machine learning model predicts poor glycemic control (HbA1c$\geq$8\%) using the transformer architecture, incorporating an attention mechanism to process irregularly spaced HbA1c time series and quantify temporal relationships of past HbA1c levels at each time point. We assessed the model using HbA1c levels from 7787 patients with T2D seeing specialist physicians at the University of Tokyo Hospital. The training data include instances of poor glycemic control occurring during usual care with usual-care treatment intensifications. 
We compared prediction accuracy, assessed with the area under the receiver operating characteristic curve, the area under the precision-recall curve, and the accuracy rate, to that of LightGBM. Results: The area under the receiver operating characteristic curve, the area under the precision-recall curve, and the accuracy rate of the proposed model were 0.925 (95\% CI 0.923-0.928), 0.864 (95\% CI 0.852-0.875), and 0.864 (95\% CI 0.860-0.869), respectively. The proposed model achieved high prediction accuracy comparable to or surpassing LightGBM's performance. The model prioritized the most recent HbA1c levels for predictions. Older HbA1c levels in patients with poor glycemic control were slightly more influential in predictions compared to patients with good glycemic control. Conclusions: The proposed model accurately predicts poor glycemic control for patients with T2D receiving usual care, including patients receiving usual-care treatment intensifications, allowing physicians to identify cases warranting extraordinary treatment intensifications. If used by a nonspecialist, the model's indication of likely future poor glycemic control may warrant a referral to a specialist. Future efforts could incorporate diverse and large-scale clinical data for improved accuracy. ", doi="10.2196/56700", url="https://ai.jmir.org/2024/1/e56700", url="http://www.ncbi.nlm.nih.gov/pubmed/39024008" } @Article{info:doi/10.2196/51118, author="Baronetto, Annalisa and Graf, Luisa and Fischer, Sarah and Neurath, F. 
Markus and Amft, Oliver", title="Multiscale Bowel Sound Event Spotting in Highly Imbalanced Wearable Monitoring Data: Algorithm Development and Validation Study", journal="JMIR AI", year="2024", month="Jul", day="10", volume="3", pages="e51118", keywords="bowel sound", keywords="deep learning", keywords="event spotting", keywords="wearable sensors", abstract="Background: Abdominal auscultation (i.e., listening to bowel sounds (BSs)) can be used to analyze digestion. An automated retrieval of BS would be beneficial to assess gastrointestinal disorders noninvasively. Objective: This study aims to develop a multiscale spotting model to detect BSs in continuous audio data from a wearable monitoring system. Methods: We designed a spotting model based on the Efficient-U-Net (EffUNet) architecture to analyze 10-second audio segments at a time and spot BSs with a temporal resolution of 25 ms. Evaluation data were collected across different digestive phases from 18 healthy participants and 9 patients with inflammatory bowel disease (IBD). Audio data were recorded in a daytime setting with a smart T-Shirt that embeds digital microphones. The data set was annotated by independent raters with substantial agreement (Cohen $\kappa$ between 0.70 and 0.75), resulting in 136 hours of labeled data. In total, 11,482 BSs were analyzed, with a BS duration ranging between 18 ms and 6.3 seconds. The share of BSs in the data set (BS ratio) was 0.0089. We analyzed the performance depending on noise level, BS duration, and BS event rate. We also report spotting timing errors. Results: Leave-one-participant-out cross-validation of BS event spotting yielded a median F1-score of 0.73 for both healthy volunteers and patients with IBD. EffUNet detected BSs under different noise conditions with 0.73 recall and 0.72 precision. In particular, for a signal-to-noise ratio over 4 dB, more than 83\% of BSs were recognized, with precision of 0.77 or more. 
EffUNet recall dropped below 0.60 for BS duration of 1.5 seconds or less. At a BS ratio greater than 0.05, the precision of our model was over 0.83. For both healthy participants and patients with IBD, insertion and deletion timing errors were the largest, with a total of 15.54 minutes of insertion errors and 13.08 minutes of deletion errors over the total audio data set. On our data set, EffUNet outperformed existing BS spotting models that provide similar temporal resolution. Conclusions: The EffUNet spotter is robust against background noise and can retrieve BSs with varying duration. EffUNet outperforms previous BS detection approaches in unmodified audio data, containing highly sparse BS events. ", doi="10.2196/51118", url="https://ai.jmir.org/2024/1/e51118", url="http://www.ncbi.nlm.nih.gov/pubmed/38985504" } @Article{info:doi/10.2196/54798, author="De Souza, Jessica and Viswanath, Kumar Varun and Echterhoff, Maria Jessica and Chamberlain, Kristina and Wang, Jay Edward", title="Augmenting Telepostpartum Care With Vision-Based Detection of Breastfeeding-Related Conditions: Algorithm Development and Validation", journal="JMIR AI", year="2024", month="Jun", day="24", volume="3", pages="e54798", keywords="remote consultations", keywords="artificial intelligence", keywords="AI for health care", keywords="deep learning", keywords="detection model", keywords="breastfeeding", keywords="telehealth", keywords="perinatal health", keywords="image analysis", keywords="women's health", keywords="mobile phone", abstract="Background: Breastfeeding benefits both the mother and infant and is a topic of attention in public health. After childbirth, untreated medical conditions or lack of support lead many mothers to discontinue breastfeeding. For instance, nipple damage and mastitis affect 80\% and 20\% of US mothers, respectively. Lactation consultants (LCs) help mothers with breastfeeding, providing in-person, remote, and hybrid lactation support. 
LCs guide, encourage, and find ways for mothers to have a better experience breastfeeding. Current telehealth services help mothers seek LCs for breastfeeding support, where images help them identify and address many issues. Due to the disproportionate ratio of LCs to mothers in need, these professionals are often overloaded and burned out. Objective: This study aims to investigate the effectiveness of 5 distinct convolutional neural networks in detecting healthy lactating breasts and 6 breastfeeding-related issues by only using red, green, and blue images. Our goal was to assess the applicability of this algorithm as an auxiliary resource for LCs to identify painful breast conditions quickly, better manage their patients through triage, respond promptly to patient needs, and enhance the overall experience and care for breastfeeding mothers. Methods: We evaluated the potential for 5 classification models to detect breastfeeding-related conditions using 1078 breast and nipple images gathered from web-based and physical educational resources. We used the convolutional neural networks Resnet50, Visual Geometry Group model with 16 layers (VGG16), InceptionV3, EfficientNetV2, and DenseNet169 to classify the images across 7 classes: healthy, abscess, mastitis, nipple blebs, dermatosis, engorgement, and nipple damage by improper feeding or misuse of breast pumps. We also evaluated the models' ability to distinguish between healthy and unhealthy images. We present an analysis of the classification challenges, identifying image traits that may confound the detection model. Results: The best model achieves an average area under the receiver operating characteristic curve of 0.93 for all conditions after data augmentation for multiclass classification. For binary classification, we achieved, with the best model, an average area under the curve of 0.96 for all conditions after data augmentation. 
Several factors contributed to the misclassification of images, including similar visual features in the conditions that precede other conditions (such as the mastitis spectrum disorder), partially covered breasts or nipples, and images depicting multiple conditions in the same breast. Conclusions: This vision-based automated detection technique offers an opportunity to enhance postpartum care for mothers and can potentially help alleviate the workload of LCs by expediting decision-making processes. ", doi="10.2196/54798", url="https://ai.jmir.org/2024/1/e54798" } @Article{info:doi/10.2196/47805, author="Mullick, Tahsin and Shaaban, Sam and Radovic, Ana and Doryab, Afsaneh", title="Framework for Ranking Machine Learning Predictions of Limited, Multimodal, and Longitudinal Behavioral Passive Sensing Data: Combining User-Agnostic and Personalized Modeling", journal="JMIR AI", year="2024", month="May", day="20", volume="3", pages="e47805", keywords="machine learning", keywords="AI", keywords="artificial intelligence", keywords="passive sensing", keywords="ranking framework", keywords="small health data set", keywords="ranking", keywords="algorithm", keywords="algorithms", keywords="sensor", keywords="multimodal", keywords="predict", keywords="prediction", keywords="agnostic", keywords="framework", keywords="validation", keywords="data set", abstract="Background: Passive mobile sensing provides opportunities for measuring and monitoring health status in the wild and outside of clinics. However, longitudinal, multimodal mobile sensor data can be small, noisy, and incomplete. This makes processing, modeling, and prediction of these data challenging. The small size of the data set restricts it from being modeled using complex deep learning networks. The current state of the art (SOTA) tackles small sensor data sets following a singular modeling paradigm based on traditional machine learning (ML) algorithms. 
These opt for either a user-agnostic modeling approach, making the model susceptible to a larger degree of noise, or a personalized approach, where training on an individual's limited data gives rise to overfitting; a trade-off must therefore be sought by choosing 1 of the 2 modeling approaches to reach predictions. Objective: The objective of this study was to filter, rank, and output the best predictions for small, multimodal, longitudinal sensor data using a framework that is designed to tackle data sets that are limited in size (particularly targeting health studies that use passive multimodal sensors) and that combines both user-agnostic and personalized approaches, along with a combination of ranking strategies to filter predictions. Methods: In this paper, we introduced a novel ranking framework for longitudinal multimodal sensors (FLMS) to address challenges encountered in health studies involving passive multimodal sensors. Using the FLMS, we (1) built a tensor-based aggregation and ranking strategy for final interpretation, (2) processed various combinations of sensor fusions, and (3) balanced user-agnostic and personalized modeling approaches with appropriate cross-validation strategies. The performance of the FLMS was validated with the help of a real data set of adolescents diagnosed with major depressive disorder for the prediction of change in depression in the adolescent participants. Results: Predictions output by the proposed FLMS achieved a 7\% increase in accuracy and a 13\% increase in recall for the real data set. Experiments with existing SOTA ML algorithms showed an 11\% increase in accuracy for the depression data set and how overfitting and sparsity were handled. Conclusions: The FLMS aims to fill the gap that currently exists when modeling passive sensor data with a small number of data points. 
It achieves this through leveraging both user-agnostic and personalized modeling techniques in tandem with an effective ranking strategy to filter predictions. ", doi="10.2196/47805", url="https://ai.jmir.org/2024/1/e47805", url="http://www.ncbi.nlm.nih.gov/pubmed/38875667" } @Article{info:doi/10.2196/48067, author="Kamruzzaman, Methun and Heavey, Jack and Song, Alexander and Bielskas, Matthew and Bhattacharya, Parantapa and Madden, Gregory and Klein, Eili and Deng, Xinwei and Vullikanti, Anil", title="Improving Risk Prediction of Methicillin-Resistant Staphylococcus aureus Using Machine Learning Methods With Network Features: Retrospective Development Study", journal="JMIR AI", year="2024", month="May", day="16", volume="3", pages="e48067", keywords="methicillin-resistant Staphylococcus aureus", keywords="network", keywords="machine learning", keywords="penalized logistic regression", keywords="ensemble learning", keywords="gradient-boosted classifier", keywords="random forest classifier", keywords="extreme boosted gradient boosted classifier", keywords="Shapley Additive Explanations", keywords="SHAP", keywords="health care--associated infection", keywords="HAI", abstract="Background: Health care--associated infections due to multidrug-resistant organisms (MDROs), such as methicillin-resistant Staphylococcus aureus (MRSA) and Clostridioides difficile (CDI), place a significant burden on our health care infrastructure. Objective: Screening for MDROs is an important mechanism for preventing spread but is resource intensive. The objective of this study was to develop automated tools that can predict colonization or infection risk using electronic health record (EHR) data, provide useful information to aid infection control, and guide empiric antibiotic coverage. 
Methods: We retrospectively developed a machine learning model to detect MRSA colonization and infection in undifferentiated patients at the time of sample collection from hospitalized patients at the University of Virginia Hospital. We used clinical and nonclinical features derived from on-admission and throughout-stay information from the patient's EHR data to build the model. In addition, we used a class of features derived from contact networks in EHR data; these network features can capture patients' contacts with providers and other patients, improving model interpretability and accuracy for predicting the outcome of surveillance tests for MRSA. Finally, we explored heterogeneous models for different patient subpopulations, for example, those admitted to an intensive care unit or emergency department or those with specific testing histories, for which dedicated models perform better. Results: We found that the penalized logistic regression performs better than other methods, and this model's performance measured in terms of its receiver operating characteristics-area under the curve score improves by nearly 11\% when we use polynomial (second-degree) transformation of the features. Some significant features in predicting MDRO risk include antibiotic use, surgery, use of devices, dialysis, patient's comorbidity conditions, and network features. Among these, network features add the most value and improve the model's performance by at least 15\%. The penalized logistic regression model with the same transformation of features also performs better than other models for specific patient subpopulations. Conclusions: Our study shows that MRSA risk prediction can be conducted quite effectively by machine learning methods using clinical and nonclinical features derived from EHR data. Network features are the most predictive and provide significant improvement over prior methods. 
Furthermore, heterogeneous prediction models for different patient subpopulations enhance the model's performance. ", doi="10.2196/48067", url="https://ai.jmir.org/2024/1/e48067", url="http://www.ncbi.nlm.nih.gov/pubmed/38875598" } @Article{info:doi/10.2196/52095, author="Majdik, P. Zoltan and Graham, Scott S. and Shiva Edward, C. Jade and Rodriguez, N. Sabrina and Karnes, S. Martha and Jensen, T. Jared and Barbour, B. Joshua and Rousseau, F. Justin", title="Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study", journal="JMIR AI", year="2024", month="May", day="16", volume="3", pages="e52095", keywords="named-entity recognition", keywords="large language models", keywords="fine-tuning", keywords="transfer learning", keywords="expert annotation", keywords="annotation", keywords="sample size", keywords="sample", keywords="language model", keywords="machine learning", keywords="natural language processing", keywords="disclosure", keywords="disclosures", keywords="statement", keywords="statements", keywords="conflict of interest", abstract="Background: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. Objective: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. Methods: A random sample of 200 disclosure statements was prepared for annotation. All ``PERSON'' and ``ORG'' entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. 
From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. Results: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2\_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa\_large to 527 sentences for GPT-2\_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38. Conclusions: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size. 
", doi="10.2196/52095", url="https://ai.jmir.org/2024/1/e52095", url="http://www.ncbi.nlm.nih.gov/pubmed/38875593" } @Article{info:doi/10.2196/52171, author="Li, Joe and Washington, Peter", title="A Comparison of Personalized and Generalized Approaches to Emotion Recognition Using Consumer Wearable Devices: Machine Learning Study", journal="JMIR AI", year="2024", month="May", day="10", volume="3", pages="e52171", keywords="affect detection", keywords="affective computing", keywords="deep learning", keywords="digital health", keywords="emotion recognition", keywords="machine learning", keywords="mental health", keywords="personalization", keywords="stress detection", keywords="wearable technology", abstract="Background: There are a wide range of potential adverse health effects, ranging from headaches to cardiovascular disease, associated with long-term negative emotions and chronic stress. Because many indicators of stress are imperceptible to observers, the early detection of stress remains a pressing medical need, as it can enable early intervention. Physiological signals offer a noninvasive method for monitoring affective states and are recorded by a growing number of commercially available wearables. Objective: We aim to study the differences between personalized and generalized machine learning models for 3-class emotion classification (neutral, stress, and amusement) using wearable biosignal data. Methods: We developed a neural network for the 3-class emotion classification problem using data from the Wearable Stress and Affect Detection (WESAD) data set, a multimodal data set with physiological signals from 15 participants. We compared the results between a participant-exclusive generalized, a participant-inclusive generalized, and a personalized deep learning model. 
Results: For the 3-class classification problem, our personalized model achieved an average accuracy of 95.06\% and an F1-score of 91.71\%; our participant-inclusive generalized model achieved an average accuracy of 66.95\% and an F1-score of 42.50\%; and our participant-exclusive generalized model achieved an average accuracy of 67.65\% and an F1-score of 43.05\%. Conclusions: Our results emphasize the need for increased research in personalized emotion recognition models given that they outperform generalized models in certain contexts. We also demonstrate that personalized machine learning models for emotion classification are viable and can achieve high performance. ", doi="10.2196/52171", url="https://ai.jmir.org/2024/1/e52171", url="http://www.ncbi.nlm.nih.gov/pubmed/38875573" } @Article{info:doi/10.2196/42630, author="Zhang, Boya and Naderi, Nona and Mishra, Rahul and Teodoro, Douglas", title="Online Health Search Via Multidimensional Information Quality Assessment Based on Deep Language Models: Algorithm Development and Validation", journal="JMIR AI", year="2024", month="May", day="2", volume="3", pages="e42630", keywords="health misinformation", keywords="information retrieval", keywords="deep learning", keywords="language model", keywords="transfer learning", keywords="infodemic", abstract="Background: Widespread misinformation in web resources can lead to serious implications for individuals seeking health advice. Despite that, information retrieval models are often focused only on the query-document relevance dimension to rank results. Objective: We investigate a multidimensional information quality retrieval model based on deep learning to enhance the effectiveness of online health care information search results. Methods: In this study, we simulated online health information search scenarios with a topic set of 32 different health-related inquiries and a corpus containing 1 billion web documents from the April 2019 snapshot of Common Crawl. 
Using state-of-the-art pretrained language models, we assessed the quality of the retrieved documents according to their usefulness, supportiveness, and credibility dimensions for a given search query on 6030 human-annotated query-document pairs. We evaluated this approach using transfer learning and more specific domain adaptation techniques. Results: In the transfer learning setting, the usefulness model provided the largest distinction between help- and harm-compatible documents, with a difference of +5.6\%, leading to a majority of helpful documents in the top 10 retrieved. The supportiveness model achieved the best harm compatibility (+2.4\%), while the combination of usefulness, supportiveness, and credibility models achieved the largest distinction between help- and harm-compatibility on helpful topics (+16.9\%). In the domain adaptation setting, the linear combination of different models showed robust performance, with help-harm compatibility above +4.4\% for all dimensions and going as high as +6.8\%. Conclusions: These results suggest that integrating automatic ranking models created for specific information quality dimensions can increase the effectiveness of health-related information retrieval. Thus, our approach could be used to enhance searches made by individuals seeking online health information. 
", doi="10.2196/42630", url="https://ai.jmir.org/2024/1/e42630", url="http://www.ncbi.nlm.nih.gov/pubmed/38875551" } @Article{info:doi/10.2196/52615, author="Yan, Chao and Zhang, Ziqi and Nyemba, Steve and Li, Zhuohang", title="Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial", journal="JMIR AI", year="2024", month="Apr", day="22", volume="3", pages="e52615", keywords="synthetic data generation", keywords="electronic health record", keywords="generative neural networks", keywords="tutorial", doi="10.2196/52615", url="https://ai.jmir.org/2024/1/e52615", url="http://www.ncbi.nlm.nih.gov/pubmed/38875595" } @Article{info:doi/10.2196/47122, author="Rodriguez, V. Danissa and Chen, Ji and Viswanadham, N. Ratnalekha V. and Lawrence, Katharine and Mann, Devin", title="Leveraging Machine Learning to Develop Digital Engagement Phenotypes of Users in a Digital Diabetes Prevention Program: Evaluation Study", journal="JMIR AI", year="2024", month="Mar", day="1", volume="3", pages="e47122", keywords="machine learning", keywords="digital health", keywords="diabetes", keywords="mobile health", keywords="messaging platforms", keywords="user engagement", keywords="patient behavior", keywords="digital diabetes prevention programs", keywords="digital phenotypes", keywords="digital prescription", keywords="users", keywords="prevention", keywords="evaluation study", keywords="communication", keywords="support", keywords="engagement", keywords="phenotypes", keywords="digital health intervention", keywords="chronic disease management", abstract="Background: Digital diabetes prevention programs (dDPPs) are effective ``digital prescriptions'' but have high attrition rates and program noncompletion. To address this, we developed a personalized automatic messaging system (PAMS) that leverages SMS text messaging and data integration into clinical workflows to increase dDPP engagement via enhanced patient-provider communication. 
Preliminary data showed positive results. However, further investigation is needed to determine how to optimize the tailoring of support technology such as PAMS based on a user's preferences to boost their dDPP engagement. Objective: This study evaluates leveraging machine learning (ML) to develop digital engagement phenotypes of dDPP users and assess ML's accuracy in predicting engagement with dDPP activities. This research will be used in a PAMS optimization process to improve PAMS personalization by incorporating engagement prediction and digital phenotyping. This study aims (1) to prove the feasibility of using dDPP user-collected data to build an ML model that predicts engagement and contributes to identifying digital engagement phenotypes, (2) to describe methods for developing ML models with dDPP data sets and present preliminary results, and (3) to present preliminary data on user profiling based on ML model outputs. Methods: Using the gradient-boosted forest model, we predicted engagement in 4 dDPP individual activities (physical activity, lessons, social activity, and weigh-ins) and general activity (engagement in any activity) based on previous short- and long-term activity in the app. The area under the receiver operating characteristic curve, the area under the precision-recall curve, and the Brier score metrics determined the performance of the model. Shapley values reflected the feature importance of the models and determined what variables informed user profiling through latent profile analysis. Results: We developed 2 models using weekly and daily DPP data sets (328,821 and 704,242 records, respectively), which yielded predictive accuracies above 90\%. 
Although both models were highly accurate, the daily model better fitted our research plan because it predicted daily changes in individual activities, which was crucial for creating the ``digital phenotypes.'' To better understand the variables contributing to the model predictions, we calculated the Shapley values for both models to identify the features with the highest contribution to model fit; engagement with any activity in the dDPP in the last 7 days had the most predictive power. We profiled users with latent profile analysis after 2 weeks of engagement (Bayesian information criterion=-3222.46) with the dDPP and identified 6 profiles of users, including those with high engagement, minimal engagement, and attrition. Conclusions: Preliminary results demonstrate that applying ML methods with predictive power is an acceptable mechanism to tailor and optimize messaging interventions to support patient engagement and adherence to digital prescriptions. The results enable future optimization of our existing messaging platform and expansion of this methodology to other clinical domains. 
Trial Registration: ClinicalTrials.gov NCT04773834; https://www.clinicaltrials.gov/ct2/show/NCT04773834 International Registered Report Identifier (IRRID): RR2-10.2196/26750 ", doi="10.2196/47122", url="https://ai.jmir.org/2024/1/e47122", url="http://www.ncbi.nlm.nih.gov/pubmed/38875579" } @Article{info:doi/10.2196/44185, author="Pan, Cheng and Luo, Hao and Cheung, Gary and Zhou, Huiquan and Cheng, Reynold and Cullum, Sarah and Wu, Chuan", title="Identifying Frailty in Older Adults Receiving Home Care Assessment Using Machine Learning: Longitudinal Observational Study on the Role of Classifier, Feature Selection, and Sample Size", journal="JMIR AI", year="2024", month="Jan", day="31", volume="3", pages="e44185", keywords="machine learning", keywords="logistic regression", keywords="frailty", keywords="older adults", keywords="home care", keywords="sample size", keywords="features", keywords="data set", keywords="model", keywords="mortality prediction", keywords="assessment", abstract="Background: Machine learning techniques are starting to be used in various health care data sets to identify frail persons who may benefit from interventions. However, evidence about the performance of machine learning techniques compared to conventional regression is mixed. It is also unclear what methodological and database factors are associated with performance. Objective: This study aimed to compare the mortality prediction accuracy of various machine learning classifiers for identifying frail older adults in different scenarios. Methods: We used deidentified data collected from older adults (65 years of age and older) assessed with the interRAI-Home Care instrument in New Zealand between January 1, 2012, and December 31, 2016. A total of 138 interRAI assessment items were used to predict 6-month and 12-month mortality, using 3 machine learning classifiers (random forest [RF], extreme gradient boosting [XGBoost], and multilayer perceptron [MLP]) and regularized logistic regression. 
We conducted a simulation study comparing the performance of machine learning models with logistic regression and the interRAI Home Care Frailty Scale and examined the effects of sample sizes, the number of features, and train-test split ratios. Results: A total of 95,042 older adults (median age 82.66 years, IQR 77.92-88.76; n=37,462, 39.42\% male) receiving home care were analyzed. The average area under the curve (AUC) and sensitivities of 6-month mortality prediction showed that machine learning classifiers did not outperform regularized logistic regressions. In terms of AUC, regularized logistic regression had better performance than XGBoost, MLP, and RF when the number of features was ≤80 and the sample size ≤16,000; MLP outperformed regularized logistic regression in terms of sensitivities when the number of features was ≥40 and the sample size ≥4000. Conversely, RF and XGBoost demonstrated higher specificities than regularized logistic regression in all scenarios. Conclusions: The study revealed that machine learning models exhibited significant variation in prediction performance when evaluated using different metrics. Regularized logistic regression was an effective model for identifying frail older adults receiving home care, as indicated by the AUC, particularly when the number of features and sample sizes were not excessively large. Conversely, MLP displayed superior sensitivity, while RF exhibited superior specificity when the number of features and sample sizes were large. ", doi="10.2196/44185", url="https://ai.jmir.org/2024/1/e44185" } @Article{info:doi/10.2196/49784, author="Brann, Felix and Sterling, William Nicholas and Frisch, O. Stephanie and Schrager, D. 
Justin", title="Sepsis Prediction at Emergency Department Triage Using Natural Language Processing: Retrospective Cohort Study", journal="JMIR AI", year="2024", month="Jan", day="25", volume="3", pages="e49784", keywords="natural language processing", keywords="machine learning", keywords="sepsis", keywords="emergency department", keywords="triage", abstract="Background: Despite its high lethality, sepsis can be difficult to detect on initial presentation to the emergency department (ED). Machine learning--based tools may provide avenues for earlier detection and lifesaving intervention. Objective: The study aimed to predict sepsis at the time of ED triage using natural language processing of nursing triage notes and available clinical data. Methods: We constructed a retrospective cohort of all 1,234,434 consecutive ED encounters in 2015-2021 from 4 separate clinically heterogeneous academically affiliated EDs. After exclusion criteria were applied, the final cohort included 1,059,386 adult ED encounters. The primary outcome criteria for sepsis were presumed severe infection and acute organ dysfunction. After vectorization and dimensional reduction of triage notes and clinical data available at triage, a decision tree--based ensemble (time-of-triage) model was trained to predict sepsis using the training subset (n=950,921). A separate (comprehensive) model was trained using these data and laboratory data, as it became available at 1-hour intervals, after triage. Model performances were evaluated using the test (n=108,465) subset. Results: Sepsis occurred in 35,318 encounters (incidence 3.45\%). For sepsis prediction at the time of patient triage, using the primary definition, the area under the receiver operating characteristic curve (AUC) and macro F1-score for sepsis were 0.94 and 0.61, respectively. Sensitivity, specificity, and false positive rate were 0.87, 0.85, and 0.15, respectively. 
The time-of-triage model accurately predicted sepsis in 76\% (1635/2150) of sepsis cases where sepsis screening was not initiated at triage and 97.5\% (1630/1671) of cases where sepsis screening was initiated at triage. Positive and negative predictive values were 0.18 and 0.99, respectively. For sepsis prediction using laboratory data available each hour after ED arrival, the AUC peaked at 0.97 at 12 hours. Similar results were obtained when stratifying by hospital and when the Centers for Disease Control and Prevention hospital toolkit for adult sepsis surveillance criteria were used to define sepsis. Among septic cases, sepsis was predicted in 36.1\% (1375/3814), 49.9\% (1902/3814), and 68.3\% (2604/3814) of encounters, respectively, at 3, 2, and 1 hours prior to the first intravenous antibiotic order or where antibiotics were not ordered within the first 12 hours. Conclusions: Sepsis can accurately be predicted at ED presentation using nursing triage notes and clinical information available at the time of triage. This indicates that machine learning can facilitate timely and reliable alerting for intervention. Free-text data can improve the performance of predictive modeling at the time of triage and throughout the ED course. 
", doi="10.2196/49784", url="https://ai.jmir.org/2024/1/e49784", url="http://www.ncbi.nlm.nih.gov/pubmed/38875594" } @Article{info:doi/10.2196/51240, author="Xie, Fagen and Chang, Jenny and Luong, Tiffany and Wu, Bechien and Lustigova, Eva and Shrader, Eva and Chen, Wansu", title="Identifying Symptoms Prior to Pancreatic Ductal Adenocarcinoma Diagnosis in Real-World Care Settings: Natural Language Processing Approach", journal="JMIR AI", year="2024", month="Jan", day="15", volume="3", pages="e51240", keywords="cancer", keywords="pancreatic ductal adenocarcinoma", keywords="symptom", keywords="clinical note", keywords="electronic health record", keywords="natural language processing", keywords="computerized algorithm", keywords="pancreatic cancer", keywords="cancer death", keywords="abdominal pain", keywords="pain", keywords="validation", keywords="detection", keywords="pancreas", abstract="Background: Pancreatic cancer is the third leading cause of cancer deaths in the United States. Pancreatic ductal adenocarcinoma (PDAC) is the most common form of pancreatic cancer, accounting for up to 90\% of all cases. Patient-reported symptoms are often the triggers of cancer diagnosis and therefore, understanding the PDAC-associated symptoms and the timing of symptom onset could facilitate early detection of PDAC. Objective: This paper aims to develop a natural language processing (NLP) algorithm to capture symptoms associated with PDAC from clinical notes within a large integrated health care system. Methods: We used unstructured data within 2 years prior to PDAC diagnosis between 2010 and 2019 and among matched patients without PDAC to identify 17 PDAC-related symptoms. Related terms and phrases were first compiled from publicly available resources and then recursively reviewed and enriched with input from clinicians and chart review. A computerized NLP algorithm was iteratively developed and fine-tuned via multiple rounds of chart review followed by adjudication. 
Finally, the developed algorithm was applied to the validation data set to assess performance, as well as to the study implementation notes. Results: A total of 408,147 and 709,789 notes were retrieved from 2611 patients with PDAC and 10,085 matched patients without PDAC, respectively. In descending order, the symptom distribution of the study implementation notes ranged from 4.98\% for abdominal or epigastric pain to 0.05\% for upper extremity deep vein thrombosis in the PDAC group, and from 1.75\% for back pain to 0.01\% for pale stool in the non-PDAC group. Validation of the NLP algorithm against adjudicated chart review results of 1000 notes showed that precision ranged from 98.9\% (jaundice) to 84\% (upper extremity deep vein thrombosis), recall ranged from 98.1\% (weight loss) to 82.8\% (epigastric bloating), and F1-scores ranged from 0.97 (jaundice) to 0.86 (depression). Conclusions: The developed and validated NLP algorithm could be used for the early detection of PDAC. ", doi="10.2196/51240", url="https://ai.jmir.org/2024/1/e51240", url="http://www.ncbi.nlm.nih.gov/pubmed/38875566" } @Article{info:doi/10.2196/46840, author="Irie, Fumi and Matsumoto, Koutarou and Matsuo, Ryu and Nohara, Yasunobu and Wakisaka, Yoshinobu and Ago, Tetsuro and Nakashima, Naoki and Kitazono, Takanari and Kamouchi, Masahiro", title="Predictive Performance of Machine Learning--Based Models for Poststroke Clinical Outcomes in Comparison With Conventional Prognostic Scores: Multicenter, Hospital-Based Observational Study", journal="JMIR AI", year="2024", month="Jan", day="11", volume="3", pages="e46840", keywords="brain infarction", keywords="outcome", keywords="prediction", keywords="machine learning", keywords="prognostic score", abstract="Background: Although machine learning is a promising tool for making prognoses, the performance of machine learning in predicting outcomes after stroke remains to be examined. 
Objective: This study aims to examine how much data-driven models with machine learning improve predictive performance for poststroke outcomes compared with conventional stroke prognostic scores and to elucidate how explanatory variables in machine learning--based models differ from the items of the stroke prognostic scores. Methods: We used data from 10,513 patients who were registered in a multicenter prospective stroke registry in Japan between 2007 and 2017. The outcomes were poor functional outcome (modified Rankin Scale score >2) and death at 3 months after stroke. Machine learning--based models were developed using all variables with regularization methods, random forests, or boosted trees. We selected 3 stroke prognostic scores, namely, ASTRAL (Acute Stroke Registry and Analysis of Lausanne), PLAN (preadmission comorbidities, level of consciousness, age, neurologic deficit), and iScore (Ischemic Stroke Predictive Risk Score) for comparison. Item-based regression models were developed using the items of these 3 scores. The model performance was assessed in terms of discrimination and calibration. To compare the predictive performance of the data-driven model with that of the item-based model, we performed internal validation after random splits of identical populations into 80\% of patients as a training set and 20\% of patients as a test set; the models were developed in the training set and were validated in the test set. We evaluated the contribution of each variable to the models and compared the predictors used in the machine learning--based models with the items of the stroke prognostic scores. Results: The mean age of the study patients was 73.0 (SD 12.5) years, and 59.1\% (6209/10,513) of them were men. 
The area under the receiver operating characteristic curves and the area under the precision-recall curves for predicting poststroke outcomes were higher for machine learning--based models than for item-based models in identical populations after random splits. Machine learning--based models also performed better than item-based models in terms of the Brier score. Machine learning--based models used different explanatory variables, such as laboratory data, from the items of the conventional stroke prognostic scores. Including these data in the machine learning--based models as explanatory variables improved performance in predicting outcomes after stroke, especially poststroke death. Conclusions: Machine learning--based models performed better in predicting poststroke outcomes than regression models using the items of conventional stroke prognostic scores, although they required additional variables, such as laboratory data, to attain improved performance. Further studies are warranted to validate the usefulness of machine learning in clinical settings. ", doi="10.2196/46840", url="https://ai.jmir.org/2024/1/e46840", url="http://www.ncbi.nlm.nih.gov/pubmed/38875590" } @Article{info:doi/10.2196/51168, author="Karpathakis, Kassandra and Pencheon, Emma and Cushnan, Dominic", title="Learning From International Comparators of National Medical Imaging Initiatives for AI Development: Multiphase Qualitative Study", journal="JMIR AI", year="2024", month="Jan", day="4", volume="3", pages="e51168", keywords="digital health", keywords="mobile health", keywords="mHealth", keywords="medical imaging", keywords="artificial intelligence", keywords="health policy", abstract="Background: The COVID-19 pandemic drove investment and research into medical imaging platforms to provide data to create artificial intelligence (AI) algorithms for the management of patients with COVID-19. 
Building on the success of England's National COVID-19 Chest Imaging Database, the national digital policy body (NHSX) sought to create a generalized national medical imaging platform for the development, validation, and deployment of algorithms. Objective: This study aims to understand international use cases of medical imaging platforms for the development and implementation of algorithms to inform the creation of England's national imaging platform. Methods: The National Health Service (NHS) AI Lab Policy and Strategy Team adopted a multiphased approach: (1) identification and prioritization of national AI imaging platforms; (2) Political, Economic, Social, Technological, Legal, and Environmental (PESTLE) factor analysis deep dive into national AI imaging platforms; (3) semistructured interviews with key stakeholders; (4) workshop on emerging themes and insights with the internal NHSX team; and (5) formulation of policy recommendations. Results: International use cases of national AI imaging platforms (n=7) were prioritized for PESTLE factor analysis. Stakeholders (n=13) from the international use cases were interviewed. Themes (n=8) from the semistructured interviews, including interview quotes, were analyzed with workshop participants (n=5). The outputs of the deep dives, interviews, and workshop were synthesized thematically into 8 categories with 17 subcategories. On the basis of the insights from the international use cases, policy recommendations (n=12) were developed to support the NHS AI Lab in the design and development of the English national medical imaging platform. Conclusions: The creation of AI algorithms supporting technology and infrastructure such as platforms often occurs in isolation within countries, let alone between countries. This novel policy research project sought to bridge the gap by learning from the challenges, successes, and experience of England's international counterparts. 
Policy recommendations based on international learnings focused on the demonstrable benefits of the platform to secure sustainable funding, validation of algorithms and infrastructure to support in situ deployment, and creating wraparound tools for nontechnical participants such as clinicians to engage with algorithm creation. As health care organizations increasingly adopt technological solutions, policy makers have a responsibility to ensure that initiatives are informed by learnings from both national and international initiatives as well as disseminating the outcomes of their work. ", doi="10.2196/51168", url="https://ai.jmir.org/2024/1/e51168" } @Article{info:doi/10.2196/40965, author="Pita Ferreira, Patr{\'i}cia and Godinho Sim{\~o}es, Diogo and Pinto de Carvalho, Constan{\c{c}}a and Duarte, Francisco and Fernandes, Eug{\'e}nia and Casaca Carvalho, Pedro and Loff, Francisco Jos{\'e} and Soares, Paula Ana and Albuquerque, Jo{\~a}o Maria and Pinto-Leite, Pedro and Peralta-Santos, Andr{\'e}", title="Real-Time Classification of Causes of Death Using AI: Sensitivity Analysis", journal="JMIR AI", year="2023", month="Nov", day="22", volume="2", pages="e40965", keywords="artificial intelligence", keywords="AI", keywords="mortality", keywords="deep neural networks", keywords="evaluation", keywords="machine learning", keywords="deep learning", keywords="mortality statistics", keywords="underlying cause of death", abstract="Background: In 2021, the European Union reported >270,000 excess deaths, including >16,000 in Portugal. The Portuguese Directorate-General of Health developed a deep neural network, AUTOCOD, which determines the primary causes of death by analyzing the free text of physicians' death certificates (DCs). Although AUTOCOD's performance has been established, it remains unclear whether its performance remains consistent over time, particularly during periods of excess mortality. 
Objective: This study aims to assess the sensitivity and other performance metrics of AUTOCOD in classifying underlying causes of death compared with manual coding to identify specific causes of death during periods of excess mortality. Methods: We included all DCs between 2016 and 2019. AUTOCOD's performance was evaluated by calculating various performance metrics, such as sensitivity, specificity, positive predictive value (PPV), and F1-score, using a confusion matrix. This compared International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), classifications of DCs by AUTOCOD with those by human coders at the Directorate-General of Health (gold standard). Subsequently, we compared periods without excess mortality with periods of excess, severe, and extreme excess mortality. We defined excess mortality as 2 consecutive days with a Z score above the 95\% baseline limit, severe excess mortality as 2 consecutive days with a Z score >4 SDs, and extreme excess mortality as 2 consecutive days with a Z score >6 SDs. Finally, we repeated the analyses for the 3 most common ICD-10 chapters focusing on block-level classification. Results: We analyzed a large data set comprising 330,098 DCs classified by both human coders and AUTOCOD. AUTOCOD demonstrated high sensitivity (≥0.75) for the 10 ICD-10 chapters examined, with values surpassing 0.90 for the more prevalent chapters (chapter II---``Neoplasms,'' chapter IX---``Diseases of the circulatory system,'' and chapter X---``Diseases of the respiratory system''), accounting for 67.69\% (223,459/330,098) of all human-coded causes of death. No substantial differences were observed in these high-sensitivity values when comparing periods without excess mortality with periods of excess, severe, and extreme excess mortality. The same holds for specificity, which exceeded 0.96 for all chapters examined, and for PPV, which surpassed 0.75 in 9 chapters, including the more prevalent ones. 
When considering block classification within the 3 most common ICD-10 chapters, AUTOCOD maintained a high performance, demonstrating high sensitivity (≥0.75) for 13 ICD-10 blocks, high PPV for 9 blocks, and specificity of >0.98 in all blocks, with no significant differences between periods without excess mortality and those with excess mortality. Conclusions: Our findings indicate that, during periods of excess and extreme excess mortality, AUTOCOD's performance remains unaffected by potential text quality degradation because of pressure on health services. Consequently, AUTOCOD can be dependably used for real-time cause-specific mortality surveillance even in extreme excess mortality situations. ", doi="10.2196/40965", url="https://ai.jmir.org/2023/1/e40965", url="http://www.ncbi.nlm.nih.gov/pubmed/38875558" } @Article{info:doi/10.2196/45257, author="Lashen, Hazem and St John, Lee Terrence and Almallah, Zaki Y. and Sasidhar, Madhu and Shamout, E. Farah", title="Machine Learning Models Versus the National Early Warning Score System for Predicting Deterioration: Retrospective Cohort Study in the United Arab Emirates", journal="JMIR AI", year="2023", month="Nov", day="6", volume="2", pages="e45257", keywords="machine learning", keywords="early warning score system", keywords="clinical deterioration", keywords="early warning", keywords="score system", keywords="cohort", keywords="real-world data", keywords="neural network", keywords="predict", keywords="deterioration", abstract="Background: Early warning score systems are widely used for identifying patients who are at the highest risk of deterioration to assist clinical decision-making. This could facilitate early intervention and consequently improve patient outcomes; for example, the National Early Warning Score (NEWS) system, which is recommended by the Royal College of Physicians in the United Kingdom, uses predefined alerting thresholds to assign scores to patients based on their vital signs. 
However, there is limited evidence of the reliability of such scores across patient cohorts in the United Arab Emirates. Objective: Our aim in this study was to propose a data-driven model that accurately predicts in-hospital deterioration in an inpatient cohort in the United Arab Emirates. Methods: We conducted a retrospective cohort study using a real-world data set that consisted of 16,901 unique patients associated with 26,073 inpatient emergency encounters and 951,591 observation sets collected between April 2015 and August 2021 at a large multispecialty hospital in Abu Dhabi, United Arab Emirates. The observation sets included routine measurements of heart rate, respiratory rate, systolic blood pressure, level of consciousness, temperature, and oxygen saturation, as well as whether the patient was receiving supplementary oxygen. We divided the data set of 16,901 unique patients into training, validation, and test sets consisting of 11,830 (70\%; 18,319/26,073, 70.26\% emergency encounters), 3397 (20.1\%; 5206/26,073, 19.97\% emergency encounters), and 1674 (9.9\%; 2548/26,073, 9.77\% emergency encounters) patients, respectively. We defined an adverse event as the occurrence of admission to the intensive care unit, mortality, or both if the patient was admitted to the intensive care unit first. On the basis of 7 routine vital signs measurements, we assessed the performance of the NEWS system in detecting deterioration within 24 hours using the area under the receiver operating characteristic curve (AUROC). We also developed and evaluated several machine learning models, including logistic regression, a gradient-boosting model, and a feed-forward neural network. Results: In a holdout test set of 2548 encounters with 95,755 observation sets, the NEWS system achieved an overall AUROC value of 0.682 (95\% CI 0.673-0.690). 
In comparison, the best-performing machine learning models, which were the gradient-boosting model and the neural network, achieved AUROC values of 0.778 (95\% CI 0.770-0.785) and 0.756 (95\% CI 0.749-0.764), respectively. Our interpretability results highlight the importance of temperature and respiratory rate in predicting patient deterioration. Conclusions: Although traditional early warning score systems are the dominant form of deterioration prediction models in clinical practice today, we strongly recommend the development and use of cohort-specific machine learning models as an alternative. This is especially important in external patient cohorts that were unseen during model development. ", doi="10.2196/45257", url="https://ai.jmir.org/2023/1/e45257", url="http://www.ncbi.nlm.nih.gov/pubmed/38875543" } @Article{info:doi/10.2196/48340, author="Shi, Bohan and Dhaliwal, Singh Satvinder and Soo, Marcus and Chan, Cheri and Wong, Jocelin and Lam, C. Natalie W. and Zhou, Entong and Paitimusa, Vivien and Loke, Yin Kum and Chin, Joel and Chua, Tuan Mei and Liaw, Suan Kathy Chiew and Lim, H. Amos W. and Insyirah, Fatin Fadil and Yen, Shih-Cheng and Tay, Arthur and Ang, Bin Seng", title="Assessing Elevated Blood Glucose Levels Through Blood Glucose Evaluation and Monitoring Using Machine Learning and Wearable Photoplethysmography Sensors: Algorithm Development and Validation", journal="JMIR AI", year="2023", month="Oct", day="27", volume="2", pages="e48340", keywords="diabetes mellitus", keywords="explainable artificial intelligence", keywords="feature engineering", keywords="machine learning", keywords="photoplethysmography", keywords="wearable sensor", abstract="Background: Diabetes mellitus is the most challenging and fastest-growing global public health concern. Approximately 10.5\% of the global adult population is affected by diabetes, and almost half of them are undiagnosed. 
The growing at-risk population exacerbates the shortage of health resources, with an estimated 10.6\% and 6.2\% of adults worldwide having impaired glucose tolerance and impaired fasting glycemia, respectively. All current diabetes screening methods are invasive and opportunistic and must be conducted in a hospital or laboratory by trained professionals. At-risk participants might remain undetected for years and miss the precious time window for early intervention to prevent or delay the onset of diabetes and its complications. Objective: We aimed to develop an artificial intelligence solution to recognize elevated blood glucose levels (≥7.8 mmol/L) noninvasively and evaluate diabetic risk based on repeated measurements. Methods: This study was conducted at KK Women's and Children's Hospital in Singapore, and 500 participants were recruited (mean age 38.73, SD 10.61 years; mean BMI 24.4, SD 5.1 kg/m2). The blood glucose levels for most participants were measured before and after consuming 75 g of sugary drinks using both a conventional glucometer (Accu-Chek Performa) and a wrist-worn wearable. The results obtained from the glucometer were used as ground-truth measurements. We performed extensive feature engineering on photoplethysmography (PPG) sensor data and identified features that were sensitive to glucose changes. These selected features were further analyzed using an explainable artificial intelligence approach to understand their contribution to our predictions. Results: Multiple machine learning models were trained and assessed with 10-fold cross-validation, using participant demographic data and critical features extracted from PPG measurements as predictors. A support vector machine with a radial basis function kernel had the best detection performance, with an average accuracy of 84.7\%, a sensitivity of 81.05\%, a specificity of 88.3\%, a precision of 87.51\%, a geometric mean of 84.54\%, and an F-score of 84.03\%. 
Conclusions: Our findings suggest that PPG measurements can be used to identify participants with elevated blood glucose measurements and assist in the screening of participants for diabetes risk. ", doi="10.2196/48340", url="https://ai.jmir.org/2023/1/e48340", url="http://www.ncbi.nlm.nih.gov/pubmed/38875549" } @Article{info:doi/10.2196/43483, author="Saraswat, Nidhi and Li, Chuqin and Jiang, Min", title="Identifying the Question Similarity of Regulatory Documents in the Pharmaceutical Industry by Using the Recognizing Question Entailment System: Evaluation Study", journal="JMIR AI", year="2023", month="Sep", day="26", volume="2", pages="e43483", keywords="regulatory affairs", keywords="frequently asked questions", keywords="FAQs", keywords="Recognizing Question Entailment system", keywords="RQE system", keywords="transformer-based models", keywords="textual data augmentations", abstract="Background: The regulatory affairs (RA) division in a pharmaceutical establishment is the point of contact between regulatory authorities and pharmaceutical companies. They are delegated the crucial and strenuous task of extracting and summarizing relevant information in the most meticulous manner from various search systems. An artificial intelligence (AI)--based intelligent search system that can significantly bring down the manual efforts in the existing processes of the RA department while maintaining and improving the quality of final outcomes is desirable. We proposed a ``frequently asked questions'' component and its utility in an AI-based intelligent search system in this paper. The scenario is further complicated by the lack of publicly available relevant data sets in the RA domain to train the machine learning models that can facilitate cognitive search systems for regulatory authorities. 
Objective: In this study, we aimed to use AI-based intelligent computational models to automatically recognize semantically similar question pairs in the RA domain and evaluate the Recognizing Question Entailment--based system. Methods: We used transfer learning techniques and experimented with transformer-based models pretrained on corpora collected from different resources, such as Bidirectional Encoder Representations from Transformers (BERT), Clinical BERT, BioBERT, and BlueBERT. We used a manually labeled data set that contained 150 question pairs in the pharmaceutical regulatory domain to evaluate the performance of our model. Results: The Clinical BERT model performed better than other domain-specific BERT-based models in identifying question similarity from the RA domain. The BERT model had the best ability to learn domain-specific knowledge with transfer learning, which reached the best performance when fine-tuned with sufficient clinical domain question pairs. The top-performing model achieved an accuracy of 90.66\% on the test set. Conclusions: This study demonstrates the possibility of using pretrained language models to recognize question similarity in the pharmaceutical regulatory domain. Transformer-based models that are pretrained on clinical notes perform better than models pretrained on biomedical text in recognizing the question's semantic similarity in this domain. We also discuss the challenges of using data augmentation techniques to address the lack of relevant data in this domain. The results of our experiment indicated that increasing the number of training samples using back translation and entity replacement did not enhance the model's performance. This lack of improvement may be attributed to the intricate and specialized nature of texts in the regulatory domain. Our work provides the foundation for further studies that apply state-of-the-art linguistic models to regulatory documents in the pharmaceutical industry. 
", doi="10.2196/43483", url="https://ai.jmir.org/2023/1/e43483" } @Article{info:doi/10.2196/44909, author="Kendale, Samir and Bishara, Andrew and Burns, Michael and Solomon, Stuart and Corriere, Matthew and Mathis, Michael", title="Machine Learning for the Prediction of Procedural Case Durations Developed Using a Large Multicenter Database: Algorithm Development and Validation Study", journal="JMIR AI", year="2023", month="Sep", day="8", volume="2", pages="e44909", keywords="medical informatics", keywords="artificial intelligence", keywords="AI", keywords="machine learning", keywords="operating room", keywords="OR management", keywords="perioperative", keywords="algorithm development", keywords="validation", keywords="patient communication", keywords="surgical procedure", keywords="prediction model", abstract="Background: Accurate projections of procedural case durations are complex but critical to the planning of perioperative staffing, operating room resources, and patient communication. Nonlinear prediction models using machine learning methods may provide opportunities for hospitals to improve upon current estimates of procedure duration. Objective: The aim of this study was to determine whether a machine learning algorithm scalable across multiple centers could make estimations of case duration within a tolerance limit because there are substantial resources required for operating room functioning that relate to case duration. Methods: Deep learning, gradient boosting, and ensemble machine learning models were generated using perioperative data available at 3 distinct time points: the time of scheduling, the time of patient arrival to the operating or procedure room (primary model), and the time of surgical incision or procedure start. The primary outcome was procedure duration, defined by the time between the arrival and the departure of the patient from the procedure room. 
Model performance was assessed by mean absolute error (MAE), the proportion of predictions falling within 20\% of the actual duration, and other standard metrics. Performance was compared with a baseline method of historical means within a linear regression model. Model features driving predictions were assessed using Shapley additive explanations values and permutation feature importance. Results: A total of 1,177,893 procedures from 13 academic and private hospitals between 2016 and 2019 were used. Across all procedures, the median procedure duration was 94 (IQR 50-167) minutes. In estimating the procedure duration, the gradient boosting machine was the best-performing model, demonstrating an MAE of 34 (SD 47) minutes, with 46\% of the predictions falling within 20\% of the actual duration in the test data set. This represented a statistically and clinically significant improvement in predictions compared with a baseline linear regression model (MAE 43 min; P<.001; 39\% of the predictions falling within 20\% of the actual duration). The most important features in model training were historical procedure duration by surgeon, the word ``free'' within the procedure text, and the time of day. Conclusions: Nonlinear models using machine learning techniques may be used to generate high-performing, automatable, explainable, and scalable prediction models for procedure duration. ", doi="10.2196/44909", url="https://ai.jmir.org/2023/1/e44909" } @Article{info:doi/10.2196/46769, author="Yang, Chaoqi and Xiao, Cao and Westover, Brandon M. 
and Sun, Jimeng", title="Self-Supervised Electroencephalogram Representation Learning for Automatic Sleep Staging: Model Development and Evaluation Study", journal="JMIR AI", year="2023", month="Jul", day="26", volume="2", pages="e46769", keywords="physiological signals", keywords="electroencephalogram", keywords="EEG", keywords="sleep staging", keywords="sleep", keywords="predict", keywords="wearable devices", keywords="wearable", keywords="self-supervised learning", keywords="digital health", keywords="mHealth", keywords="mobile health", keywords="healthcare", keywords="health care", keywords="machine learning", abstract="Background: Deep learning models have shown great success in automating tasks in sleep medicine by learning from carefully annotated electroencephalogram (EEG) data. However, effectively using a large amount of raw EEG data remains a challenge. Objective: In this study, we aim to learn robust vector representations from massive unlabeled EEG signals, such that the learned vectorized features (1) are expressive enough to replace the raw signals in the sleep staging task, and (2) provide better predictive performance than supervised models in scenarios involving fewer labels and noisy samples. Methods: We propose a self-supervised model, Contrast with the World Representation (ContraWR), for EEG signal representation learning. Unlike previous models that use a set of negative samples, our model uses global statistics (ie, the average representation) from the data set to distinguish signals associated with different sleep stages. The ContraWR model is evaluated on 3 real-world EEG data sets that include both settings: at-home and in-laboratory EEG recording. Results: ContraWR outperforms 4 recently reported self-supervised learning methods on the sleep staging task across 3 large EEG data sets. 
ContraWR also supersedes supervised learning when fewer training labels are available (eg, 4\% accuracy improvement when less than 2\% of data are labeled on the Sleep EDF data set). Moreover, the model provides informative, representative feature structures in 2D projection. Conclusions: We show that ContraWR is robust to noise and can provide high-quality EEG representations for downstream prediction tasks. The proposed model can be generalized to other unsupervised physiological signal learning tasks. Future directions include exploring task-specific data augmentations and combining self-supervised methods with supervised methods, building upon the initial success of self-supervised learning reported in this study. ", doi="10.2196/46769", url="https://ai.jmir.org/2023/1/e46769", url="http://www.ncbi.nlm.nih.gov/pubmed/38090533" } @Article{info:doi/10.2196/45000, author="Sorbello, Alfred and Haque, Arefinul Syed and Hasan, Rashedul and Jermyn, Richard and Hussein, Ahmad and Vega, Alex and Zembrzuski, Krzysztof and Ripple, Anna and Ahadpour, Mitra", title="Artificial Intelligence--Enabled Software Prototype to Inform Opioid Pharmacovigilance From Electronic Health Records: Development and Usability Study", journal="JMIR AI", year="2023", month="Jul", day="18", volume="2", pages="e45000", keywords="electronic health records", keywords="pharmacovigilance", keywords="artificial intelligence", keywords="real world data", keywords="EHR", keywords="natural language", keywords="software application", keywords="drug", keywords="Food and Drug Administration", keywords="deep learning", abstract="Background: The use of patient health and treatment information captured in structured and unstructured formats in computerized electronic health record (EHR) repositories could potentially augment the detection of safety signals for drug products regulated by the US Food and Drug Administration (FDA). 
Natural language processing and other artificial intelligence (AI) techniques provide novel methodologies that could be leveraged to extract clinically useful information from EHR resources. Objective: Our aim is to develop a novel AI-enabled software prototype to identify adverse drug event (ADE) safety signals from free-text discharge summaries in EHRs to enhance opioid drug safety and research activities at the FDA. Methods: We developed a prototype for web-based software that leverages keyword and trigger-phrase searching with rule-based algorithms and deep learning to extract candidate ADEs for specific opioid drugs from discharge summaries in the Medical Information Mart for Intensive Care III (MIMIC III) database. The prototype uses MedSpacy components to identify relevant sections of discharge summaries and a pretrained natural language processing (NLP) model, Spark NLP for Healthcare, for named entity recognition. Fifteen FDA staff members provided feedback on the prototype's features and functionalities. Results: Using the prototype, we were able to identify known, labeled, opioid-related adverse drug reactions from text in EHRs. The AI-enabled model achieved accuracy, recall, precision, and F1-scores of 0.66, 0.69, 0.64, and 0.67, respectively. FDA participants assessed the prototype as highly desirable in user satisfaction, visualizations, and in the potential to support drug safety signal detection for opioid drugs from EHR data while saving time and manual effort. Actionable design recommendations included (1) enlarging the tabs and visualizations; (2) enabling more flexibility and customizations to fit end users' individual needs; (3) providing additional instructional resources; (4) adding multiple graph export functionality; and (5) adding project summaries. Conclusions: The novel prototype uses innovative AI-based techniques to automate searching for, extracting, and analyzing clinically useful information captured in unstructured text in EHRs. 
It increases efficiency in harnessing real-world data for opioid drug safety and increases the usability of the data to support regulatory review while decreasing the manual research burden. ", doi="10.2196/45000", url="https://ai.jmir.org/2023/1/e45000", url="http://www.ncbi.nlm.nih.gov/pubmed/37771410" } @Article{info:doi/10.2196/41818, author="Moon, Sungrim and He, Huan and Jia, Heling and Liu, Hongfang and Fan, Wilfred Jungwei", title="Extractive Clinical Question-Answering With Multianswer and Multifocus Questions: Data Set Development and Evaluation Study", journal="JMIR AI", year="2023", month="Jun", day="20", volume="2", pages="e41818", keywords="question-answering", keywords="information extraction", keywords="dataset", keywords="data set", keywords="artificial intelligence", keywords="natural language processing", abstract="Background: Extractive question-answering (EQA) is a useful natural language processing (NLP) application for answering patient-specific questions by locating answers in their clinical notes. Realistic clinical EQA can yield multiple answers to a single question and multiple focus points in 1 question, which are lacking in existing data sets for the development of artificial intelligence solutions. Objective: This study aimed to create a data set for developing and evaluating clinical EQA systems that can handle natural multianswer and multifocus questions. Methods: We leveraged the annotated relations from the 2018 National NLP Clinical Challenges corpus to generate an EQA data set. Specifically, the 1-to-N, M-to-1, and M-to-N drug-reason relations were included to form the multianswer and multifocus question-answering entries, which represent more complex and natural challenges in addition to the basic 1-drug-1-reason cases. A baseline solution was developed and tested on the data set. Results: The derived RxWhyQA data set contains 96,939 QA entries. 
Among the answerable questions, 25\% of them require multiple answers, and 2\% of them ask about multiple drugs within 1 question. Frequent cues were observed around the answers in the text, and 90\% of the drug and reason terms occurred within the same or an adjacent sentence. The baseline EQA solution achieved a best F1-score of 0.72 on the entire data set, and on specific subsets, it was 0.93 for the unanswerable questions, 0.48 for single-drug questions versus 0.60 for multidrug questions, and 0.54 for the single-answer questions versus 0.43 for multianswer questions. Conclusions: The RxWhyQA data set can be used to train and evaluate systems that need to handle multianswer and multifocus questions. Specifically, multianswer EQA appears to be challenging and therefore warrants more investment in research. We created and shared a clinical EQA data set with multianswer and multifocus questions that would channel future research efforts toward more realistic scenarios. ", doi="10.2196/41818", url="https://ai.jmir.org/2023/1/e41818", url="http://www.ncbi.nlm.nih.gov/pubmed/38875580" } @Article{info:doi/10.2196/44191, author="Pongdee, Thanai and Larson, B. Nicholas and Divekar, Rohit and Bielinski, J. 
Suzette and Liu, Hongfang and Moon, Sungrim", title="Automated Identification of Aspirin-Exacerbated Respiratory Disease Using Natural Language Processing and Machine Learning: Algorithm Development and Evaluation Study", journal="JMIR AI", year="2023", month="Jun", day="12", volume="2", pages="e44191", keywords="aspirin exacerbated respiratory disease", keywords="natural language processing", keywords="electronic health record", keywords="identification", keywords="machine learning", keywords="aspirin", keywords="asthma", keywords="respiratory illness", keywords="artificial intelligence", keywords="natural language processing algorithm", abstract="Background: Aspirin-exacerbated respiratory disease (AERD) is an acquired inflammatory condition characterized by the presence of asthma, chronic rhinosinusitis with nasal polyposis, and respiratory hypersensitivity reactions on ingestion of aspirin or other nonsteroidal anti-inflammatory drugs (NSAIDs). Despite AERD having a classic constellation of symptoms, the diagnosis is often overlooked, with an average of greater than 10 years between the onset of symptoms and diagnosis of AERD. Without a diagnosis, individuals will lack opportunities to receive effective treatments, such as aspirin desensitization or biologic medications. Objective: Our aim was to develop a combined algorithm that integrates both natural language processing (NLP) and machine learning (ML) techniques to identify patients with AERD from an electronic health record (EHR). Methods: A rule-based decision tree algorithm incorporating NLP-based features was developed using clinical documents from the EHR at Mayo Clinic. From clinical notes, using NLP techniques, 7 features were extracted that included the following: AERD, asthma, NSAID allergy, nasal polyps, chronic sinusitis, elevated urine leukotriene E4 level, and documented no-NSAID allergy. 
MedTagger was used to extract these 7 features from the unstructured clinical text given a set of keywords and patterns based on the chart review of 2 allergy and immunology experts for AERD. The status of each extracted feature was quantified by assigning the frequency of its occurrence in clinical documents per subject. We optimized the decision tree classifier's hyperparameter (cutoff threshold) on the training set to determine the representative feature combination to discriminate AERD. We then evaluated the resulting model on the test set. Results: The AERD algorithm, which combines NLP and ML techniques, achieved an area under the receiver operating characteristic curve score, sensitivity, and specificity of 0.86 (95\% CI 0.78-0.94), 80.00 (95\% CI 70.82-87.33), and 88.00 (95\% CI 79.98-93.64) for the test set, respectively. Conclusions: We developed a promising AERD algorithm that needs further refinement to improve AERD diagnosis. Continued development of NLP and ML technologies has the potential to reduce diagnostic delays for AERD and improve the health of our patients. ", doi="10.2196/44191", url="https://ai.jmir.org/2023/1/e44191" } @Article{info:doi/10.2196/42337, author="Wieland, Fluri and Nigg, Claudio", title="A Trainable Open-Source Machine Learning Accelerometer Activity Recognition Toolbox: Deep Learning Approach", journal="JMIR AI", year="2023", month="Jun", day="8", volume="2", pages="e42337", keywords="activity classification", keywords="deep learning", keywords="accelerometry", keywords="open source", keywords="activity recognition", keywords="machine learning", keywords="activity recorder", keywords="digital health application", keywords="smartphone app", keywords="deep learning algorithm", keywords="sensor device", abstract="Background: The movement determination software in current activity trackers is insufficiently accurate for scientific applications, and these trackers are not open-source. 
Objective: To address this issue, we developed an accurate, trainable, and open-source smartphone-based activity-tracking toolbox that consists of an Android app (HumanActivityRecorder) and 2 different deep learning algorithms that can be adapted to new behaviors. Methods: We employed a semisupervised deep learning approach to identify the different classes of activity based on accelerometry and gyroscope data, using both our own data and open competition data. Results: Our approach is robust against variation in sampling rate and sensor dimensional input and achieved an accuracy of around 87\% in classifying 6 different behaviors on both our own recorded data and the MotionSense data. However, when the dimension-adaptive neural architecture model was tested on our own data, its accuracy dropped to 26\%, whereas our algorithm still performed at 63\% on the MotionSense data used to train the dimension-adaptive neural architecture model, demonstrating the superiority of our approach. Conclusions: HumanActivityRecorder is a versatile, retrainable, open-source, and accurate toolbox that is continually tested on new data. This enables researchers to adapt to the behavior being measured and achieve repeatability in scientific studies. ", doi="10.2196/42337", url="https://ai.jmir.org/2023/1/e42337", url="http://www.ncbi.nlm.nih.gov/pubmed/38875548" } @Article{info:doi/10.2196/42884, author="Hsu, Enshuo and Bako, T. Abdulaziz and Potter, Thomas and Pan, P. Alan and Britz, W. Gavin and Tannous, Jonika and Vahidy, S. 
Farhaan", title="Extraction of Radiological Characteristics From Free-Text Imaging Reports Using Natural Language Processing Among Patients With Ischemic and Hemorrhagic Stroke: Algorithm Development and Validation", journal="JMIR AI", year="2023", month="Jun", day="6", volume="2", pages="e42884", keywords="natural language processing", keywords="deep learning", keywords="electronic health records", keywords="ischemic stroke", keywords="cerebral hemorrhage", keywords="neuroimaging", keywords="computed tomography", keywords="stroke", keywords="radiology", abstract="Background: Neuroimaging is the gold-standard diagnostic modality for all patients suspected of stroke. However, the unstructured nature of imaging reports remains a major challenge to extracting useful information from electronic health records systems. Despite the increasing adoption of natural language processing (NLP) for radiology reports, information extraction for many stroke imaging features has not been systematically evaluated. Objective: In this study, we propose an NLP pipeline, which adopts the state-of-the-art ClinicalBERT model with domain-specific pretraining and task-oriented fine-tuning to extract 13 stroke features from head computed tomography imaging notes. Methods: We used the model to generate structured data sets with information on the presence or absence of common stroke features for 24,924 patients with strokes. We compared the survival characteristics of patients with and without features of severe stroke (eg, midline shift, perihematomal edema, or mass effect) using the Kaplan-Meier curve and log-rank tests. Results: Pretrained on 82,073 head computed tomography notes with 13.7 million words and fine-tuned on 200 annotated notes, our HeadCT\_BERT model achieved an average area under receiver operating characteristic curve of 0.9831, F1-score of 0.8683, and accuracy of 97\%. 
Among patients with acute ischemic stroke, admissions with any severe stroke feature in initial imaging notes were associated with a lower probability of survival (P<.001). Conclusions: Our proposed NLP pipeline achieved high performance and has the potential to improve medical research and patient safety. ", doi="10.2196/42884", url="https://ai.jmir.org/2023/1/e42884", url="http://www.ncbi.nlm.nih.gov/pubmed/38875556" } @Article{info:doi/10.2196/44537, author="Lee, Kyeryoung and Liu, Zongzhi and Chandran, Urmila and Kalsekar, Iftekhar and Laxmanan, Balaji and Higashi, K. Mitchell and Jun, Tomi and Ma, Meng and Li, Minghao and Mai, Yun and Gilman, Christopher and Wang, Tongyu and Ai, Lei and Aggarwal, Parag and Pan, Qi and Oh, William and Stolovitzky, Gustavo and Schadt, Eric and Wang, Xiaoyan", title="Detecting Ground Glass Opacity Features in Patients With Lung Cancer: Automated Extraction and Longitudinal Analysis via Deep Learning--Based Natural Language Processing", journal="JMIR AI", year="2023", month="Jun", day="1", volume="2", pages="e44537", keywords="natural language processing", keywords="ground glass opacity", keywords="real world data", keywords="radiology notes", keywords="longitudinal analysis", keywords="deep learning", keywords="bidirectional long short-term memory (Bi-LSTM)", keywords="conditional random fields (CRF)", abstract="Background: Ground-glass opacities (GGOs) appearing in computed tomography (CT) scans may indicate potential lung malignancy. Proper management of GGOs based on their features can prevent the development of lung cancer. Electronic health records are rich sources of information on GGO nodules and their granular features, but most of the valuable information is embedded in unstructured clinical notes. 
Objective: We aimed to develop, test, and validate a deep learning--based natural language processing (NLP) tool that automatically extracts GGO features to inform the longitudinal trajectory of GGO status from large-scale radiology notes. Methods: We developed a bidirectional long short-term memory with a conditional random field--based deep-learning NLP pipeline to extract GGO and granular features of GGO retrospectively from radiology notes of 13,216 lung cancer patients. We evaluated the pipeline with quality assessments and analyzed cohort characterization of the distribution of nodule features longitudinally to assess changes in size and solidity over time. Results: Our NLP pipeline built on the GGO ontology we developed achieved between 95\% and 100\% precision, 89\% and 100\% recall, and 92\% and 100\% F1-scores on different GGO features. We deployed this GGO NLP model to extract and structure comprehensive characteristics of GGOs from 29,496 radiology notes of 4521 lung cancer patients. Longitudinal analysis revealed that size increased in 16.8\% (240/1424) of patients, decreased in 14.6\% (208/1424), and remained unchanged in 68.5\% (976/1424) in their last note compared to the first note. Among 1127 patients who had longitudinal radiology notes of GGO status, 815 (72.3\%) were reported to have stable status, and 259 (23\%) had increased/progressed status in the subsequent notes. Conclusions: Our deep learning--based NLP pipeline can automatically extract granular GGO features at scale from electronic health records when this information is documented in radiology notes and help inform the natural history of GGO. This will open the way for a new paradigm in lung cancer prevention and early detection. ", doi="10.2196/44537", url="https://ai.jmir.org/2023/1/e44537" } @Article{info:doi/10.2196/45450, author="Chan, Berin Nicholas and Li, Weizi and Aung, Theingi and Bazuaye, Eghosa and Montero, M. 
Rosa", title="Machine Learning--Based Time in Patterns for Blood Glucose Fluctuation Pattern Recognition in Type 1 Diabetes Management: Development and Validation Study", journal="JMIR AI", year="2023", month="May", day="26", volume="2", pages="e45450", keywords="diabetes mellitus", keywords="continuous glucose monitoring", keywords="glycemic variability", keywords="glucose fluctuation pattern", keywords="temporal clustering", keywords="scalable metrics", abstract="Background: Continuous glucose monitoring (CGM) for diabetes combines noninvasive glucose biosensors, continuous monitoring, cloud computing, and analytics to connect and simulate a hospital setting in a person's home. CGM systems inspired analytics methods to measure glycemic variability (GV), but existing GV analytics methods disregard glucose trends and patterns; hence, they fail to capture entire temporal patterns and do not provide granular insights about glucose fluctuations. Objective: This study aimed to propose a machine learning--based framework for blood glucose fluctuation pattern recognition, which enables a more comprehensive representation of GV profiles that could present detailed fluctuation information, be easily understood by clinicians, and provide insights about patient groups based on time in blood fluctuation patterns. Methods: Overall, 1.5 million measurements from 126 patients in the United Kingdom with type 1 diabetes mellitus (T1DM) were collected, and prevalent blood fluctuation patterns were extracted using dynamic time warping. The patterns were further validated in 225 patients in the United States with T1DM. Hierarchical clustering was then applied on time in patterns to form 4 clusters of patients. Patient groups were compared using statistical analysis. Results: In total, 6 patterns depicting distinctive glucose levels and trends were identified and validated, based on which 4 GV profiles of patients with T1DM were found. 
They were significantly different in terms of glycemic statuses such as diabetes duration (P=.04), glycated hemoglobin level (P<.001), and time in range (P<.001) and thus had different management needs. Conclusions: The proposed method can analytically extract existing blood fluctuation patterns from CGM data. Thus, time in patterns can capture a rich view of patients' GV profile. Its conceptual resemblance with time in range, along with rich blood fluctuation details, makes it more scalable, accessible, and informative to clinicians. ", doi="10.2196/45450", url="https://ai.jmir.org/2023/1/e45450" } @Article{info:doi/10.2196/44779, author="Naseri, Hossein and Skamene, Sonia and Tolba, Marwan and Faye, Daro Mame and Ramia, Paul and Khriguian, Julia and David, Marc and Kildea, John", title="A Scalable Radiomics- and Natural Language Processing--Based Machine Learning Pipeline to Distinguish Between Painful and Painless Thoracic Spinal Bone Metastases: Retrospective Algorithm Development and Validation Study", journal="JMIR AI", year="2023", month="May", day="22", volume="2", pages="e44779", keywords="cancer", keywords="pain", keywords="palliative care", keywords="radiotherapy", keywords="bone metastases", keywords="radiomics", keywords="natural language processing", keywords="machine learning", keywords="artificial intelligent", keywords="radiation therapy", abstract="Background: The identification of objective pain biomarkers can contribute to an improved understanding of pain, as well as its prognosis and better management. Hence, it has the potential to improve the quality of life of patients with cancer. Artificial intelligence can aid in the extraction of objective pain biomarkers for patients with cancer with bone metastases (BMs). 
Objective: This study aimed to develop and evaluate a scalable natural language processing (NLP)-- and radiomics-based machine learning pipeline to differentiate between painless and painful BM lesions in simulation computed tomography (CT) images using imaging features (biomarkers) extracted from lesion center point--based regions of interest (ROIs). Methods: Patients treated at our comprehensive cancer center who received palliative radiotherapy for thoracic spine BM between January 2016 and September 2019 were included in this retrospective study. Physician-reported pain scores were extracted automatically from radiation oncology consultation notes using an NLP pipeline. BM center points were manually pinpointed on CT images by radiation oncologists. Nested ROIs with various diameters were automatically delineated around these expert-identified BM center points, and radiomics features were extracted from each ROI. Synthetic Minority Oversampling Technique resampling, the Least Absolute Shrinkage And Selection Operator feature selection method, and various machine learning classifiers were evaluated using precision, recall, F1-score, and area under the receiver operating characteristic curve. Results: Radiation therapy consultation notes and simulation CT images of 176 patients (mean age 66, SD 14 years; 95 males) with thoracic spine BM were included in this study. After BM center point identification, 107 radiomics features were extracted from each spherical ROI using pyradiomics. Data were divided into 70\% and 30\% training and hold-out test sets, respectively. In the test set, the accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve of our best performing model (neural network classifier on an ensemble ROI) were 0.82 (132/163), 0.59 (16/27), 0.85 (116/136), and 0.83, respectively. 
Conclusions: Our NLP- and radiomics-based machine learning pipeline was successful in differentiating between painful and painless BM lesions. It is intrinsically scalable by using NLP to extract pain scores from clinical notes and by requiring only center points to identify BM lesions in CT images. ", doi="10.2196/44779", url="https://ai.jmir.org/2023/1/e44779", url="http://www.ncbi.nlm.nih.gov/pubmed/38875572" } @Article{info:doi/10.2196/41868, author="Bozorgmehr, Arezoo and Weltermann, Birgitta", title="Prediction of Chronic Stress and Protective Factors in Adults: Development of an Interpretable Prediction Model Based on XGBoost and SHAP Using National Cross-sectional DEGS1 Data", journal="JMIR AI", year="2023", month="May", day="16", volume="2", pages="e41868", keywords="artificial intelligence", keywords="machine learning", keywords="prognostic", keywords="model", keywords="chronic stress", keywords="resilience factors", keywords="interpretable model", keywords="explainability", keywords="stress", keywords="disease", keywords="diabetes", keywords="cancer", keywords="dataset", keywords="clinical", keywords="data", keywords="gender", keywords="social support", keywords="support", keywords="intervention", keywords="SHAP", abstract="Background: Chronic stress is highly prevalent in the German population. It has known adverse effects on mental health, such as burnout and depression. Known long-term effects of chronic stress are cardiovascular disease, diabetes, and cancer. Objective: This study aims to derive an interpretable multiclass machine learning model for predicting chronic stress levels and factors protecting against chronic stress based on representative nationwide data from the German Health Interview and Examination Survey for Adults, which is part of the national health monitoring program. 
Methods: A data set from the German Health Interview and Examination Survey for Adults study including demographic, clinical, and laboratory data from 5801 participants was analyzed. A multiclass eXtreme Gradient Boosting (XGBoost) model was constructed to classify participants into 3 categories including low, middle, and high chronic stress levels. The model's performance was evaluated using the area under the receiver operating characteristic curve, precision, recall, specificity, and the F1-score. Additionally, SHapley Additive exPlanations was used to interpret the XGBoost prediction model and to identify factors protecting against chronic stress. Results: The multiclass XGBoost model exhibited the following macroaverage scores: an area under the receiver operating characteristic curve of 81\%, precision of 63\%, recall of 52\%, specificity of 78\%, and F1-score of 54\%. The most important features for low-level chronic stress were male gender, very good general health, high satisfaction with living space, and strong social support. Conclusions: This study presents a multiclass interpretable prediction model for chronic stress in adults in Germany. The explainable artificial intelligence technique SHapley Additive exPlanations identified relevant protective factors for chronic stress, which need to be considered when developing interventions to reduce chronic stress. 
", doi="10.2196/41868", url="https://ai.jmir.org/2023/1/e41868", url="http://www.ncbi.nlm.nih.gov/pubmed/38875576" } @Article{info:doi/10.2196/44293, author="Oniani, David and Chandrasekar, Premkumar and Sivarajkumar, Sonish and Wang, Yanshan", title="Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study", journal="JMIR AI", year="2023", month="May", day="4", volume="2", pages="e44293", keywords="few-shot learning", keywords="FSL", keywords="Siamese neural network", keywords="SNN", keywords="natural language processing", keywords="NLP", keywords="neural networks", abstract="Background: Natural language processing (NLP) has become an emerging technology in health care that leverages a large amount of free-text data in electronic health records to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Recently, deep learning has achieved state-of-the-art performance in many clinical NLP tasks. However, training deep learning models often requires large, annotated data sets, which are normally not publicly available and can be time-consuming to build in clinical domains. Working with smaller annotated data sets is typical in clinical NLP; therefore, ensuring that deep learning models perform well is crucial for real-world clinical NLP applications. A widely adopted approach is fine-tuning existing pretrained language models, but these attempts fall short when the training data set contains only a few annotated samples. Few-shot learning (FSL) has recently been investigated to tackle this problem. Siamese neural network (SNN) has been widely used as an FSL approach in computer vision but has not been studied well in NLP. Furthermore, the literature on its applications in clinical domains is scarce. Objective: The aim of our study is to propose and evaluate SNN-based approaches for few-shot clinical NLP tasks. 
Methods: We propose 2 SNN-based FSL approaches, including pretrained SNN and SNN with second-order embeddings. We evaluate the proposed approaches on the clinical sentence classification task. We experiment with 3 few-shot settings, including 4-shot, 8-shot, and 16-shot learning. The clinical NLP task is benchmarked using the following 4 pretrained language models: bidirectional encoder representations from transformers (BERT), BERT for biomedical text mining (BioBERT), BioBERT trained on clinical notes (BioClinicalBERT), and generative pretrained transformer 2 (GPT-2). We also present a performance comparison between SNN-based approaches and the prompt-based GPT-2 approach. Results: In 4-shot sentence classification tasks, GPT-2 had the highest precision (0.63), but its recall (0.38) and F score (0.42) were lower than those of BioBERT-based pretrained SNN (0.45 and 0.46, respectively). In both 8-shot and 16-shot settings, SNN-based approaches outperformed GPT-2 in all 3 metrics of precision, recall, and F score. Conclusions: The experimental results verified the effectiveness of the proposed SNN approaches for few-shot clinical NLP tasks. 
", doi="10.2196/44293", url="https://ai.jmir.org/2023/1/e44293", url="http://www.ncbi.nlm.nih.gov/pubmed/38875537" } @Article{info:doi/10.2196/40755, author="Steiger, Edgar and Kroll, Eric Lars", title="Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework", journal="JMIR AI", year="2023", month="Apr", day="21", volume="2", pages="e40755", keywords="electronic health records", keywords="ICD", keywords="machine learning", keywords="health care", keywords="data", keywords="diagnosis", keywords="model", keywords="drug", keywords="drug prescription", keywords="performance", keywords="applications", keywords="quality", keywords="prevention", abstract="Background: In health care, diagnosis codes in claims data and electronic health records (EHRs) play an important role in data-driven decision making. Any analysis that uses a patient's diagnosis codes to predict future outcomes or describe morbidity requires a numerical representation of this diagnosis profile made up of string-based diagnosis codes. These numerical representations are especially important for machine learning models. Most commonly, binary-encoded representations have been used, usually for a subset of diagnoses. In real-world health care applications, several issues arise: patient profiles show high variability even when the underlying diseases are the same, they may have gaps and not contain all available information, and a large number of appropriate diagnoses must be considered. Objective: We herein present Pat2Vec, a self-supervised machine learning framework inspired by neural network--based natural language processing that embeds complete diagnosis profiles into a small real-valued numerical vector. 
Methods: Based on German outpatient claims data with diagnosis codes according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), we discovered an optimal vectorization embedding model for patient diagnosis profiles with Bayesian optimization for the hyperparameters. The calibration process ensured a robust embedding model for health care--relevant tasks by aggregating the metrics of different regression and classification tasks using different machine learning algorithms (linear and logistic regression as well as gradient-boosted trees). The models were tested against a baseline model that binary encodes the most common diagnoses. The study used diagnosis profiles and supplementary data from more than 10 million patients from 2016 to 2019 and was based on the largest German ambulatory claims data set. To describe subpopulations in health care, we identified clusters (via density-based clustering) and visualized patient vectors in 2D (via dimensionality reduction with uniform manifold approximation). Furthermore, we applied our vectorization model to predict prospective drug prescription costs based on patients' diagnoses. Results: Our final models outperform the baseline model (binary encoding) with equal dimensions. They are more robust to missing data and show large performance gains, particularly in lower dimensions, demonstrating the embedding model's compression of nonlinear information. In the future, other sources of health care data can be integrated into the current diagnosis-based framework. Other researchers can apply our publicly shared embedding model to their own diagnosis data. Conclusions: We envision a wide range of applications for Pat2Vec that will improve health care quality, including personalized prevention and signal detection in patient surveillance as well as health care resource planning based on subcohorts identified by our data-driven machine learning framework. 
", doi="10.2196/40755", url="https://ai.jmir.org/2023/1/e40755" } @Article{info:doi/10.2196/40702, author="Duh, Montserrat Maria and Torra-Ferrer, Neus and Riera-Mar{\'i}n, Meritxell and Cumelles, D{\'i}dac and Rodr{\'i}guez-Comas, J{\'u}lia and Garc{\'i}a L{\'o}pez, Javier and Fern{\'a}ndez Planas, Teresa M. {\textordfeminine}", title="Deep Learning to Detect Pancreatic Cystic Lesions on Abdominal Computed Tomography Scans: Development and Validation Study", journal="JMIR AI", year="2023", month="Mar", day="17", volume="2", pages="e40702", keywords="deep learning", keywords="pancreatic cystic lesion", keywords="neural networks", keywords="precursor lesions", keywords="pancreatic cancer", keywords="computed tomography", keywords="magnetic resonance", keywords="cancer", keywords="radiologist", keywords="technology", abstract="Background: Pancreatic cystic lesions (PCLs) are frequent and underreported incidental findings on computed tomography (CT) scans and can evolve to pancreatic cancer---the most lethal cancer, with less than 5 months of life expectancy. Objective: The aim of this study was to develop and validate an artificial deep neural network (attention gate U-Net, also named ``AGNet'') for automated detection of PCLs. This kind of technology can help radiologists to cope with an increasing demand of cross-sectional imaging tests and increase the number of PCLs incidentally detected, thus increasing the early detection of pancreatic cancer. Methods: We adapted and evaluated an algorithm based on an attention gate U-Net architecture for automated detection of PCL on CTs. A total of 335 abdominal CTs with PCLs and control cases were manually segmented in 3D by 2 radiologists with over 10 years of experience in consensus with a board-certified radiologist specialized in abdominal radiology. 
This information was used to train a neural network for segmentation followed by a postprocessing pipeline that filtered the results of the network and applied some physical constraints, such as the expected position of the pancreas, to minimize the number of false positives. Results: Of 335 studies included in this study, 297 had a PCL, including serous cystadenoma, intraductal pseudopapillary mucinous neoplasia, mucinous cystic neoplasm, and pseudocysts. The Shannon Index of the chosen data set was 0.991 with an evenness of 0.902. The mean sensitivity obtained in the detection of these lesions was 93.1\% (SD 0.1\%), and the specificity was 81.8\% (SD 0.1\%). Conclusions: This study shows a good performance of an automated artificial deep neural network in the detection of PCL on both noncontrast- and contrast-enhanced abdominal CT scans. ", doi="10.2196/40702", url="https://ai.jmir.org/2023/1/e40702", url="http://www.ncbi.nlm.nih.gov/pubmed/38875547" } @Article{info:doi/10.2196/41264, author="Nurmambetova, Elvira and Pan, Jie and Zhang, Zilong and Wu, Guosong and Lee, Seungwon and Southern, A. Danielle and Martin, A. Elliot and Ho, Chester and Xu, Yuan and Eastwood, A. 
Cathy", title="Developing an Inpatient Electronic Medical Record Phenotype for Hospital-Acquired Pressure Injuries: Case Study Using Natural Language Processing Models", journal="JMIR AI", year="2023", month="Mar", day="8", volume="2", pages="e41264", keywords="pressure injury", keywords="natural language processing", keywords="NLP", keywords="algorithm", keywords="phenotype algorithm", keywords="phenotyping algorithm", keywords="machine learning", keywords="electronic medical record", keywords="EMR", keywords="pressure sore", keywords="pressure wound", keywords="pressure ulcer", keywords="pressure injuries", keywords="detect", abstract="Background: Surveillance of hospital-acquired pressure injuries (HAPI) is often suboptimal when relying on administrative health data, as International Classification of Diseases (ICD) codes are known to have long delays and are undercoded. We leveraged natural language processing (NLP) applications on free-text notes, particularly the inpatient nursing notes, from electronic medical records (EMRs), to more accurately and timely identify HAPIs. Objective: This study aimed to show that EMR-based phenotyping algorithms are more fitted to detect HAPIs than ICD-10-CA algorithms alone, while the clinical logs are recorded with higher accuracy via NLP using nursing notes. Methods: Patients with HAPIs were identified from head-to-toe skin assessments in a local tertiary acute care hospital during a clinical trial that took place from 2015 to 2018 in Calgary, Alberta, Canada. Clinical notes documented during the trial were extracted from the EMR database after the linkage with the discharge abstract database. Different combinations of several types of clinical notes were processed by sequential forward selection during the model development. Text classification algorithms for HAPI detection were developed using random forest (RF), extreme gradient boosting (XGBoost), and deep learning models. 
The classification threshold was tuned to enable the model to achieve similar specificity to an ICD-based phenotyping study. Each model's performance was assessed, and comparisons were made between the metrics, including sensitivity, positive predictive value, negative predictive value, and F1-score. Results: Data from 280 eligible patients were used in this study, among whom 97 patients had HAPIs during the trial. RF was the optimal performing model with a sensitivity of 0.464 (95\% CI 0.365-0.563), specificity of 0.984 (95\% CI 0.965-1.000), and F1-score of 0.612 (95\% CI of 0.473-0.751). The machine learning (ML) model reached higher sensitivity without sacrificing much specificity compared to the previously reported performance of ICD-based algorithms. Conclusions: The EMR-based NLP phenotyping algorithms demonstrated improved performance in HAPI case detection over ICD-10-CA codes alone. Daily generated nursing notes in EMRs are a valuable data resource for ML models to accurately detect adverse events. The study contributes to enhancing automated health care quality and safety surveillance. 
", doi="10.2196/41264", url="https://ai.jmir.org/2023/1/e41264" } @Article{info:doi/10.2196/40167, author="Sekandi, Nabbuye Juliet and Shi, Weili and Zhu, Ronghang and Kaggwa, Patrick and Mwebaze, Ernest and Li, Sheng", title="Application of Artificial Intelligence to the Monitoring of Medication Adherence for Tuberculosis Treatment in Africa: Algorithm Development and Validation", journal="JMIR AI", year="2023", month="Feb", day="23", volume="2", pages="e40167", keywords="artificial intelligence", keywords="deep learning", keywords="machine learning", keywords="medication adherence", keywords="digital technology", keywords="digital health", keywords="tuberculosis", keywords="video directly observed therapy", keywords="video therapy", abstract="Background: Artificial intelligence (AI) applications based on advanced deep learning methods in image recognition tasks can increase efficiency in the monitoring of medication adherence through automation. AI has sparsely been evaluated for the monitoring of medication adherence in clinical settings. However, AI has the potential to transform the way health care is delivered even in limited-resource settings such as Africa. Objective: We aimed to pilot the development of a deep learning model for simple binary classification and confirmation of proper medication adherence to enhance efficiency in the use of video monitoring of patients in tuberculosis treatment. Methods: We used a secondary data set of 861 video images of medication intake that were collected from consenting adult patients with tuberculosis in an institutional review board--approved study evaluating video-observed therapy in Uganda. The video images were processed through a series of steps to prepare them for use in a training model. First, we annotated videos using a specific protocol to eliminate those with poor quality. After the initial annotation step, 497 videos had sufficient quality for training the models. 
Among them, 405 were positive samples, whereas 92 were negative samples. With some preprocessing techniques, we obtained 160 frames with a size of 224 {\texttimes} 224 in each video. We used a deep learning framework that leveraged 4 convolutional neural networks models to extract visual features from the video frames and automatically perform binary classification of adherence or nonadherence. We evaluated the diagnostic properties of the different models using sensitivity, specificity, F1-score, and precision. The area under the curve (AUC) was used to assess the discriminative performance and the speed per video review as a metric for model efficiency. We conducted a 5-fold internal cross-validation to determine the diagnostic and discriminative performance of the models. We did not conduct external validation due to a lack of publicly available data sets with specific medication intake video frames. Results: Diagnostic properties and discriminative performance from internal cross-validation were moderate to high in the binary classification tasks with 4 selected automated deep learning models. The sensitivity ranged from 92.8 to 95.8\%, specificity from 43.5 to 55.4\%, F1-score from 0.91 to 0.92, precision from 88\% to 90.1\%, and AUC from 0.78 to 0.85. The 3D ResNet model had the highest precision, AUC, and speed. Conclusions: All 4 deep learning models showed comparable diagnostic properties and discriminative performance. The findings serve as a reasonable proof of concept to support the potential application of AI in the binary classification of video frames to predict medication adherence. ", doi="10.2196/40167", url="https://ai.jmir.org/2023/1/e40167", url="http://www.ncbi.nlm.nih.gov/pubmed/38464947" } @Article{info:doi/10.2196/42253, author="Bowers, Anne and Drake, Chelsea and Makarkin, E. 
Alexi and Monzyk, Robert and Maity, Biswajit and Telle, Andrew", title="Predicting Patient Mortality for Earlier Palliative Care Identification in Medicare Advantage Plans: Features of a Machine Learning Model", journal="JMIR AI", year="2023", month="Feb", day="20", volume="2", pages="e42253", keywords="palliative", keywords="palliative care", keywords="machine learning", keywords="social determinants", keywords="Medicare Advantage", keywords="Medicare", keywords="predict", keywords="algorithm", keywords="mortality", keywords="older adult", abstract="Background: Machine learning (ML) can offer greater precision and sensitivity in predicting Medicare patient end of life and potential need for palliative services compared to provider recommendations alone. However, earlier ML research on older community dwelling Medicare beneficiaries has provided insufficient exploration of key model feature impacts and the role of the social determinants of health. Objective: This study describes the development of a binary classification ML model predicting 1-year mortality among Medicare Advantage plan members aged {$\geq$}65 years (N=318,774) and further examines the top features of the predictive model. Methods: A light gradient-boosted trees model configuration was selected based on 5-fold cross-validation. The model was trained with 80\% of cases (n=255,020) using randomized feature generation periods, with 20\% (n=63,754) reserved as a holdout for validation. The final algorithm used 907 feature inputs extracted primarily from claims and administrative data capturing patient diagnoses, service utilization, demographics, and census tract--based social determinants index measures. Results: The total sample had an actual mortality prevalence of 3.9\% in the 2018 outcome period. 
The final model correctly predicted 44.2\% of patient expirations among the top 1\% of highest risk members (AUC=0.84; 95\% CI 0.83-0.85) versus 24.0\% predicted by the model iteration using only age, gender, and select high-risk utilization features (AUC=0.74; 95\% CI 0.73-0.74). The most important algorithm features included patient demographics, diagnoses, pharmacy utilization, mean costs, and certain social determinants of health. Conclusions: The final ML model better predicts Medicare Advantage member end of life using a variety of routinely collected data and supports earlier patient identification for palliative care. ", doi="10.2196/42253", url="https://ai.jmir.org/2023/1/e42253" } @Article{info:doi/10.2196/40843, author="Chenais, Gabrielle and Gil-Jardin{\'e}, C{\'e}dric and Touchais, H{\'e}l{\`e}ne and Avalos Fernandez, Marta and Contrand, Benjamin and Tellier, Eric and Combes, Xavier and Bourdois, Loick and Revel, Philippe and Lagarde, Emmanuel", title="Deep Learning Transformer Models for Building a Comprehensive and Real-time Trauma Observatory: Development and Validation Study", journal="JMIR AI", year="2023", month="Jan", day="12", volume="2", pages="e40843", keywords="deep learning", keywords="public health", keywords="trauma", keywords="emergencies", keywords="natural language processing", keywords="transformers", abstract="Background: Public health surveillance relies on the collection of data, often in near-real time. Recent advances in natural language processing make it possible to envisage an automated system for extracting information from electronic health records. Objective: To study the feasibility of setting up a national trauma observatory in France, we compared the performance of several automatic language processing methods in a multiclass classification task of unstructured clinical notes. 
Methods: A total of 69,110 free-text clinical notes related to visits to the emergency departments of the University Hospital of Bordeaux, France, between 2012 and 2019 were manually annotated. Among these clinical notes, 32.5\% (22,481/69,110) were traumas. We trained 4 transformer models (deep learning models that encompass attention mechanism) and compared them with the term frequency--inverse document frequency associated with the support vector machine method. Results: The transformer models consistently performed better than the term frequency--inverse document frequency and a support vector machine. Among the transformers, the GPTanam model pretrained with a French corpus with an additional autosupervised learning step on 306,368 unlabeled clinical notes showed the best performance with a micro F1-score of 0.969. Conclusions: The transformers proved efficient at the multiclass classification of narrative and medical data. Further steps for improvement should focus on the expansion of abbreviations and multioutput multiclass classification. ", doi="10.2196/40843", url="https://ai.jmir.org/2023/1/e40843", url="http://www.ncbi.nlm.nih.gov/pubmed/38875539" } @Article{info:doi/10.2196/41030, author="Lee, Chanjung and Jo, Brian and Woo, Hyunki and Im, Yoori and Park, Woong Rae and Park, ChulHyoung", title="Chronic Disease Prediction Using the Common Data Model: Development Study", journal="JMIR AI", year="2022", month="Dec", day="22", volume="1", number="1", pages="e41030", keywords="common data model", keywords="chronic disease", keywords="prediction model", keywords="machine learning", keywords="disease management", keywords="data model", keywords="disease prediction", keywords="prediction", keywords="risk prediction", keywords="risk factors", keywords="health risk", abstract="Background: Chronic disease management is a major health issue worldwide. 
With the paradigm shift to preventive medicine, disease prediction modeling using machine learning is gaining importance for precise and accurate medical judgement. Objective: This study aimed to develop high-performance prediction models for 4 chronic diseases using the common data model (CDM) and machine learning and to confirm the possibility for the extension of the proposed models. Methods: In this study, 4 major chronic diseases---namely, diabetes, hypertension, hyperlipidemia, and cardiovascular disease---were selected, and a model for predicting their occurrence within 10 years was developed. For model development, the Atlas analysis tool was used to define the chronic disease to be predicted, and data were extracted from the CDM according to the defined conditions. A model for predicting each disease was built with 4 algorithms verified in previous studies, and the performance was compared after applying a grid search. Results: For the prediction of each disease, we applied 4 algorithms (logistic regression, gradient boosting, random forest, and extreme gradient boosting), and all models show greater than 80\% accuracy. As compared to the optimized model's performance, extreme gradient boosting presented the highest predictive performance for the 4 diseases (diabetes, hypertension, hyperlipidemia, and cardiovascular disease) with 80\% or greater and from 0.84 to 0.93 in area under the curve standards. Conclusions: This study demonstrates the possibility for the preemptive management of chronic diseases by predicting the occurrence of chronic diseases using the CDM and machine learning. With these models, the risk of developing major chronic diseases within 10 years can be demonstrated by identifying health risk factors using our chronic disease prediction machine learning model developed with the real-world data--based CDM and National Health Insurance Corporation examination data that individuals can easily obtain. 
", doi="10.2196/41030", url="https://ai.jmir.org/2022/1/e41030", url="http://www.ncbi.nlm.nih.gov/pubmed/38875545" } @Article{info:doi/10.2196/37508, author="Chen, Kun-Hui and Yang, Chih-Yu and Wang, Hsin-Yi and Ma, Hsiao-Li and Lee, Kuang-Sheng Oscar", title="Artificial Intelligence--Assisted Diagnosis of Anterior Cruciate Ligament Tears From Magnetic Resonance Images: Algorithm Development and Validation Study", journal="JMIR AI", year="2022", month="Jul", day="26", volume="1", number="1", pages="e37508", keywords="artificial intelligence", keywords="convolutional neural network", keywords="magnetic resonance imaging", keywords="MRI", keywords="deep learning", keywords="anterior cruciate ligament", keywords="sports medicine", keywords="machine learning", keywords="ligament", keywords="sport", keywords="diagnosis", keywords="tear", keywords="damage", keywords="imaging", keywords="development", keywords="validation", keywords="algorithm", abstract="Background: Anterior cruciate ligament (ACL) injuries are common in sports and are critical knee injuries that require prompt diagnosis. Magnetic resonance imaging (MRI) is a strong, noninvasive tool for detecting ACL tears, which requires training to read accurately. Clinicians with different experiences in reading MR images require different information for the diagnosis of ACL tears. Artificial intelligence (AI) image processing could be a promising approach in the diagnosis of ACL tears. Objective: This study sought to use AI to (1) diagnose ACL tears from complete MR images, (2) identify torn-ACL images from complete MR images with a diagnosis of ACL tears, and (3) differentiate intact-ACL and torn-ACL MR images from the selected MR images. Methods: The sagittal MR images of torn ACL (n=1205) and intact ACL (n=1018) from 800 cases and the complete knee MR images of 200 cases (100 torn ACL and 100 intact ACL) from patients aged 20-40 years were retrospectively collected. 
An AI approach using a convolutional neural network was applied to build models for the objective. The MR images of 200 independent cases (100 torn ACL and 100 intact ACL) were used as the test set for the models. The MR images of 40 randomly selected cases from the test set were used to compare the reading accuracy of ACL tears between the trained model and clinicians with different levels of experience. Results: The first model differentiated between torn-ACL, intact-ACL, and other images from complete MR images with an accuracy of 0.9946, and the sensitivity, specificity, precision, and F1-score were 0.9344, 0.9743, 0.8659, and 0.8980, respectively. The final accuracy for ACL-tear diagnosis was 0.96. The model showed a significantly higher reading accuracy than less experienced clinicians. The second model identified torn-ACL images from complete MR images with a diagnosis of ACL tear with an accuracy of 0.9943, and the sensitivity, specificity, precision, and F1-score were 0.9154, 0.9660, 0.8167, and 0.8632, respectively. The third model differentiated torn- and intact-ACL images with an accuracy of 0.9691, and the sensitivity, specificity, precision, and F1-score were 0.9827, 0.9519, 0.9632, and 0.9728, respectively. Conclusions: This study demonstrates the feasibility of using an AI approach to provide information to clinicians who need different information from MRI to diagnose ACL tears. ", doi="10.2196/37508", url="https://ai.jmir.org/2022/1/e37508", url="http://www.ncbi.nlm.nih.gov/pubmed/38875555" }