%0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66273 %T Lessons Learned From European Health Data Projects With Cancer Use Cases: Implementation of Health Standards and Internet of Things Semantic Interoperability %A Gyrard,Amelie %A Abedian,Somayeh %A Gribbon,Philip %A Manias,George %A van Nuland,Rick %A Zatloukal,Kurt %A Nicolae,Irina Emilia %A Danciu,Gabriel %A Nechifor,Septimiu %A Marti-Bonmati,Luis %A Mallol,Pedro %A Dalmiani,Stefano %A Autexier,Serge %A Jendrossek,Mario %A Avramidis,Ioannis %A Garcia Alvarez,Eva %A Holub,Petr %A Blanquer,Ignacio %A Boden,Anna %A Hussein,Rada %+ Trialog, 25 rue du Général Foy, Paris, 75008, France, 33 033 1 44 70 61, amelie.gyrard@trialog.com %K artificial intelligence %K cancer %K European Health Data Space %K health care standards %K interoperability %K AI %K health data %K cancer use cases %K IoT %K Internet of Things %K primary data %K diagnosis %K prognosis %K decision-making %D 2025 %7 24.3.2025 %9 Viewpoint %J J Med Internet Res %G English %X The adoption of the European Health Data Space (EHDS) regulation has made integrating health data critical for both primary and secondary applications. Primary use cases include patient diagnosis, prognosis, and treatment, while secondary applications support research, innovation, and regulatory decision-making. Additionally, leveraging large datasets improves training quality for artificial intelligence (AI) models, particularly in cancer prevention, prediction, and treatment personalization. The European Union (EU) has recently funded multiple projects under Europe’s Beating Cancer Plan. However, these projects face challenges related to fragmentation and the lack of standardization in metadata, data storage, access, and processing. 
This paper examines interoperability standards used in six EU-funded cancer-related projects: IDERHA (Integration of Heterogeneous Data and Evidence Towards Regulatory and Health Technology Assessments Acceptance), EUCAIM (European Cancer Imaging Initiative), ASCAPE (Artificial Intelligence Supporting Cancer Patients Across Europe), iHelp, BigPicture, and the HealthData@EU pilot. These initiatives aim to enhance the analysis of heterogeneous health data while aligning with EHDS implementation, specifically for the EHDS for the secondary use of data (EHDS2). Between October 2023 and July 2024, we organized meetings and workshops among these projects to assess how they adopt health standards and apply Internet of Things (IoT) semantic interoperability. The discussions focused on interoperability standards for health data, knowledge graphs, the data quality framework, patient-generated health data, AI reasoning, federated approaches, security, and privacy. Based on our findings, we developed a template for designing the EHDS2 interoperability framework in alignment with the new European Interoperability Framework (EIF) and EHDS governance standards. This template maps EHDS2-recommended standards to the EIF model and principles, linking the proposed EHDS2 data quality framework to relevant International Organization for Standardization (ISO) standards. Using this template, we analyzed and compared how the recommended EHDS2 standards were implemented across the studied projects. During workshops, project teams shared insights on overcoming interoperability challenges and their innovative approaches to bridging gaps in standardization. With support from HSbooster.eu, we facilitated collaboration among these projects to exchange knowledge on standards, legal implementation, project sustainability, and harmonization with EHDS2. 
The findings from this work, including the created template and lessons learned, will be compiled into an interactive toolkit for the EHDS2 interoperability framework. This toolkit will help existing and future projects align with EHDS2 technical and legal requirements, serving as a foundation for a common EHDS2 interoperability framework. Additionally, standardization efforts include participation in the development of ISO/IEC 21823-3:2021—Semantic Interoperability for IoT Systems. Since no ISO standard currently exists for digital pathology and AI-based image analysis for medical diagnostics, the BigPicture project is contributing to ISO/PWI 24051-2, which focuses on digital pathology and AI-based, whole-slide image analysis. Integrating these efforts with ongoing ISO initiatives can enhance global standardization and facilitate widespread adoption across health care systems. %M 40126534 %R 10.2196/66273 %U https://www.jmir.org/2025/1/e66273 %U https://doi.org/10.2196/66273 %U http://www.ncbi.nlm.nih.gov/pubmed/40126534 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e65729 %T Utility-based Analysis of Statistical Approaches and Deep Learning Models for Synthetic Data Generation With Focus on Correlation Structures: Algorithm Development and Validation %A Miletic,Marko %A Sariyar,Murat %+ Institute for Optimisation and Data Analysis (IODA), Bern University of Applied Sciences, Höheweg 80, Biel, 2502, Switzerland, 41 32 321 64 37, murat.sariyar@bfh.ch %K synthetic data generation %K medical data synthesis %K random forests %K simulation study %K deep learning %K propensity score mean-squared error %D 2025 %7 20.3.2025 %9 Original Paper %J JMIR AI %G English %X Background: Recent advancements in Generative Adversarial Networks and large language models (LLMs) have significantly advanced the synthesis and augmentation of medical data. 
These and other deep learning–based methods offer promising potential for generating high-quality, realistic datasets crucial for improving machine learning applications in health care, particularly in contexts where data privacy and availability are limiting factors. However, challenges remain in accurately capturing the complex associations inherent in medical datasets. Objective: This study evaluates the effectiveness of various Synthetic Data Generation (SDG) methods in replicating the correlation structures inherent in real medical datasets. In addition, it examines their performance in downstream tasks using Random Forests (RFs) as the benchmark model. To provide a comprehensive analysis, alternative models such as eXtreme Gradient Boosting and Gated Additive Tree Ensembles are also considered. We compare the following SDG approaches: Synthetic Populations in R (synthpop), copula, copulagan, Conditional Tabular Generative Adversarial Network (ctgan), tabular variational autoencoder (tvae), and tabula for LLMs. Methods: We evaluated synthetic data generation methods using both real-world and simulated datasets. Simulated data consist of 10 Gaussian variables and one binary target variable with varying correlation structures, generated via Cholesky decomposition. Real-world datasets include the body performance dataset with 13,393 samples for fitness classification, the Wisconsin Breast Cancer dataset with 569 samples for tumor diagnosis, and the diabetes dataset with 768 samples for diabetes prediction. Data quality is evaluated by comparing correlation matrices, the propensity score mean-squared error (pMSE) for general utility, and F1-scores for downstream tasks as a specific utility metric, using training on synthetic data and testing on real data. 
Results: Our simulation study, supplemented with real-world data analyses, shows that the statistical methods copula and synthpop consistently outperform deep learning approaches across various sample sizes and correlation complexities, with synthpop being the most effective. Deep learning methods, including LLMs, show mixed performance, particularly with smaller datasets or limited training epochs. LLMs often struggle to replicate numerical dependencies effectively. In contrast, methods like tvae with 10,000 epochs perform comparably well. On the body performance dataset, copulagan achieves the best performance in terms of pMSE. The results also highlight that model utility depends more on the relative correlations between features and the target variable than on the absolute magnitude of correlation matrix differences. Conclusions: Statistical methods, particularly synthpop, demonstrate superior robustness and utility preservation for synthetic tabular data compared with deep learning approaches. Copula methods show potential but face limitations with integer variables. Deep learning methods underperform in this context. Overall, these findings underscore the dominance of statistical methods for synthetic data generation for tabular data, while highlighting the niche potential of deep learning approaches for highly complex datasets, provided adequate resources and tuning. 
%M 40112290 %R 10.2196/65729 %U https://ai.jmir.org/2025/1/e65729 %U https://doi.org/10.2196/65729 %U http://www.ncbi.nlm.nih.gov/pubmed/40112290 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e58455 %T Obtaining the Most Accurate, Explainable Model for Predicting Chronic Obstructive Pulmonary Disease: Triangulation of Multiple Linear Regression and Machine Learning Methods %A Kamis,Arnold %A Gadia,Nidhi %A Luo,Zilin %A Ng,Shu Xin %A Thumbar,Mansi %+ Brandeis International Business School, Brandeis University, Sachar International Center, 415 South St, Waltham, MA, 02453, United States, 1 781 736 8544, akamis@brandeis.edu %K chronic obstructive pulmonary disease %K COPD %K cigarette smoking %K ethnic and racial differences %K machine learning %K multiple linear regression %K household income %K practical model %D 2024 %7 29.8.2024 %9 Original Paper %J JMIR AI %G English %X Background: Lung disease is a severe problem in the United States. Despite the decreasing rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) continues to be a health burden in the United States. In this paper, we focus on COPD in the United States from 2016 to 2019. Objective: We gathered a diverse set of non–personally identifiable information from public data sources to better understand and predict COPD rates at the core-based statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD. Methods: We integrated non–personally identifiable information from multiple Centers for Disease Control and Prevention sources and used them to analyze COPD with different types of methods. We included cigarette smoking, a well-known contributing factor, and race/ethnicity because health disparities among different races and ethnicities in the United States are also well known. 
The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods. Results: The most accurate multiple linear regression model has variance explained of 81.1%, mean absolute error of 0.591, and symmetric mean absolute percentage error of 9.666. The most accurate machine learning model has variance explained of 85.7%, mean absolute error of 0.456, and symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income are the strongest predictor variables. Moderately strong predictors include education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level. Conclusions: This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities in a model whose accuracy is superior to the best multiple linear regression. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health. 
%M 39207843 %R 10.2196/58455 %U https://ai.jmir.org/2024/1/e58455 %U https://doi.org/10.2196/58455 %U http://www.ncbi.nlm.nih.gov/pubmed/39207843 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e50800 %T Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation %A Lee,Kyeryoung %A Liu,Zongzhi %A Mai,Yun %A Jun,Tomi %A Ma,Meng %A Wang,Tongyu %A Ai,Lei %A Calay,Ediz %A Oh,William %A Stolovitzky,Gustavo %A Schadt,Eric %A Wang,Xiaoyan %+ GendDx (Sema4), 333 Ludlow Street, Stamford, CT, 06902, United States, 1 (844) 241 1233, xw108@caa.columbia.edu %K natural language processing %K real-world data %K clinical trial eligibility criteria %K eligibility criteria–specific ontology %K clinical trial protocol optimization %K data-driven approach %D 2024 %7 29.7.2024 %9 Original Paper %J JMIR AI %G English %X Background: Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocol, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to achieve these objectives. Objective: This study aims to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records using deep learning–based NLP techniques. Methods: We obtained data of 3281 industry-sponsored phase 2 or 3 interventional clinical trials recruiting patients with non–small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn disease from ClinicalTrials.gov, spanning the period between 2013 and 2020. 
A customized bidirectional long short-term memory– and conditional random field–based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms along with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of patients with non–small cell lung cancer (n=2775), curated from the Mount Sinai Health System, as a pilot study. Results: We manually annotated the clinical trial eligibility corpus (485/3281, 14.78% trials) and constructed an eligibility criteria–specific ontology. Our customized NLP pipeline, developed based on the eligibility criteria–specific ontology that we created through manual annotation, achieved high precision (0.91, range 0.67-1.00) and recall (0.79, range 0.50-1.00) scores, as well as a high F1-score (0.83, range 0.67-1.00), enabling the efficient extraction of granular criteria entities and relevant attributes from 3281 clinical trials. A standardized eligibility criteria knowledge base, compatible with electronic health records, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. In addition, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients. Conclusions: Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of patients eligible for the trial. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process, optimizing processes, and improving efficiency in patient identification. 
%M 39073872 %R 10.2196/50800 %U https://ai.jmir.org/2024/1/e50800 %U https://doi.org/10.2196/50800 %U http://www.ncbi.nlm.nih.gov/pubmed/39073872 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e51118 %T Multiscale Bowel Sound Event Spotting in Highly Imbalanced Wearable Monitoring Data: Algorithm Development and Validation Study %A Baronetto,Annalisa %A Graf,Luisa %A Fischer,Sarah %A Neurath,Markus F %A Amft,Oliver %+ Hahn-Schickard, Georges-Köhler-Allee 302, Freiburg, 79110, Germany, 49 761 887 865 ext 732, ab1591@students.uni-freiburg.de %K bowel sound %K deep learning %K event spotting %K wearable sensors %D 2024 %7 10.7.2024 %9 Original Paper %J JMIR AI %G English %X Background: Abdominal auscultation (i.e., listening to bowel sounds (BSs)) can be used to analyze digestion. An automated retrieval of BS would be beneficial to assess gastrointestinal disorders noninvasively. Objective: This study aims to develop a multiscale spotting model to detect BSs in continuous audio data from a wearable monitoring system. Methods: We designed a spotting model based on the Efficient-U-Net (EffUNet) architecture to analyze 10-second audio segments at a time and spot BSs with a temporal resolution of 25 ms. Evaluation data were collected across different digestive phases from 18 healthy participants and 9 patients with inflammatory bowel disease (IBD). Audio data were recorded in a daytime setting with a smart T-Shirt that embeds digital microphones. The data set was annotated by independent raters with substantial agreement (Cohen κ between 0.70 and 0.75), resulting in 136 hours of labeled data. In total, 11,482 BSs were analyzed, with a BS duration ranging between 18 ms and 6.3 seconds. The share of BSs in the data set (BS ratio) was 0.0089. We analyzed the performance depending on noise level, BS duration, and BS event rate. We also report spotting timing errors. 
Results: Leave-one-participant-out cross-validation of BS event spotting yielded a median F1-score of 0.73 for both healthy volunteers and patients with IBD. EffUNet detected BSs under different noise conditions with 0.73 recall and 0.72 precision. In particular, for a signal-to-noise ratio over 4 dB, more than 83% of BSs were recognized, with precision of 0.77 or more. EffUNet recall dropped below 0.60 for BS duration of 1.5 seconds or less. At a BS ratio greater than 0.05, the precision of our model was over 0.83. For both healthy participants and patients with IBD, insertion and deletion timing errors were the largest, with a total of 15.54 minutes of insertion errors and 13.08 minutes of deletion errors over the total audio data set. On our data set, EffUNet outperformed existing BS spotting models that provide similar temporal resolution. Conclusions: The EffUNet spotter is robust against background noise and can retrieve BSs with varying duration. EffUNet outperforms previous BS detection approaches in unmodified audio data, containing highly sparse BS events. 
%M 38985504 %R 10.2196/51118 %U https://ai.jmir.org/2024/1/e51118 %U https://doi.org/10.2196/51118 %U http://www.ncbi.nlm.nih.gov/pubmed/38985504 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e47805 %T Framework for Ranking Machine Learning Predictions of Limited, Multimodal, and Longitudinal Behavioral Passive Sensing Data: Combining User-Agnostic and Personalized Modeling %A Mullick,Tahsin %A Shaaban,Sam %A Radovic,Ana %A Doryab,Afsaneh %+ Department of Systems and Information Engineering, University of Virginia, Olsson Hall, 151 Engineer's Way, Charlottesville, VA, 22903, United States, 1 4349245393, tum7q@virginia.edu %K machine learning %K AI %K artificial intelligence %K passive sensing %K ranking framework %K small health data set %K ranking %K algorithm %K algorithms %K sensor %K multimodal %K predict %K prediction %K agnostic %K framework %K validation %K data set %D 2024 %7 20.5.2024 %9 Original Paper %J JMIR AI %G English %X Background: Passive mobile sensing provides opportunities for measuring and monitoring health status in the wild and outside of clinics. However, longitudinal, multimodal mobile sensor data can be small, noisy, and incomplete. This makes processing, modeling, and prediction of these data challenging. The small size of the data set restricts it from being modeled using complex deep learning networks. The current state of the art (SOTA) tackles small sensor data sets following a singular modeling paradigm based on traditional machine learning (ML) algorithms. These opt for either a user-agnostic modeling approach, which makes the model susceptible to a larger degree of noise, or a personalized approach, in which training on individual data amounts to a more limited data set and gives rise to overfitting; a trade-off must therefore be sought by choosing 1 of the 2 modeling approaches to reach predictions. 
Objective: The objective of this study was to filter, rank, and output the best predictions for small, multimodal, longitudinal sensor data using a framework designed for data sets that are limited in size (particularly targeting health studies that use passive multimodal sensors), one that combines both user-agnostic and personalized approaches and applies a combination of ranking strategies to filter predictions. Methods: In this paper, we introduced a novel ranking framework for longitudinal multimodal sensors (FLMS) to address challenges encountered in health studies involving passive multimodal sensors. Using the FLMS, we (1) built a tensor-based aggregation and ranking strategy for final interpretation, (2) processed various combinations of sensor fusions, and (3) balanced user-agnostic and personalized modeling approaches with appropriate cross-validation strategies. The performance of the FLMS was validated with the help of a real data set of adolescents diagnosed with major depressive disorder for the prediction of change in depression in the adolescent participants. Results: Predictions output by the proposed FLMS achieved a 7% increase in accuracy and a 13% increase in recall for the real data set. Experiments with existing SOTA ML algorithms showed an 11% increase in accuracy for the depression data set and how overfitting and sparsity were handled. Conclusions: The FLMS aims to fill the gap that currently exists when modeling passive sensor data with a small number of data points. It achieves this through leveraging both user-agnostic and personalized modeling techniques in tandem with an effective ranking strategy to filter predictions. 
%M 38875667 %R 10.2196/47805 %U https://ai.jmir.org/2024/1/e47805 %U https://doi.org/10.2196/47805 %U http://www.ncbi.nlm.nih.gov/pubmed/38875667 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e46875 %T Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study %A Hammoud,Mohammad %A Douglas,Shahd %A Darmach,Mohamad %A Alawneh,Sara %A Sanyal,Swapnendu %A Kanbour,Youssef %+ Avey Inc, Qatar Science and Technology Park, Doha, 210022, Qatar, 974 3001 8035, mhh@avey.ai %K digital health %K symptom checker %K artificial intelligence %K AI %K patient-centered care %K eHealth apps %K eHealth %D 2024 %7 29.4.2024 %9 Original Paper %J JMIR AI %G English %X Background: Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches. Objective: This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics. Methods: We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. 
To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker’s or a physician’s ability to return a vignette’s main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list’s ranking quality, among others. Results: The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences in the M1, F1-score, and NDCG results between the best-performing and worst-performing symptom checkers or ranges were 65.3%, 39.2%, and 74.2%, respectively. The same was observed among the participating human physicians, whereby the M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively. Conclusions: The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)–based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially. 
%M 38875676 %R 10.2196/46875 %U https://ai.jmir.org/2024/1/e46875 %U https://doi.org/10.2196/46875 %U http://www.ncbi.nlm.nih.gov/pubmed/38875676 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e52615 %T Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial %A Yan,Chao %A Zhang,Ziqi %A Nyemba,Steve %A Li,Zhuohang %+ Department of Biomedical Informatics, Vanderbilt University Medical Center, Suite 1475, 2525 West End Ave, Nashville, TN, 37203, United States, 1 6155126877, chao.yan.1@vumc.org %K synthetic data generation %K electronic health record %K generative neural networks %K tutorial %D 2024 %7 22.4.2024 %9 Tutorial %J JMIR AI %G English %X Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available. 
%M 38875595 %R 10.2196/52615 %U https://ai.jmir.org/2024/1/e52615 %U https://doi.org/10.2196/52615 %U http://www.ncbi.nlm.nih.gov/pubmed/38875595 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e47652 %T Privacy-Preserving Federated Survival Support Vector Machines for Cross-Institutional Time-To-Event Analysis: Algorithm Development and Validation %A Späth,Julian %A Sewald,Zeno %A Probul,Niklas %A Berland,Magali %A Almeida,Mathieu %A Pons,Nicolas %A Le Chatelier,Emmanuelle %A Ginès,Pere %A Solé,Cristina %A Juanola,Adrià %A Pauling,Josch %A Baumbach,Jan %+ Institute for Computational Systems Biology, University of Hamburg, Notkestrasse 9, Hamburg, 22607, Germany, 49 15750665331, julian.alexander.spaeth@uni-hamburg.de %K federated learning %K survival analysis %K support vector machine %K machine learning %K federated %K algorithm %K survival %K FeatureCloud %K predict %K predictive %K prediction %K predictions %K Implementation science %K Implementation %K centralized model %K privacy regulation %D 2024 %7 29.3.2024 %9 Original Paper %J JMIR AI %G English %X Background: Central collection of distributed medical patient data is problematic due to strict privacy regulations. Especially in clinical environments, such as clinical time-to-event studies, large sample sizes are critical but usually not available at a single institution. It has been shown recently that federated learning, combined with privacy-enhancing technologies, is an excellent and privacy-preserving alternative to data sharing. Objective: This study aims to develop and validate a privacy-preserving, federated survival support vector machine (SVM) and make it accessible for researchers to perform cross-institutional time-to-event analyses. Methods: We extended the survival SVM algorithm to be applicable in federated environments. We further implemented it as a FeatureCloud app, enabling it to run in the federated infrastructure provided by the FeatureCloud platform. 
Finally, we evaluated our algorithm on 3 benchmark data sets, a large-sample synthetic data set, and a real-world microbiome data set and compared the results to the corresponding central method. Results: Our federated survival SVM produces highly similar results to the centralized model on all data sets. The maximal difference between the model weights of the central model and the federated model was only 0.001, and the mean difference over all data sets was 0.0002. We further show that by including more data in the analysis through federated learning, predictions are more accurate even in the presence of site-dependent batch effects. Conclusions: The federated survival SVM extends the palette of federated time-to-event analysis methods by a robust machine learning approach. To our knowledge, the implemented FeatureCloud app is the first publicly available implementation of a federated survival SVM, is freely accessible for all kinds of researchers, and can be directly used within the FeatureCloud platform. 
%M 38875678 %R 10.2196/47652 %U https://ai.jmir.org/2024/1/e47652 %U https://doi.org/10.2196/47652 %U http://www.ncbi.nlm.nih.gov/pubmed/38875678 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e52211 %T The Impact of Expectation Management and Model Transparency on Radiologists’ Trust and Utilization of AI Recommendations for Lung Nodule Assessment on Computed Tomography: Simulated Use Study %A Ewals,Lotte J S %A Heesterbeek,Lynn J J %A Yu,Bin %A van der Wulp,Kasper %A Mavroeidis,Dimitrios %A Funk,Mathias %A Snijders,Chris C P %A Jacobs,Igor %A Nederend,Joost %A Pluyter,Jon R %A , %+ Catharina Cancer Institute, Catharina Hospital Eindhoven, Michelangelolaan 2, Eindhoven, 5623 EJ, Netherlands, 31 40 239 9111, lotte.ewals@catharinaziekenhuis.nl %K application %K artificial intelligence %K AI %K computer-aided detection or diagnosis %K CAD %K design %K human centered %K human computer interaction %K HCI %K interaction %K mental model %K radiologists %K trust %D 2024 %7 13.3.2024 %9 Original Paper %J JMIR AI %G English %X Background: Many promising artificial intelligence (AI) and computer-aided detection and diagnosis systems have been developed, but few have been successfully integrated into clinical practice. This is partially owing to a lack of user-centered design of AI-based computer-aided detection or diagnosis (AI-CAD) systems. Objective: We aimed to assess the impact of different onboarding tutorials and levels of AI model explainability on radiologists’ trust in AI and the use of AI recommendations in lung nodule assessment on computed tomography (CT) scans. Methods: In total, 20 radiologists from 7 Dutch medical centers performed lung nodule assessment on CT scans under different conditions in a simulated use study as part of a 2×2 repeated-measures quasi-experimental design. Two types of AI onboarding tutorials (reflective vs informative) and 2 levels of AI output (black box vs explainable) were designed. 
The radiologists first received an onboarding tutorial that was either informative or reflective. Subsequently, each radiologist assessed 7 CT scans, first without AI recommendations. The AI recommendations were then shown, and the radiologists could adjust their initial assessments. Half of the participants received the recommendations via black box AI output and half received explainable AI output. Mental model and psychological trust were measured before onboarding, after onboarding, and after assessing the 7 CT scans. For each CT assessment, we recorded whether radiologists changed their assessments of found nodules, malignancy prediction, and follow-up advice. In addition, we analyzed whether radiologists’ trust in their assessments had changed based on the AI recommendations. Results: Both variations of onboarding tutorials resulted in a significantly improved mental model of the AI-CAD system (informative P=.01 and reflective P=.01). After using AI-CAD, psychological trust significantly decreased for the group with explainable AI output (P=.02). On the basis of the AI recommendations, radiologists changed the number of reported nodules in 27 of 140 assessments, malignancy prediction in 32 of 140 assessments, and follow-up advice in 12 of 140 assessments. The changes were mostly an increased number of reported nodules, a higher estimated probability of malignancy, and earlier follow-up. The radiologists’ confidence in the nodules they found changed in 82 of 140 assessments, in their estimated probability of malignancy in 50 of 140 assessments, and in their follow-up advice in 28 of 140 assessments. These changes were predominantly increases in confidence. The number of changed assessments and radiologists’ confidence did not significantly differ between the groups that received different onboarding tutorials and AI outputs. Conclusions: Onboarding tutorials help radiologists gain a better understanding of AI-CAD and facilitate the formation of a correct mental model. 
If AI explanations do not consistently substantiate the probability of malignancy across patient cases, radiologists’ trust in the AI-CAD system can be impaired. Radiologists’ confidence in their assessments was improved by using the AI recommendations. %M 38875574 %R 10.2196/52211 %U https://ai.jmir.org/2024/1/e52211 %U https://doi.org/10.2196/52211 %U http://www.ncbi.nlm.nih.gov/pubmed/38875574 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e49784 %T Sepsis Prediction at Emergency Department Triage Using Natural Language Processing: Retrospective Cohort Study %A Brann,Felix %A Sterling,Nicholas William %A Frisch,Stephanie O %A Schrager,Justin D %+ Department of Emergency Medicine, Emory University School of Medicine, 531 Asbury Circle, Annex Building N340, Atlanta, GA, 30322, United States, 1 404 778 5975, justin@vitaler.com %K natural language processing %K machine learning %K sepsis %K emergency department %K triage %D 2024 %7 25.1.2024 %9 Original Paper %J JMIR AI %G English %X Background: Despite its high lethality, sepsis can be difficult to detect on initial presentation to the emergency department (ED). Machine learning–based tools may provide avenues for earlier detection and lifesaving intervention. Objective: The study aimed to predict sepsis at the time of ED triage using natural language processing of nursing triage notes and available clinical data. Methods: We constructed a retrospective cohort of all 1,234,434 consecutive ED encounters in 2015-2021 from 4 separate clinically heterogeneous academically affiliated EDs. After exclusion criteria were applied, the final cohort included 1,059,386 adult ED encounters. The primary outcome criteria for sepsis were presumed severe infection and acute organ dysfunction. After vectorization and dimensional reduction of triage notes and clinical data available at triage, a decision tree–based ensemble (time-of-triage) model was trained to predict sepsis using the training subset (n=950,921). 
A separate (comprehensive) model was trained using these data together with laboratory data as they became available at 1-hour intervals after triage. Model performances were evaluated using the test (n=108,465) subset. Results: Sepsis occurred in 35,318 encounters (incidence 3.45%). For sepsis prediction at the time of patient triage, using the primary definition, the area under the receiver operating characteristic curve (AUC) and macro F1-score for sepsis were 0.94 and 0.61, respectively. Sensitivity, specificity, and false positive rate were 0.87, 0.85, and 0.15, respectively. The time-of-triage model accurately predicted sepsis in 76% (1635/2150) of sepsis cases where sepsis screening was not initiated at triage and in 97.5% (1630/1671) of cases where sepsis screening was initiated at triage. Positive and negative predictive values were 0.18 and 0.99, respectively. For sepsis prediction using laboratory data available each hour after ED arrival, the AUC peaked at 0.97 at 12 hours. Similar results were obtained when stratifying by hospital and when the Centers for Disease Control and Prevention hospital toolkit for adult sepsis surveillance criteria were used to define sepsis. Among septic cases, sepsis was predicted in 36.1% (1375/3814), 49.9% (1902/3814), and 68.3% (2604/3814) of encounters at 3, 2, and 1 hours, respectively, prior to the first intravenous antibiotic order or, where antibiotics were not ordered, within the first 12 hours. Conclusions: Sepsis can be accurately predicted at ED presentation using nursing triage notes and clinical information available at the time of triage. This indicates that machine learning can facilitate timely and reliable alerting for intervention. Free-text data can improve the performance of predictive modeling at the time of triage and throughout the ED course. 
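The free-text-to-features step behind the triage model above can be illustrated with a minimal sketch. The vocabulary, notes, and nearest-centroid classifier here are all hypothetical simplifications; the study itself used vectorization with dimensionality reduction feeding a decision tree-based ensemble:

```python
import numpy as np

# Toy vocabulary; a real pipeline learns its vocabulary from the corpus.
VOCAB = ["fever", "hypotension", "confusion", "sprain", "laceration", "cough"]

def vectorize(note):
    # Bag-of-words counts over the fixed toy vocabulary.
    tokens = note.lower().split()
    return np.array([tokens.count(t) for t in VOCAB], dtype=float)

# Hypothetical labeled triage notes (1 = sepsis outcome, 0 = no sepsis).
train = [
    ("fever hypotension confusion", 1),
    ("fever cough hypotension", 1),
    ("ankle sprain after fall", 0),
    ("finger laceration no fever", 0),
]
X = np.array([vectorize(n) for n, _ in train])
y = np.array([label for _, label in train])

# Nearest-centroid classifier: one mean vector per class.
centroids = {c: X[y == c].mean(axis=0) for c in (0, 1)}

def predict(note):
    v = vectorize(note)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(predict("fever hypotension rigors"))
```

Note that plain keyword counting ignores negation ("no fever" still counts), one reason richer NLP representations outperform simple bags of words on clinical text.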
%M 38875594 %R 10.2196/49784 %U https://ai.jmir.org/2024/1/e49784 %U https://doi.org/10.2196/49784 %U http://www.ncbi.nlm.nih.gov/pubmed/38875594 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e51168 %T Learning From International Comparators of National Medical Imaging Initiatives for AI Development: Multiphase Qualitative Study %A Karpathakis,Kassandra %A Pencheon,Emma %A Cushnan,Dominic %+ Decimal.health, 50 Milk Street, Boston, MA, 02109, United States, 1 6086285988, kass.karpathakis@gmail.com %K digital health %K mobile health %K mHealth %K medical imaging %K artificial intelligence %K health policy %D 2024 %7 4.1.2024 %9 Original Paper %J JMIR AI %G English %X Background: The COVID-19 pandemic drove investment and research into medical imaging platforms to provide data to create artificial intelligence (AI) algorithms for the management of patients with COVID-19. Building on the success of England’s National COVID-19 Chest Imaging Database, the national digital policy body (NHSX) sought to create a generalized national medical imaging platform for the development, validation, and deployment of algorithms. Objective: This study aims to understand international use cases of medical imaging platforms for the development and implementation of algorithms to inform the creation of England’s national imaging platform. Methods: The National Health Service (NHS) AI Lab Policy and Strategy Team adopted a multiphased approach: (1) identification and prioritization of national AI imaging platforms; (2) Political, Economic, Social, Technological, Legal, and Environmental (PESTLE) factor analysis deep dive into national AI imaging platforms; (3) semistructured interviews with key stakeholders; (4) workshop on emerging themes and insights with the internal NHSX team; and (5) formulation of policy recommendations. Results: International use cases of national AI imaging platforms (n=7) were prioritized for PESTLE factor analysis. 
Stakeholders (n=13) from the international use cases were interviewed. Themes (n=8) from the semistructured interviews, including interview quotes, were analyzed with workshop participants (n=5). The outputs of the deep dives, interviews, and workshop were synthesized thematically into 8 categories with 17 subcategories. On the basis of the insights from the international use cases, policy recommendations (n=12) were developed to support the NHS AI Lab in the design and development of the English national medical imaging platform. Conclusions: The creation of AI algorithms and of supporting technology and infrastructure, such as platforms, often occurs in isolation within countries, let alone between countries. This novel policy research project sought to bridge that gap by learning from the challenges, successes, and experience of England’s international counterparts. Policy recommendations based on international learnings focused on demonstrating the benefits of the platform to secure sustainable funding, validating algorithms and infrastructure to support in situ deployment, and creating wraparound tools so that nontechnical participants such as clinicians can engage with algorithm creation. As health care organizations increasingly adopt technological solutions, policy makers have a responsibility to ensure that initiatives are informed by learnings from both national and international efforts, and to disseminate the outcomes of their work. 
%M 38875584 %R 10.2196/51168 %U https://ai.jmir.org/2024/1/e51168 %U https://doi.org/10.2196/51168 %U http://www.ncbi.nlm.nih.gov/pubmed/38875584 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 2 %N %P e44909 %T Machine Learning for the Prediction of Procedural Case Durations Developed Using a Large Multicenter Database: Algorithm Development and Validation Study %A Kendale,Samir %A Bishara,Andrew %A Burns,Michael %A Solomon,Stuart %A Corriere,Matthew %A Mathis,Michael %+ Department of Anesthesia, Critical Care & Pain Medicine, Beth Israel Deaconess Medical Center, 1 Deaconess Road, Boston, MA, 02215, United States, 1 6177545400, skendale@bidmc.harvard.edu %K medical informatics %K artificial intelligence %K AI %K machine learning %K operating room %K OR management %K perioperative %K algorithm development %K validation %K patient communication %K surgical procedure %K prediction model %D 2023 %7 8.9.2023 %9 Original Paper %J JMIR AI %G English %X Background: Accurate projections of procedural case durations are complex but critical to the planning of perioperative staffing, operating room resources, and patient communication. Nonlinear prediction models using machine learning methods may provide opportunities for hospitals to improve upon current estimates of procedure duration. Objective: The aim of this study was to determine whether a machine learning algorithm scalable across multiple centers could make estimations of case duration within a tolerance limit because there are substantial resources required for operating room functioning that relate to case duration. Methods: Deep learning, gradient boosting, and ensemble machine learning models were generated using perioperative data available at 3 distinct time points: the time of scheduling, the time of patient arrival to the operating or procedure room (primary model), and the time of surgical incision or procedure start. 
The primary outcome was procedure duration, defined by the time between the arrival and the departure of the patient from the procedure room. Model performance was assessed by mean absolute error (MAE), the proportion of predictions falling within 20% of the actual duration, and other standard metrics. Performance was compared with a baseline method of historical means within a linear regression model. Model features driving predictions were assessed using Shapley additive explanations values and permutation feature importance. Results: A total of 1,177,893 procedures from 13 academic and private hospitals between 2016 and 2019 were used. Across all procedures, the median procedure duration was 94 (IQR 50-167) minutes. In estimating the procedure duration, the gradient boosting machine was the best-performing model, demonstrating an MAE of 34 (SD 47) minutes, with 46% of the predictions falling within 20% of the actual duration in the test data set. This represented a statistically and clinically significant improvement in predictions compared with a baseline linear regression model (MAE 43 min; P<.001; 39% of the predictions falling within 20% of the actual duration). The most important features in model training were historical procedure duration by surgeon, the word “free” within the procedure text, and the time of day. Conclusions: Nonlinear models using machine learning techniques may be used to generate high-performing, automatable, explainable, and scalable prediction models for procedure duration. 
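The two headline metrics above, mean absolute error and the proportion of predictions falling within 20% of the actual duration, can be sketched directly. The surgeon histories and case durations below are hypothetical, and the per-surgeon historical mean mimics the baseline comparator described in the abstract:

```python
import numpy as np

def mae(actual, predicted):
    # Mean absolute error in the same units as the durations (minutes).
    return float(np.mean(np.abs(np.asarray(predicted, float) - np.asarray(actual, float))))

def within_tolerance(actual, predicted, tol=0.20):
    # Fraction of predictions within +/-20% of the actual duration,
    # the tolerance metric reported in the study.
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    return float(np.mean(np.abs(predicted - actual) <= tol * actual))

# Hypothetical baseline: predict each surgeon's historical mean duration.
history = {"A": [90, 100, 110], "B": [45, 50, 55]}
baseline = {s: float(np.mean(d)) for s, d in history.items()}

actual = [95, 120, 48]          # observed durations (min)
surgeons = ["A", "A", "B"]
pred = [baseline[s] for s in surgeons]
print(mae(actual, pred), within_tolerance(actual, pred))
```

A gradient boosting model would replace the `baseline` lookup with learned predictions, but it is scored with exactly these two functions.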
%M 38875567 %R 10.2196/44909 %U https://ai.jmir.org/2023/1/e44909 %U https://doi.org/10.2196/44909 %U http://www.ncbi.nlm.nih.gov/pubmed/38875567 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 2 %N %P e45000 %T Artificial Intelligence–Enabled Software Prototype to Inform Opioid Pharmacovigilance From Electronic Health Records: Development and Usability Study %A Sorbello,Alfred %A Haque,Syed Arefinul %A Hasan,Rashedul %A Jermyn,Richard %A Hussein,Ahmad %A Vega,Alex %A Zembrzuski,Krzysztof %A Ripple,Anna %A Ahadpour,Mitra %+ Center for Drug Evaluation and Research, US Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD, 20993, United States, 1 (888) 463 6332, mdrashedul.hasan@fda.hhs.gov %K electronic health records %K pharmacovigilance %K artificial intelligence %K real world data %K EHR %K natural language %K software application %K drug %K Food and Drug Administration %K deep learning %D 2023 %7 18.7.2023 %9 Original Paper %J JMIR AI %G English %X Background: The use of patient health and treatment information captured in structured and unstructured formats in computerized electronic health record (EHR) repositories could potentially augment the detection of safety signals for drug products regulated by the US Food and Drug Administration (FDA). Natural language processing and other artificial intelligence (AI) techniques provide novel methodologies that could be leveraged to extract clinically useful information from EHR resources. Objective: Our aim is to develop a novel AI-enabled software prototype to identify adverse drug event (ADE) safety signals from free-text discharge summaries in EHRs to enhance opioid drug safety and research activities at the FDA. 
Methods: We developed a prototype for web-based software that leverages keyword and trigger-phrase searching with rule-based algorithms and deep learning to extract candidate ADEs for specific opioid drugs from discharge summaries in the Medical Information Mart for Intensive Care III (MIMIC III) database. The prototype uses MedSpacy components to identify relevant sections of discharge summaries and a pretrained natural language processing (NLP) model, Spark NLP for Healthcare, for named entity recognition. Fifteen FDA staff members provided feedback on the prototype’s features and functionalities. Results: Using the prototype, we were able to identify known, labeled, opioid-related adverse drug reactions from text in EHRs. The AI-enabled model achieved accuracy, recall, precision, and F1-scores of 0.66, 0.69, 0.64, and 0.67, respectively. FDA participants assessed the prototype as highly desirable in user satisfaction, visualizations, and in the potential to support drug safety signal detection for opioid drugs from EHR data while saving time and manual effort. Actionable design recommendations included (1) enlarging the tabs and visualizations; (2) enabling more flexibility and customizations to fit end users’ individual needs; (3) providing additional instructional resources; (4) adding multiple graph export functionality; and (5) adding project summaries. Conclusions: The novel prototype uses innovative AI-based techniques to automate searching for, extracting, and analyzing clinically useful information captured in unstructured text in EHRs. It increases efficiency in harnessing real-world data for opioid drug safety and increases the usability of the data to support regulatory review while decreasing the manual research burden. 
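The keyword and trigger-phrase search with rule-based negation handling described above can be sketched as follows. The trigger terms, negation cues, and discharge-summary text are hypothetical, and the real prototype uses MedSpacy sectioning plus a pretrained named entity recognition model rather than plain string matching:

```python
import re

# Hypothetical opioid-related ADE trigger phrases and negation cues.
TRIGGERS = {"respiratory depression", "oversedation", "constipation"}
NEGATIONS = ("no ", "denies ", "without ")

def extract_ades(summary):
    # Sentence-level keyword search that skips negated mentions,
    # a toy stand-in for the rule-based + deep learning pipeline.
    findings = []
    for sentence in re.split(r"[.\n]", summary.lower()):
        for trig in TRIGGERS:
            negated = any(neg + trig in sentence for neg in NEGATIONS)
            if trig in sentence and not negated:
                findings.append(trig)
    return sorted(set(findings))

note = ("Patient started on oxycodone. Developed oversedation overnight. "
        "No respiratory depression. Reports constipation.")
print(extract_ades(note))
```

The negation check is deliberately crude (it only catches a cue immediately preceding the trigger); production systems use context algorithms such as ConText/NegEx, which MedSpacy provides.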
%M 37771410 %R 10.2196/45000 %U https://ai.jmir.org/2023/1/e45000 %U https://doi.org/10.2196/45000 %U http://www.ncbi.nlm.nih.gov/pubmed/37771410 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 2 %N %P e44779 %T A Scalable Radiomics- and Natural Language Processing–Based Machine Learning Pipeline to Distinguish Between Painful and Painless Thoracic Spinal Bone Metastases: Retrospective Algorithm Development and Validation Study %A Naseri,Hossein %A Skamene,Sonia %A Tolba,Marwan %A Faye,Mame Daro %A Ramia,Paul %A Khriguian,Julia %A David,Marc %A Kildea,John %+ Medical Physics Unit, McGill University Health Centre, Cedars Cancer Centre, 1001 boul Décarie Montréal, Montreal, QC, H4A 3J1, Canada, 1 514 934 1934 ext 44158, 3naseri@gmail.com %K cancer %K pain %K palliative care %K radiotherapy %K bone metastases %K radiomics %K natural language processing %K machine learning %K artificial intelligence %K radiation therapy %D 2023 %7 22.5.2023 %9 Original Paper %J JMIR AI %G English %X Background: The identification of objective pain biomarkers can contribute to an improved understanding of pain, as well as its prognosis and better management. Hence, it has the potential to improve the quality of life of patients with cancer. Artificial intelligence can aid in the extraction of objective pain biomarkers for patients with cancer with bone metastases (BMs). Objective: This study aimed to develop and evaluate a scalable natural language processing (NLP)– and radiomics-based machine learning pipeline to differentiate between painless and painful BM lesions in simulation computed tomography (CT) images using imaging features (biomarkers) extracted from lesion center point–based regions of interest (ROIs). Methods: Patients treated at our comprehensive cancer center who received palliative radiotherapy for thoracic spine BM between January 2016 and September 2019 were included in this retrospective study. 
Physician-reported pain scores were extracted automatically from radiation oncology consultation notes using an NLP pipeline. BM center points were manually pinpointed on CT images by radiation oncologists. Nested ROIs with various diameters were automatically delineated around these expert-identified BM center points, and radiomics features were extracted from each ROI. Synthetic Minority Oversampling Technique resampling, the Least Absolute Shrinkage And Selection Operator feature selection method, and various machine learning classifiers were evaluated using precision, recall, F1-score, and area under the receiver operating characteristic curve. Results: Radiation therapy consultation notes and simulation CT images of 176 patients (mean age 66, SD 14 years; 95 males) with thoracic spine BM were included in this study. After BM center point identification, 107 radiomics features were extracted from each spherical ROI using pyradiomics. Data were divided into 70% and 30% training and hold-out test sets, respectively. In the test set, the accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve of our best performing model (neural network classifier on an ensemble ROI) were 0.82 (132/163), 0.59 (16/27), 0.85 (116/136), and 0.83, respectively. Conclusions: Our NLP- and radiomics-based machine learning pipeline was successful in differentiating between painful and painless BM lesions. It is intrinsically scalable by using NLP to extract pain scores from clinical notes and by requiring only center points to identify BM lesions in CT images. %M 38875572 %R 10.2196/44779 %U https://ai.jmir.org/2023/1/e44779 %U https://doi.org/10.2196/44779 %U http://www.ncbi.nlm.nih.gov/pubmed/38875572
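The center point-based nested-ROI feature extraction described above can be sketched with simple first-order statistics. Whereas pyradiomics computed 107 features per spherical ROI in the study, this sketch derives only mean and SD of intensity from a synthetic volume with a single hypothetical bright lesion voxel:

```python
import numpy as np

def spherical_mask(shape, center, radius):
    # Boolean mask of voxels within `radius` of `center` (voxel units).
    grid = np.indices(shape)
    dist2 = sum((g - c) ** 2 for g, c in zip(grid, center))
    return dist2 <= radius ** 2

def first_order_features(volume, center, radii):
    # Mean/SD intensity per nested ROI: two of the simplest
    # first-order radiomics features, computed for each radius.
    feats = {}
    for r in radii:
        roi = volume[spherical_mask(volume.shape, center, r)]
        feats[r] = (float(roi.mean()), float(roi.std()))
    return feats

# Synthetic CT volume with one bright voxel at the expert-identified center.
vol = np.zeros((21, 21, 21))
vol[10, 10, 10] = 100.0
feats = first_order_features(vol, (10, 10, 10), radii=[1, 3, 5])
print(feats)
```

As the ROI radius grows, the lesion signal is diluted by surrounding voxels, which is why the study evaluated nested ROIs of several diameters and an ensemble over them.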