Original Paper
Abstract
Background: Artificial intelligence (AI) is transforming medical imaging, with large language models such as ChatGPT-4 emerging as potential tools for automated image interpretation. While AI-driven radiomics has shown promise in diagnostic imaging, the efficacy of ChatGPT-4 in liver ultrasound analysis remains largely unexamined.
Objective: This study aimed to evaluate the capability of ChatGPT-4 in liver ultrasound radiomics, specifically its ability to differentiate fibrosis, steatosis, and normal liver tissue, compared with conventional image analysis software.
Methods: Seventy grayscale ultrasound images from a preclinical liver disease model, including fibrosis (n=31), fatty liver (n=18), and normal liver (n=21), were analyzed. ChatGPT-4 extracted texture features, which were compared with those obtained using Interactive Data Language (IDL), a traditional image analysis platform. One-way ANOVA was used to identify statistically significant features differentiating liver conditions, and logistic regression models were used to assess diagnostic performance.
Results: ChatGPT-4 extracted 9 key textural features—echo intensity, heterogeneity, skewness, kurtosis, contrast, homogeneity, dissimilarity, angular second moment, and entropy—all of which differed significantly across liver conditions (P<.05). Among individual features, echo intensity achieved the highest F1-score (0.85). When combined, ChatGPT-4 attained 76% accuracy and 83% sensitivity in classifying liver disease. Receiver operating characteristic analysis demonstrated strong discriminatory performance, with area under the curve values of 0.75 for fibrosis, 0.87 for normal liver, and 0.97 for steatosis. Compared with IDL image analysis software, ChatGPT-4 exhibited slightly lower sensitivity (0.83 vs 0.89) but showed moderate correlation (r=0.68, P<.001) with IDL-derived features. However, it significantly outperformed IDL in processing efficiency, reducing analysis time by more than 40% and highlighting its potential for high-throughput radiomic analysis.
Conclusions: Despite slightly lower sensitivity than IDL, ChatGPT-4 demonstrated high feasibility for ultrasound radiomics, offering faster processing, high-throughput analysis, and automated multi-image evaluation. These findings support its potential integration into AI-driven imaging workflows, with further refinements needed to enhance feature reproducibility and diagnostic accuracy.
doi:10.2196/68144
Introduction
In recent years, advancements in artificial intelligence (AI) have transformed various fields, and one notable application is in the realm of medical imaging [ - ]. AI holds significant potential to revolutionize medical imaging, as it can automate numerous tasks and even surpass human abilities in specific areas, whether in diagnostic or interventional applications [ ]. Integrating AI with ultrasound imaging is particularly compelling. Unlike other imaging modalities, ultrasound relies heavily on human operators [ , ]. This dependence on human expertise presents unique challenges, especially with the growing use of portable ultrasound devices. These devices are increasingly used by a diverse range of health care providers, including nonradiologists, who may have varying levels of training and experience [ ]. AI algorithms offer a powerful solution to mitigate the challenges associated with operator dependency in ultrasound imaging. These algorithms can play a crucial role in the automated detection of anomalies and significant findings, providing not only descriptive analysis but also valuable diagnostic guidance [ - ]. This capability is particularly beneficial for less experienced operators or in situations where expert radiologists are not readily available, such as regions with limited medical resources. The integration of AI in ultrasound imaging can lead to more accurate and efficient diagnostic processes, reducing the likelihood of human error and improving patient outcomes [ - ].
ChatGPT is an advanced natural language processing model developed by OpenAI, designed to comprehend and generate human-like text responses [ ]. Having been extensively trained on a diverse corpus of data, ChatGPT has cultivated the capacity to grasp context, learn from examples, and produce cohesive responses [ ]. Consequently, it has evolved into a versatile tool applicable to a wide array of uses, including health care and medical imaging [ - ]. In health care, its capacity to process and interpret vast amounts of information can support medical diagnostics, patient communication, and research. The latest version, ChatGPT-4, extends its capabilities to multimodal interaction, including image processing and potential capabilities in audio and video formats [ - ]. This enhancement is especially beneficial in health care, where it can analyze medical imagery, assist in creating educational materials, and offer visually descriptive assistance in patient care. By integrating advanced image analysis and generation, ChatGPT-4 stands poised to transform how AI supports health care professionals, offering tools for more accurate diagnoses, treatment planning, and patient engagement through rich, interactive media.
In this study, we explore the potential of ChatGPT-4 in ultrasound imaging, particularly its capabilities in radiomics analysis for detailed tissue texture characterization. We focus on using ChatGPT-4–based radiomics to detect 3 distinct liver tissue types—normal, fibrotic, and fatty liver—using ultrasound images. To address challenges related to clinical data security, patient privacy, and ethical compliance, the liver ultrasound images in our study were sourced from an animal model. We then compared the findings generated by ChatGPT-4 with those obtained from conventional image analysis software. Our exploration highlights the potential of ChatGPT-4 to enhance research efforts and future clinical applications by improving the accuracy of quantitative image analysis.
Beyond radiomics analysis, we evaluated ChatGPT-4 as a tool for distinguishing normal from abnormal cases based on imaging findings. We aimed to demonstrate its capability as a supportive tool in clinical settings. Such a tool could significantly reduce the workload of radiologists by efficiently filtering out normal cases, allowing them to focus their expertise on more complex and abnormal cases. This expanded exploration highlights the promising role of ChatGPT-4 in enhancing diagnostic accuracy and supporting clinical decision-making in liver disease detection.
Methods
Image Data Acquisition
Seventy B-mode grayscale ultrasound images acquired from validated rat liver disease models [ - ] were used for analysis. The images were distributed across 3 categories of liver health: fibrosis (n=31), steatosis (fatty liver; n=18), and normal (n=21). To maintain consistency and reliability in the analysis, the imaging parameters were standardized, including transducer frequency, gain settings, imaging depth, focus, and dynamic range. These standardizations ensured that the liver tissue’s echogenicity and overall image quality were consistent across all samples, allowing for accurate comparisons between the different health states. Additionally, each analysis focused on a single image depicting a section of the right lobe of the liver. The right lobe was chosen due to its larger size and easier accessibility, which provided a more representative and consistent area for imaging and subsequent histopathological validation. The liver pathology in these images was confirmed with histopathology, further ensuring the accuracy of the ultrasound-based categorization.
Ultrasound Image Analysis by ChatGPT-4
Overview
We leveraged the advanced capabilities of ChatGPT-4 for radiomics analysis of ultrasound images. ChatGPT-4 was used to select regions of interest (ROIs), extract radiomic features, and classify liver disease conditions. These critical steps are depicted in Figure 1.
Region of Interest Delineation
The first critical step involved selecting an ROI within the liver tissue depicted in each ultrasound image (Figure 1B and 1C). ROIs were automatically defined using ChatGPT-4’s algorithms: upon receiving the query, ChatGPT-4 initially proposed an ROI based on its automated analysis of the image, highlighting a region that it determined to be representative of the liver parenchyma. Users then refined these suggestions to ensure alignment with clinical standards, making adjustments as necessary so that the selected area was optimal for analysis, meticulously excluding artifacts such as vascular structures, acoustic shadows, and reverberation. This interactive process allowed for fine-tuning of the ROI, combining the computational efficiency of ChatGPT-4 with the expert judgment of the user. Once the ROIs were verified for accuracy in an analyzed image, they were replicated across 10 subsequent ultrasound images, which were then uploaded for radiomics analysis using a batch processing approach. Having liver images captured consistently in the same plane and region facilitated the reproducibility of ROI placement across the images. This method significantly enhanced the efficiency of our analysis, allowing for a more comprehensive assessment of liver tissue samples.
Feature Extraction
Feature extraction was conducted in batches of 10 ultrasound images, the maximum allowed by ChatGPT-4, requiring multiple sessions to analyze all 70 cases. However, conducting analyses across different sessions introduced variability—some features were occasionally omitted, while others appeared inconsistently across sessions. To mitigate this session-dependent variability and ensure consistency in feature extraction, we implemented a standardized approach. At the beginning of each session, we carefully refined the prompts provided to ChatGPT-4 to align with previously extracted features (Table S1). Missing features were explicitly requested, and any inconsistently appearing features were excluded. If ChatGPT-4 returned incomplete or inconsistent features, prompts were reissued or clarified until the correct output was obtained. Our final analysis included only features that were consistently and reliably extracted across all sessions. This approach minimized session-to-session variation while maintaining reproducibility across users.
ChatGPT-4 extracted a comprehensive set of radiomic features to characterize liver tissue texture [ - ]. These included first-order statistics and second-order texture features. First-order statistics are quantitative measures such as mean intensity, variance (heterogeneity), skewness, and kurtosis, reflecting the pixel intensity distribution. Second-order texture features are derived from the gray-level co-occurrence matrix (GLCM); these include contrast, homogeneity, entropy, and angular second moment (ASM), providing insights into spatial relationships and textural heterogeneity within the ROI.
Machine Learning for Feature Model Assessment
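To make the feature definitions concrete, the first-order statistics and GLCM measures that serve as the model's inputs can be sketched in plain numpy. This is our own minimal illustration of the standard formulas (a single horizontal pixel offset and 32 gray levels are assumed), not the authors' or ChatGPT-4's actual implementation:

```python
import numpy as np

def first_order_features(roi):
    """First-order statistics over the ROI pixel intensities."""
    px = roi.astype(float).ravel()
    mean = px.mean()                                     # echo intensity
    var = px.var()                                       # heterogeneity
    skew = ((px - mean) ** 3).mean() / px.std() ** 3     # assumes non-constant ROI
    kurt = ((px - mean) ** 4).mean() / var ** 2 - 3      # excess kurtosis
    return {"echo_intensity": mean, "heterogeneity": var,
            "skewness": skew, "kurtosis": kurt}

def glcm_features(roi, levels=32):
    """Second-order (GLCM) features for horizontal neighbors at distance 1."""
    q = (roi.astype(float) / roi.max() * (levels - 1)).astype(int)
    glcm = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[a, b] += 1                                  # count co-occurring pairs
    glcm += glcm.T                                       # make symmetric
    p = glcm / glcm.sum()                                # normalize to probabilities
    i, j = np.indices(p.shape)
    nz = p[p > 0]
    return {
        "contrast": (p * (i - j) ** 2).sum(),
        "dissimilarity": (p * np.abs(i - j)).sum(),
        "homogeneity": (p / (1 + (i - j) ** 2)).sum(),
        "asm": (p ** 2).sum(),                           # angular second moment
        "energy": np.sqrt((p ** 2).sum()),
        "entropy": -(nz * np.log2(nz)).sum(),
    }
```

A sketch like this is useful mainly for checking ChatGPT-4's outputs against a transparent reference computation.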
The extracted radiomic features were used to develop a diagnostic model based on logistic regression, a method selected for its interpretability and clinical relevance. The model was configured with L2 regularization, and the regularization strength parameter (C) was optimized through grid search over a predefined range [ , ]. The liblinear solver was used for its suitability with small datasets, and the maximum number of iterations was set to 1000 to ensure model convergence. The dataset was divided into training (60%), testing (20%), and validation (20%) subsets using stratified random sampling to maintain a balanced representation across liver disease categories (Table S2). During hyperparameter tuning, a range of C values (eg, 0.001 to 100) was evaluated to balance model fit and prevent overfitting [ ], and the optimal configuration was selected based on performance on the test set. To maintain methodological rigor, the test set was used exclusively during hyperparameter tuning, while the validation set was reserved for final model evaluation. This 3-way split ensured an unbiased assessment of model generalizability. Key metrics, including accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC), were used to quantify diagnostic precision.
Ultrasound Image Analysis by Interactive Data Language–Based Software
Concurrently, the same ultrasound images were analyzed using an established Interactive Data Language (IDL)–based tool designed for image analysis [ , ]. For this analysis, ROIs within the liver were manually defined by expert users with a specialized tool, applying the same selection criteria described earlier to ensure the precise inclusion of clinically relevant areas and the exclusion of artifacts. The ROIs were selected to match the areas chosen using ChatGPT-4. Texture features describing the first-order and second-order histograms were then extracted from the ROIs. The same feature extraction and logistic regression methodology described above was applied, allowing for a direct comparison of the 2 approaches.
Evaluating ChatGPT-4 for Imaging Findings–Based Diagnosis
To explore the potential of ChatGPT-4 as a tool for distinguishing normal from abnormal liver cases, we conducted an experiment involving liver ultrasound images representing various conditions. We uploaded these images to ChatGPT-4 and tasked it with providing detailed descriptions of the findings and possible diagnoses for each image. ChatGPT-4’s output included comprehensive imaging findings that described the characteristics of the liver tissue and suggested potential diagnoses based on these observations.
Following this, we compared the diagnoses provided by ChatGPT-4 with the actual diagnoses to assess its diagnostic performance in identifying liver pathology. This involved calculating metrics such as sensitivity, specificity, and overall accuracy to determine how well ChatGPT-4 could identify normal and abnormal cases.
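For the binary normal-versus-abnormal setting described above, these metrics reduce to simple confusion-matrix arithmetic. A minimal sketch (the label strings are illustrative, not the study's actual coding):

```python
def diagnostic_metrics(y_true, y_pred, positive="abnormal"):
    """Sensitivity, specificity, and accuracy from paired label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),      # abnormal cases correctly flagged
        "specificity": tn / (tn + fp),      # normal cases correctly cleared
        "accuracy": (tp + tn) / len(y_true),
    }

truth = ["abnormal", "abnormal", "normal", "normal"]
preds = ["abnormal", "normal", "normal", "normal"]
diagnostic_metrics(truth, preds)
# → {'sensitivity': 0.5, 'specificity': 1.0, 'accuracy': 0.75}
```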

Statistical Analysis
To interpret the differences in ultrasound texture features among the 3 liver health categories, we calculated the mean values and SEs. A 1-way ANOVA was conducted to identify any statistical differences across the study groups.
When comparing 2 groups, the Shapiro-Wilk test was used to assess normality. If the data did not meet the normality assumption, the Mann-Whitney U test was applied to determine significance; otherwise, statistical significance was evaluated using 2-tailed Student t tests, with a threshold of P<.05.
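The test-selection logic described above can be sketched with scipy.stats. This is an illustrative decision flow under our own assumptions (α=.05, and an independent-samples t test for the normal-data branch):

```python
from scipy import stats

def compare_three_groups(fibrosis, steatosis, normal):
    """One-way ANOVA across the 3 liver categories for one feature."""
    return stats.f_oneway(fibrosis, steatosis, normal)

def compare_two_groups(a, b, alpha=0.05):
    """Shapiro-Wilk normality check, then Mann-Whitney U or t test."""
    normal_data = (stats.shapiro(a).pvalue > alpha
                   and stats.shapiro(b).pvalue > alpha)
    if normal_data:
        name, res = "t test", stats.ttest_ind(a, b)   # 2-tailed by default
    else:
        name, res = "Mann-Whitney", stats.mannwhitneyu(
            a, b, alternative="two-sided")
    return name, res.pvalue
```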
The diagnostic performance of individual and combined features, including sensitivity, specificity, accuracy, and F1-score, was calculated. To support the visualization of multiclass separability, an additional exploratory analysis was performed using a decision tree classifier. A one-vs-rest classification scheme was used to generate ROC curves and compute AUC values for each class: fibrosis, normal, and steatosis. In addition, intraclass correlation coefficient (ICC) analysis was performed between 2 observers to assess the reproducibility of ChatGPT-derived features. All analyses were performed using MedCalc software (version 19.0.5; MedCalc Software Ltd).
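The one-vs-rest ROC procedure can be sketched with scikit-learn. This is a hedged sketch on synthetic data; the tree depth and other settings are our assumptions, as the paper does not specify them:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

def one_vs_rest_auc(X, y):
    """Fit a decision tree and report a one-vs-rest AUC per class."""
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    scores = clf.predict_proba(X)                    # columns follow clf.classes_
    y_bin = label_binarize(y, classes=clf.classes_)  # one indicator column per class
    return {c: roc_auc_score(y_bin[:, k], scores[:, k])
            for k, c in enumerate(clf.classes_)}

# Synthetic 2-feature data standing in for the extracted radiomic features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])
y = np.array(["normal"] * 30 + ["fibrosis"] * 30 + ["steatosis"] * 30)
aucs = one_vs_rest_auc(X, y)
```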
Results
Multiclass Liver Disease Classification by ChatGPT-4–Based Ultrasound Radiomics
Identification of Key Features
The ultrasound radiomics data processed by ChatGPT-4 provided significant insights into the textural characteristics associated with various liver diseases. ANOVA identified 9 of the 10 studied textural features—echo intensity, heterogeneity, skewness, kurtosis, contrast, homogeneity, dissimilarity, ASM, and entropy—as varying significantly among liver conditions.
Predictive Performance of Individual Features
Further analysis revealed varying degrees of accuracy, sensitivity, and specificity for the identified imaging features. The accuracy of these features ranged from 0.48 to 0.62, with echo intensity and entropy exhibiting the highest accuracy at 0.62. Specificity and sensitivity also varied, with echo intensity showing a high specificity of 0.62 and entropy demonstrating a lower specificity of 0.42. Heterogeneity and skewness presented moderate accuracy levels at 0.57, with heterogeneity having slightly higher sensitivity. Energy stood out for its specificity of 0.66, while ASM, despite having the lowest sensitivity at 0.33, exhibited the highest specificity at 0.67. When these features were combined, the overall accuracy improved to 0.76, with a sensitivity of 0.83. An analysis of feature-wise F1-scores also revealed variability in their predictive contributions. Echo intensity exhibited the strongest performance (F1-score=0.85), while heterogeneity followed with an F1-score of 0.67. Notably, the combined feature approach achieved an F1-score of 0.77, emphasizing the advantage of integrating multiple features, particularly weaker ones.

| Feature | Accuracy | Sensitivity | Specificity | F1-score |
| --- | --- | --- | --- | --- |
| Echo intensity | 0.62 | 0.56 | 0.62 | 0.85 |
| Heterogeneity | 0.57 | 0.50 | 0.55 | 0.67 |
| Skewness | 0.57 | 0.47 | 0.62 | 0.63 |
| Kurtosis | 0.48 | 0.36 | 0.64 | 0.63 |
| ASM | 0.48 | 0.33 | 0.67 | 0.42 |
| Energy | 0.57 | 0.47 | 0.66 | 0.42 |
| Contrast | 0.52 | 0.41 | 0.55 | 0.58 |
| Dissimilarity | 0.52 | 0.41 | 0.55 | 0.58 |
| Entropy | 0.62 | 0.60 | 0.42 | 0.56 |
| Homogeneity | 0.52 | 0.41 | 0.60 | 0.54 |
ROC curve analysis for features combined using a decision tree classifier showed the following AUC values: 0.75 for fibrosis, 0.87 for normal, and 0.97 for steatosis. A ROC comparison for individual features is shown in Figure S1. The comparison showed that echo intensity and heterogeneity were highest in fibrosis (0.91 and 0.86, respectively), suggesting increased structural disruption compared with steatosis and normal liver. ASM was highest in steatosis (0.88), reflecting greater textural uniformity, while fibrosis had the lowest value, indicative of higher heterogeneity. Contrast and dissimilarity, measures of local intensity variation, were most pronounced in fibrosis (0.79 and 0.73, respectively) and lowest in normal liver, reinforcing fibrosis’s greater textural complexity. Homogeneity and energy, which indicate texture smoothness and uniformity, were highest in normal liver (0.87 and 0.58, respectively), reflecting well-organized tissue architecture; homogeneity was lowest in fibrosis, further supporting its structural disorganization.
| Feature | Fibrosis | Steatosis | Normal |
| --- | --- | --- | --- |
| Echo intensity | 0.91 | 0.39 | 0.12 |
| Heterogeneity | 0.86 | 0.33 | 0.23 |
| Kurtosis | 0.22 | 0.51 | 0.82 |
| Skewness | 0.22 | 0.45 | 0.87 |
| Angular second moment | 0.11 | 0.88 | 0.59 |
| Correlation | 0.53 | 0.33 | 0.61 |
| Dissimilarity | 0.73 | 0.39 | 0.33 |
| Contrast | 0.79 | 0.42 | 0.22 |
| Entropy | 0.44 | 0.90 | 0.21 |
| Energy | 0.49 | 0.43 | 0.58 |
| Homogeneity | 0.20 | 0.49 | 0.87 |
Reproducibility and Reliability
To assess the reproducibility of ChatGPT-4 outputs across users, 2 independent observers used the same ChatGPT-4–assisted workflow to select ROIs and extract radiomic features from the same ultrasound images. Both were trained physicians with clinical and research expertise in liver ultrasound. The ICC was calculated across the extracted radiomic features to quantify consistency. The results demonstrated high reproducibility, with most features exceeding an ICC of 0.8; energy (ICC=0.96), correlation (ICC=0.92), and echo intensity (ICC=0.88) showed excellent agreement between observers. Entropy (ICC=0.81) and homogeneity (ICC=0.81) also indicated strong reliability, suggesting consistent feature extraction across different evaluators. Skewness (ICC=0.60) exhibited moderate agreement, while ASM showed the lowest value (ICC=0.25), indicating poor reproducibility for this metric.

| Feature | ICC |
| --- | --- |
| Echo intensity | 0.88 |
| Heterogeneity | 0.90 |
| Kurtosis | 0.39 |
| Skewness | 0.60 |
| Angular second moment | 0.25 |
| Correlation | 0.92 |
| Dissimilarity | 0.78 |
| Contrast | 0.89 |
| Entropy | 0.81 |
| Energy | 0.96 |
| Homogeneity | 0.81 |
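For reference, a single-measures, absolute-agreement ICC (the ICC(2,1) form of Shrout and Fleiss) can be computed directly from the two-way ANOVA mean squares. The paper does not state which ICC model MedCalc used, so this sketch shows one plausible formulation:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    `ratings` is an (n subjects x k observers) array of feature values.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    # Mean squares from the two-way ANOVA decomposition
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between observers
    sse = ((ratings - row_means[:, None]
            - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```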
Binary Classification of Healthy Liver Versus Steatosis and Fibrosis Using ChatGPT-4 Ultrasound Radiomics
Significant distinctions were observed between normal liver and diseased conditions, particularly in 8 of the 10 analyzed features. For the binary comparison between normal liver and liver disease (steatosis and fibrosis), 8 features showed significant differences (P<.05): echo intensity (27.44 vs 50.47), heterogeneity (423.96 vs 687.17), skewness (1.86 vs 0.95), kurtosis (5.24 vs 1.34), energy (0.22 vs 0.18), contrast (30.65 vs 57.37), ASM (0.003 vs 0.001), and homogeneity (0.38 vs 0.20), for normal versus liver disease, respectively.
Comparing normal liver to fibrosis revealed significant differences in 8 features (P<.05): echo intensity, heterogeneity, skewness, kurtosis, entropy, contrast, homogeneity, and correlation. For normal liver versus steatosis, significant differences were found in echo intensity, entropy, skewness, kurtosis, homogeneity, ASM, and energy. The mean values for these features are summarized in the table below.

| Group | Echo intensity, mean (SD) | Heterogeneity, mean (SD) | Entropy, mean (SD) | Skewness, mean (SD) | Kurtosis, mean (SD) | Energy, mean (SD) | Contrast, mean (SD) | Dissimilarity, mean (SD) | Homogeneity, mean (SD) | ASM, mean (SD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Liver disease (fibrosis and steatosis) | 50.47 (2.34) | 687.17 (57.19) | 8.91 (0.28) | 0.95 (0.05) | 1.34 (0.20) | 0.18 (0.03) | 57.37 (4.31) | 5.73 (0.26) | 0.20 (0.01) | 0.001 (0.00) |
| Liver fibrosis | 56.67 (2.56) | 857.07 (65.62) | 7.99 (0.34) | 0.93 (0.07) | 1.18 (0.27) | 0.18 (0.03) | 64.65 (5.04) | 6.21 (0.29) | 0.19 (0.02) | 0.001 (0.00) |
| Liver steatosis | 39.10 (2.88) | 366.24 (46.85) | 10.66 (0.06) | 0.97 (0.06) | 1.64 (0.29) | 0.19 (0.05) | 43.62 (6.71) | 4.81 (0.43) | 0.22 (0.02) | 0.001 (0.00) |
| Normal | 27.44 (2.32) | 423.96 (57.93) | 8.15 (0.46) | 1.86 (0.15) | 5.24 (0.97) | 0.22 (0.03) | 30.65 (5.22) | 4.88 (0.77) | 0.38 (0.03) | 0.003 (0.00) |
Distinguishing Liver Disease by ChatGPT-4–Based Ultrasound Image Findings
The classification tool for liver ultrasound images exhibited strong diagnostic performance across 3 categories: normal liver, fibrosis, and steatosis. Achieving an overall accuracy of 77%, the tool demonstrated its potential in aiding radiological assessments. For normal liver conditions, the tool achieved a precision, recall, and F1-score of 0.75, indicating reliable detection accuracy. In the case of fibrosis, the tool excelled with a perfect recall of 1.00, meaning it successfully identified all fibrosis cases, and an F1-score of 0.86, with a precision of 0.75. This highlights its robustness in diagnosing fibrotic conditions without missing any positive cases. However, for steatosis, while the tool showed a high precision of 0.80, the recall was slightly lower at 0.67, leading to an F1-score of 0.73. This indicates a strong ability to correctly identify steatosis when predicted, though there is room for improvement in sensitivity.
The macroaveraged metrics (precision=0.77, recall=0.81, and F1-score=0.78) and weighted averages (precision=0.77, recall=0.77, and F1-score=0.76) further underscore the tool’s balanced performance across different liver conditions. These results suggest that while the tool is already valuable for distinguishing normal and abnormal liver conditions, further refinements could enhance its sensitivity, particularly for steatosis.
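The macro and weighted figures above follow directly from the per-class values: macroaveraging takes an unweighted mean across classes, while weighted averaging weights each class by its support. A minimal sketch (the supports below are placeholders, since the paper does not report per-class case counts for this analysis):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_and_weighted(values, supports):
    """Macro: plain mean over classes. Weighted: support-weighted mean."""
    macro = sum(values) / len(values)
    weighted = sum(v * s for v, s in zip(values, supports)) / sum(supports)
    return macro, weighted

# Per-class precision for normal, fibrosis, steatosis as reported above;
# equal supports are assumed here for illustration only
macro_p, _ = macro_and_weighted([0.75, 0.75, 0.80], [1, 1, 1])
# macro_p ≈ 0.77, matching the reported macroaveraged precision
```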
Evaluation of IDL-Based Ultrasound Radiomics for Liver Disease Classification
Identification of Key Textural Features
The radiomics analysis of liver ultrasound images conducted using IDL provided significant insights into the textural characteristics associated with various liver diseases. Through ANOVA, 9 textural features were identified as varying significantly among groups with different liver conditions.
Predictive Performance of Features
The predictive performance of these ultrasound imaging features varied, with accuracy ranging from 0.47 to 0.76, sensitivity from 0.33 to 0.73, and specificity from 0.43 to 0.70. Echo intensity demonstrated the highest performance, with an accuracy of 0.76, sensitivity of 0.73, and specificity of 0.53, indicating balanced performance. Heterogeneity likewise showed an accuracy of 0.76, with a sensitivity of 0.68 and a specificity of 0.51. On the other hand, kurtosis had lower accuracy at 0.47 and sensitivity at 0.38, but a higher specificity of 0.64, highlighting its strength in correctly identifying true negative cases. Integrating multiple textural features enhanced diagnostic performance: by combining the features, the overall accuracy improved to 0.77, with a notable sensitivity of 0.89. Similarly, F1-score performance varied across features, with echo intensity achieving the highest F1-score (0.84), indicating its superior predictive power. Heterogeneity also performed reasonably, with an F1-score of 0.56. In contrast, kurtosis, ASM, entropy, and correlation had the lowest F1-scores (0.26-0.38), reflecting weaker predictive contributions. Notably, the combined feature approach achieved the highest F1-score (0.81), emphasizing the advantage of integrating multiple features to enhance predictive accuracy.

| Feature | Accuracy | Sensitivity | Specificity | F1-score |
| --- | --- | --- | --- | --- |
| Echo intensity | 0.76 | 0.73 | 0.53 | 0.84 |
| Heterogeneity | 0.76 | 0.68 | 0.51 | 0.56 |
| Kurtosis | 0.47 | 0.38 | 0.64 | 0.38 |
| Skewness | 0.67 | 0.58 | 0.55 | 0.48 |
| ASM | 0.48 | 0.33 | 0.56 | 0.26 |
| Entropy | 0.48 | 0.33 | 0.43 | 0.26 |
| GLCM_mean | 0.67 | 0.56 | 0.51 | 0.48 |
| GLCM_variance | 0.62 | 0.59 | 0.53 | 0.57 |
| Correlation | 0.48 | 0.33 | 0.70 | 0.26 |

ASM: angular second moment; GLCM: gray-level co-occurrence matrix.
Comparison Between ChatGPT and IDL Features
Correlation and Agreement Analysis Between Feature Sets
To assess the relationship between ChatGPT-4–derived features and IDL-based features, we performed correlation and agreement analyses. Each feature set was consolidated into a single value using multiple regression, allowing for a direct, one-to-one comparison between the 2 methods. A multiple linear regression model using ordinary least squares was applied to combine all extracted radiomic features into a single predicted value per image, with the liver disease category as the dependent variable. This was done separately for ChatGPT-4 and IDL outputs to generate comparable summary values. The results showed a moderate positive correlation (r=0.64) across all extracted features, which was statistically significant (P<.001). In other words, increases in ChatGPT-4 feature values tended to coincide with increases in IDL feature values, albeit with some variability. While this correlation is not perfect, it demonstrates that ChatGPT-4–derived features are reasonably well aligned with IDL features, supporting the feasibility of ChatGPT-4 for ultrasound radiomics analysis.
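The consolidation step described above can be sketched as an ordinary least squares fit per platform, followed by a Pearson correlation of the two sets of fitted values. This is a sketch on simulated features (numpy's lstsq stands in for whatever regression software the authors used):

```python
import numpy as np

def ols_summary_values(features, labels):
    """Regress the class label on all features; return the fitted values."""
    X = np.column_stack([np.ones(len(labels)), features])  # add intercept
    beta, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return X @ beta

# Simulated example: 70 images, 9 features per platform (illustrative only)
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], [21, 31, 18]).astype(float)  # normal/fibrosis/steatosis
gpt_features = labels[:, None] + rng.normal(0, 1.0, (70, 9))
idl_features = labels[:, None] + rng.normal(0, 1.0, (70, 9))
r = np.corrcoef(ols_summary_values(gpt_features, labels),
                ols_summary_values(idl_features, labels))[0, 1]
```

Because both summary values are fitted to the same disease labels, a positive correlation between platforms is expected whenever both feature sets carry diagnostic signal.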
We further examined the correlation between the 2 software packages by focusing on 7 common features extracted by both ChatGPT-4 and IDL, which showed a correlation (r) of 0.68. The degree of correlation varied among the individual features, with the strongest correlation observed when features were combined (r=0.68, P<.001). Notably, first-order histogram measures such as echo intensity (r=0.60) and kurtosis (r=0.52) showed stronger correlations, whereas GLCM-based features exhibited weaker alignment; in particular, the GLCM correlation measure demonstrated a lower correlation between platforms.
To further assess agreement between features extracted by the 2 software packages, a Bland-Altman analysis was performed (Figure S2). The results demonstrated that combined features exhibited the best agreement, with narrow agreement limits and minimal bias, reinforcing their robustness. Skewness showed particularly strong agreement, indicating interchangeability between ChatGPT-4 and IDL for this feature. Minimal proportional bias was observed for well-correlated features, supporting the feasibility of using ChatGPT-4 for radiomics analysis in this context. Based on these results, the agreement between ChatGPT-4– and IDL-derived features can be categorized into 3 levels: (1) strong agreement (reliable and interchangeable): skewness and correlation; (2) moderate agreement (requires minor adjustments): ASM, entropy, and echo intensity; and (3) weak agreement (fundamental differences requiring major corrections): kurtosis and heterogeneity.
Processing Time Comparison
In addition to feature correlation, we also compared batch processing efficiency between ChatGPT-4 and the IDL-based tool. The results demonstrated that ChatGPT-4 outperformed IDL in processing speed. ChatGPT-4 completed the entire analysis process—including ROI selection, refinement, and texture analysis—in 4 minutes and 12 seconds for a batch of 10 images (the maximum batch size). In contrast, IDL required approximately 50 seconds per case, totaling more than 8 minutes for the same batch. ChatGPT-4 therefore achieved more than a 40% reduction in processing time, highlighting its efficiency in automated batch processing. These findings suggest that ChatGPT-4 provides a viable alternative for high-throughput ultrasound radiomics analysis, offering both speed and reasonable alignment with IDL-based feature extraction.
Discussion
Expanding ChatGPT-4’s Role in Radiology
AI and natural language processing tools, such as ChatGPT, have been increasingly explored for their role in enhancing radiology workflows [ ]. Recent studies demonstrated how ChatGPT can be integrated into radiology workflows to improve efficiency in patient registration, scheduling, image acquisition, interpretation, and reporting [ , ]. The findings of these studies highlight ChatGPT’s potential to streamline repetitive tasks, reduce radiologist workload, and enhance communication in diagnostic imaging. Our study builds upon this foundation by extending the role of ChatGPT-4 beyond workflow optimization into advanced radiomics analysis. Specifically, we evaluate ChatGPT-4’s ability to extract quantitative ultrasound texture features, distinguish between different liver disease states, and compare its performance against conventional radiomics software. By bridging workflow optimization with diagnostic analysis, our findings contribute to the ongoing evolution of AI-assisted radiology, reinforcing ChatGPT’s potential as a tool for both administrative and analytical applications in medical imaging.
ChatGPT-4’s Diagnostic Performance and Reproducibility
Our study results show that ChatGPT-4’s radiomic analysis exhibited robust performance in distinguishing among the 3 liver pathology groups, achieving a sensitivity of 0.83 and AUC values of 0.75 or higher when all radiomic features were combined. While the diagnostic utility of individual features varied, the aggregated analysis compensated for weaker predictors, thereby enhancing overall classification accuracy. Moreover, the high ICC values observed between independent observers (reaching 0.96) suggest excellent reproducibility, reinforcing the robustness of ChatGPT-4–derived texture parameters in ultrasound imaging. However, not all features demonstrated high reliability; for instance, ASM yielded an ICC of 0.25, indicating poor reproducibility. Such discrepancies in interobserver agreement for extracted ultrasound features can arise from small differences in ROI placement, even when placements are close. Ultrasound images are highly sensitive to pixel-level changes, which can affect texture-based features. Additionally, interpolation effects, quantization errors, and software implementation variability can contribute to differences. These findings underscore the need for further refinement in feature extraction methodologies, particularly for features with lower reproducibility. Future research should prioritize the standardization of algorithms to enhance observer consistency, ensuring that AI-generated radiomic features are both reliable and clinically actionable.
Interpretation of the Radiomic Biomarkers in Liver Disease
The radiomic biomarkers identified in this study align with established pathophysiological changes in liver disease. Increased heterogeneity and entropy, for instance, reflect greater structural disorder where excessive collagen deposition disrupts tissue uniformity, consistent with fibrosis [
, ]. GLCM-based texture features provide additional microstructural insight; for example, ASM (or energy) serves as an index of texture uniformity—higher values indicate preserved architecture, while lower values suggest structural disruption, such as that seen in fibrosis [ ]. These features may serve as robust, noninvasive biomarkers for disease detection and monitoring. Our results showed that distinct textural patterns can be related to different liver conditions. Fibrosis presents with increased echogenicity, heterogeneity, and contrast, indicating architectural disruption. While steatosis also exhibits high echogenicity, it is associated with a smoother, more homogeneous texture, suggesting uniform yet structurally altered tissue. In contrast, normal liver maintains the most uniform texture, with high homogeneity and low contrast, reflecting preserved tissue organization.

Comparison of ChatGPT-4 With Traditional Image Analysis Software
Direct comparison between features extracted by ChatGPT-4 and IDL revealed a moderate correlation (r=0.68) when features were combined, with notable variation between specific features on an individual basis. First-order features, which primarily assess pixel intensity distributions, exhibited strong agreement between the 2 platforms, whereas GLCM-based features showed greater discrepancies. These discrepancies are likely attributable to differences in pixel adjacency definitions, quantization methods, and sampling protocols across the 2 analytical frameworks. These results highlight a persistent challenge in radiomics: reproducibility across different software implementations. Variability in image acquisition parameters, preprocessing steps, and computational feature extraction methodologies can significantly impact radiomic feature consistency. Prior studies have underscored the necessity of harmonized radiomic pipelines to enhance cross-platform reproducibility [
, ]. Establishing standardized radiomic workflows will be critical for ensuring the clinical applicability of AI-driven ultrasound analysis.

A key advantage of ChatGPT-4 in this study was its ability to process multiple images in parallel, demonstrating significant efficiency gains over conventional software. Notably, processing time was reduced by more than 40% compared with IDL, suggesting that AI-driven tools can significantly enhance radiological workflow efficiency. This capability is particularly valuable in research settings requiring high-throughput image analysis, as well as in clinical environments where real-time assessment is essential for guiding interventional procedures. Moreover, ChatGPT-4’s scalability supports its application in large-scale imaging studies, enabling rapid dataset processing while minimizing manual input. This efficiency could facilitate applications in population-based screening programs, multicenter trials, and AI-assisted educational platforms. While compute capacity was controlled for in this study, we acknowledge that hardware variability can influence software performance. Future work should evaluate AI efficiency across diverse computing environments to better account for system-dependent constraints. Despite its slightly lower diagnostic performance compared with IDL, the results are encouraging given that ChatGPT-4 was not originally designed for medical image analysis. With additional domain-specific training and fine-tuning using large-scale ultrasound datasets, its performance is expected to improve. Future research should explore ChatGPT-4’s integration into routine radiology workflows, particularly in triage settings, where automated interpretation of liver ultrasound images could expedite clinical decision-making and optimize resource allocation.
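The cross-platform discrepancies in GLCM-based features discussed above can be illustrated with a small sketch. The pure-Python snippet below builds a symmetric GLCM for a single pixel offset and derives ASM, contrast, homogeneity, and entropy from it; the offset, symmetry convention, and grey-level quantization are exactly the implementation choices that differ between platforms and drive cross-software variability. This is a simplified illustration, not the feature definitions used by either ChatGPT-4 or IDL.

```python
import math
from collections import Counter

def glcm_features(img, dx=1, dy=0):
    """Compute a symmetric GLCM for offset (dx, dy) on a 2D grid of grey
    levels, then derive common Haralick-style texture features from it."""
    pairs = Counter()
    h, w = len(img), len(img[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                pairs[(img[y][x], img[ny][nx])] += 1
                pairs[(img[ny][nx], img[y][x])] += 1  # symmetric convention
    total = sum(pairs.values())
    p = {ij: c / total for ij, c in pairs.items()}  # normalized co-occurrence
    return {
        # ASM (energy): high for uniform texture, low for disrupted texture.
        "asm": sum(v * v for v in p.values()),
        # Contrast: weights co-occurrences by squared grey-level difference.
        "contrast": sum(v * (i - j) ** 2 for (i, j), v in p.items()),
        # Homogeneity: rewards co-occurrences of similar grey levels.
        "homogeneity": sum(v / (1 + abs(i - j)) for (i, j), v in p.items()),
        # Entropy: structural disorder of the co-occurrence distribution.
        "entropy": -sum(v * math.log2(v) for v in p.values()),
    }
```

On a perfectly uniform patch this yields ASM of 1 and zero contrast and entropy, whereas a checkerboard patch yields lower ASM and higher contrast, mirroring the fibrosis-versus-normal patterns described above; changing `dx`, `dy`, or the grey-level binning changes every value, which is the reproducibility problem in miniature.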
Limitations and Challenges
Session Variability and Model Robustness
Despite its promising performance, ChatGPT-4 exhibited session-dependent variability in feature extraction. This phenomenon, which possibly arises from differences in how the model processes context and maintains internal states across separate analyses, introduces potential inconsistencies in feature reproducibility. While batch analyses remained stable, independent session resets occasionally yielded variations in extracted parameters. Session-dependent variability is a recognized limitation of large language models [
- ] and warrants further investigation in the context of medical imaging. To mitigate this challenge, we refined our prompting strategies, ensuring that feature extraction parameters were explicitly aligned across sessions. While steps were taken to standardize ChatGPT-4 prompts and maintain session continuity, variability in output due to the model’s inherent stochastic nature remains a limitation. Although incomplete feature sets were addressed through repeated prompting and prompt refinement, future studies may also benefit from averaging outputs across multiple runs or sessions to account for variability and enhance consistency. Additionally, future research should prioritize the development of standardized initialization protocols and structured prompt engineering strategies to improve the reproducibility of AI-driven radiomic analyses.

Automated ROI Selection
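The multi-session averaging strategy suggested earlier can be sketched in a few lines. The hypothetical helper below (not part of this study's pipeline) pools feature values from repeated extraction sessions, reports their mean, and flags features whose coefficient of variation exceeds a chosen stability threshold:

```python
import statistics

def aggregate_sessions(runs, cv_threshold=0.10):
    """Summarize feature values across repeated extraction sessions.

    `runs` is a list of {feature_name: value} dicts, one per session
    (at least two). Features whose coefficient of variation exceeds
    `cv_threshold` are flagged as unstable across sessions.
    """
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        mean = statistics.fmean(values)
        # Coefficient of variation: session-to-session spread relative to the mean.
        cv = statistics.stdev(values) / abs(mean) if mean else float("inf")
        summary[name] = {"mean": mean, "cv": cv, "stable": cv <= cv_threshold}
    return summary
```

Reporting the session-averaged value for stable features, and discarding or re-extracting flagged ones, is one simple way to absorb the stochastic variability described above.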
A key limitation of ChatGPT-4 is its fully automated ROI selection, which lacks the flexibility and precision needed for clinical applications. This may affect diagnostic accuracy, especially when critical pathological features fall outside the AI-defined ROI. While ChatGPT-4 does not allow direct manual ROI adjustments, we used a hybrid approach [
], iteratively refining prompts to guide the model until the desired ROI was accurately identified. This method combined AI-driven automation with user oversight, improving ROI placement and reducing errors. Future iterations of ChatGPT-4 could enhance clinical applicability by incorporating interactive manual ROI modifications [ ]. Additionally, integrating advanced machine learning algorithms could refine automated ROI selection, allowing AI to prioritize clinically relevant areas [ ]. A promising direction is the development of hybrid models that preselect an ROI while allowing clinician refinement, balancing automation with expert oversight [ ].

Preclinical Model and Clinical Translatability
Our preclinical liver disease model closely mirrors human pathology, with histological findings aligning well with clinical presentations of fibrosis and steatosis. This translatability strengthens the relevance of our results; the model has undergone extensive validation to ensure robustness and suitability for studying liver disease [
- ]. Nonetheless, this study serves as an initial assessment of ChatGPT-4 in a preclinical setting. Future work will extend to human liver ultrasound datasets, potentially involving diverse populations and multiple medical centers to enhance generalizability. Importantly, moving to clinical datasets raises privacy and ethical concerns, requiring strict compliance with HIPAA (Health Insurance Portability and Accountability Act), GDPR (General Data Protection Regulation), and other data security frameworks. Additionally, AI bias—stemming from skewed or nonrepresentative training data—remains a critical challenge, necessitating multicenter validation to ensure fairness and accuracy across varied clinical settings.

Uncertainty in AI-Generated Diagnoses
Reliable AI outputs are critical in medical imaging. Currently, ChatGPT-4 lacks inherent uncertainty quantification. Integrating probabilistic methods could improve reliability by assigning confidence levels based on prior data distributions, similar to deep learning–based radiomics [
]. Monte Carlo dropout modeling could provide uncertainty intervals, flagging cases needing further review [ ]. Ensemble modeling could further enhance reliability through consensus-based confidence scores [ ]. Explainability improvements, such as structured reasoning frameworks, would support informed decision-making [ ]. Implementing these methods would ensure ChatGPT-4 functions as a decision-support tool rather than an autonomous diagnostic system.

Clinical Applications and Future Directions
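The Monte Carlo dropout idea raised earlier can be sketched as follows. This is a hypothetical illustration on a generic linear scorer (ChatGPT-4 exposes no such mechanism): repeated stochastic forward passes with random feature dropout yield a distribution of outputs, and the spread of that distribution serves as an uncertainty estimate for flagging cases that need human review.

```python
import random
import statistics

def mc_dropout_predict(weights, features, p_drop=0.2, n_passes=100, seed=0):
    """Monte Carlo dropout sketch on a linear scorer.

    Runs `n_passes` stochastic forward passes, randomly zeroing each
    feature with probability `p_drop` (rescaling the kept features so the
    expected score is unchanged), and returns the mean prediction and its
    standard deviation as a simple uncertainty estimate.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_passes):
        score = sum(
            w * x / (1 - p_drop)
            for w, x in zip(weights, features)
            if rng.random() >= p_drop  # keep this feature on this pass
        )
        outputs.append(score)
    return statistics.fmean(outputs), statistics.stdev(outputs)
```

A case whose standard deviation exceeds a calibrated threshold would be routed to a radiologist rather than reported automatically, which is the decision-support posture advocated above.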
This study highlights ChatGPT-4’s potential in medical imaging, particularly in image interpretation. While CNNs have achieved over 90% accuracy in tasks like liver fibrosis staging [
, ], ChatGPT-4 offers distinct advantages by integrating image analysis with narrative generation and enabling interactive ROI refinement [ ]. Unlike traditional deep learning models requiring extensive training, ChatGPT-4’s adaptability supports multimodal integration, making it a promising tool for clinical applications. Fine-tuning on specialized datasets could enhance its diagnostic accuracy, bridging the gap between specialized AI models and broader usability. Enhancing ChatGPT-4’s clinical utility involves several advancements. Transfer learning can improve domain-specific accuracy by incorporating structured radiology reports and labeled diagnostic cases [ ]. Multimodal training could allow it to analyze medical images alongside textual and radiomic data, improving correlation with clinical insights [ ]. Real-time clinical decision support through interactive learning could refine outputs, while integrating longitudinal patient data could enhance disease monitoring, particularly for chronic conditions [ ]. Future research should compare ChatGPT-4 with leading deep learning models to evaluate its role in multimodal medical imaging.

Integrating ChatGPT-4 into clinical workflows has the potential to enhance diagnostic efficiency by streamlining triage, anomaly detection, and preliminary report generation. AI-driven tools have demonstrated the ability to expedite time-to-diagnosis by prioritizing critical imaging findings [
, ]. In liver ultrasound, ChatGPT-4 could assist by identifying normal scans or those with only minor abnormalities, allowing radiologists to focus on complex cases. Its radiomic analysis capabilities may facilitate early disease detection, akin to AI models that have identified microvascular changes in brain imaging and tumor margins in mammography [ , ]. Automated report generation is another promising application, as AI-generated reports have been shown to match human interpretation in accuracy [ , ]. Additionally, real-time feedback during ultrasound-guided procedures and batch-processing for large-scale imaging analysis could support multicenter studies and population-based disease screening [ , ]. Despite its potential, key challenges include validation across diverse populations, regulatory approval, and clinician training. Prospective studies are needed to assess ChatGPT-4’s impact on diagnostic accuracy, workflow efficiency, and patient outcomes. Addressing these challenges could establish ChatGPT-4 as a transformative tool in radiology, optimizing early disease detection and clinical workflows.

Conclusions
In conclusion, our study confirms the feasibility of using ChatGPT-4 for liver disease diagnosis through ultrasound image analysis, emphasizing its potential to assist radiologists in making more accurate diagnoses. Despite some limitations, ChatGPT-4's ability to efficiently handle large-scale image datasets and its robust feature extraction capabilities make it a valuable tool for enhancing diagnostic accuracy and supporting radiological decision-making. By integrating ChatGPT-4 into radiological workflows, radiologists can leverage its capabilities to improve the precision and efficiency of liver ultrasound image analysis. This tool's potential to manage vast amounts of data with high efficiency is particularly appealing in modern medical research and clinical practice.
Conflicts of Interest
None declared.
Additional figures and tables.
DOCX File, 685 KB

References
- Oren O, Gersh BJ, Bhatt DL. Artificial intelligence in medical imaging: switching from radiographic pathological data to clinically meaningful endpoints. Lancet Digit Health. 2020;2(9):e486-e488. [FREE Full text] [CrossRef] [Medline]
- Kim HE, Kim HH, Han BK, Kim KH, Han K, Nam H, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. 2020;2(3):e138-e148. [CrossRef]
- van den Heuvel TL, van der Eerden AW, Manniesing R, Ghafoorian M, Tan T, Andriessen T, et al. Automated detection of cerebral microbleeds in patients with traumatic brain injury. Neuroimage Clin. 2016;12:241-251. [CrossRef] [Medline]
- Becker AS, Marcon M, Ghafoor S, Wurnig MC, Frauenfelder T, Boss A. Deep learning in mammography. Invest Radiol. 2017;52(7):434-440. [CrossRef]
- Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271-e297. [CrossRef]
- Bello GA, Dawes TJW, Duan J, Biffi C, de Marvao A, Howard LSGE, et al. Deep learning cardiac motion analysis for human survival prediction. Nat Mach Intell. 2019;1(2):95-104. [FREE Full text] [CrossRef] [Medline]
- Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL. Artificial intelligence in radiology. Nat Rev Cancer. 2018;18(8):500-510. [FREE Full text] [CrossRef] [Medline]
- Di Serafino M, Iacobellis F, Schillirò ML, D'auria D, Verde F, Grimaldi D, et al. Common and uncommon errors in emergency ultrasound. Diagnostics (Basel). 2022;12(3):631. [FREE Full text] [CrossRef] [Medline]
- Pinto A, Pinto F, Faggian A, Rubini G, Caranci F, Macarini L, et al. Sources of error in emergency ultrasonography. Crit Ultrasound J. 2013;5(S1). [FREE Full text] [CrossRef]
- Le MT, Voigt L, Nathanson R, Maw AM, Johnson G, Dancel R, et al. Comparison of four handheld point-of-care ultrasound devices by expert users. Ultrasound J. 2022;14(1):27. [FREE Full text] [CrossRef] [Medline]
- Malik AN, Rowland J, Haber BD, Thom S, Jackson B, Volk B, et al. The use of handheld ultrasound devices in emergency medicine. Curr Emerg Hosp Med Rep. 2021;9(3):73-81. [FREE Full text] [CrossRef] [Medline]
- Kim YH. Artificial intelligence in medical ultrasonography: driving on an unpaved road. Ultrasonography. 2021;40(3):313-317. [FREE Full text] [CrossRef] [Medline]
- Komatsu M, Sakai A, Dozen A, Shozu K, Yasutomi S, Machino H, et al. Towards clinical application of artificial intelligence in ultrasound imaging. Biomedicines. 2021;9(7):720. [FREE Full text] [CrossRef] [Medline]
- Kayarian F, Patel D, O'Brien JR, Schraft EK, Gottlieb M. Artificial intelligence and point-of-care ultrasound: benefits, limitations, and implications for the future. Am J Emerg Med. 2024;80:119-122. [CrossRef] [Medline]
- Zhang J, Dawkins A. Artificial intelligence in ultrasound imaging: where are we now? Ultrasound Q. 2024;40(2):93-97. [CrossRef] [Medline]
- Zhang L, Chen T, Wu H. Recent advances in artificial intelligence-assisted ultrasound scanning. Appl Sci. 2023;13(6):3693. [CrossRef]
- Dicle O. Artificial intelligence in diagnostic ultrasonography. Diagn Interv Radiol. 2023;29(1):40-45. [FREE Full text] [CrossRef] [Medline]
- Else H. Abstracts written by ChatGPT fool scientists. Nature. 2023;613(7944):423. [FREE Full text] [CrossRef] [Medline]
- Dwivedi YK, Kshetri N, Hughes L, Slade E, Jeyaraj A, Kar AK, et al. Opinion paper: “So what if ChatGPT wrote it?” multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inf Manag. 2023;71:102642. [FREE Full text] [CrossRef]
- Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023;307(2):e230171. [CrossRef] [Medline]
- Jeblick K, Johnson L, Arora S. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. arXiv. 2023;34(5):1-9. [FREE Full text] [CrossRef]
- Alhasan K, Al-Tawfiq J, Aljamaan F, Jamal A, Al-Eyadhy A, Temsah MH. Mitigating the burden of severe pediatric respiratory viruses in the post-COVID-19 era: ChatGPT insights and recommendations. Cureus. 2023;15(3):e36263. [FREE Full text] [CrossRef] [Medline]
- Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. Using ChatGPT to write patient clinic letters. Lancet Digit Health. 2023;5(4):e179-e181. [FREE Full text] [CrossRef]
- Santandreu-Calonge D, Ortiz-Martinez Y, Agusti-Toro A. Can ChatGPT improve communication in hospitals? Profesional de la información. 2023;32(2). [FREE Full text] [CrossRef]
- Biswas SS. Role of ChatGPT in radiology with a focus on pediatric radiology: proof by examples. Pediatr Radiol. 2023;53(5):818-822. [CrossRef] [Medline]
- Rao AS, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv. 2023. [FREE Full text] [CrossRef] [Medline]
- Wiggers K. OpenAI releases GPT-4, a multimodal AI that it claims is state-of-the-art. TechCrunch. 2023. URL: https://techcrunch.com/2023/03/14/openai-releases-gpt-4-ai-that-it-claims-is-state-of-the-art/ [accessed 2023-03-15]
- Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, et al. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv. 2303.12712 [cs.CL]. 2023. [FREE Full text]
- Sultan LR, Mohamed MK, Andronikou S. ChatGPT-4: a breakthrough in ultrasound image analysis. Radiol Adv. 2024;1(1):umae006. [CrossRef]
- Wu S, Wang X, Xing W, Li F, Liang M, Li K, et al. An update on animal models of liver fibrosis. Front Med (Lausanne). 2023;10:1160053. [FREE Full text] [CrossRef] [Medline]
- D'Souza JC, Sultan LR, Hunt SJ, Schultz SM, Brice AK, Wood AKW, et al. B-mode ultrasound for the assessment of hepatic fibrosis: a quantitative multiparametric analysis for a radiomics approach. Sci Rep. 2019;9(1):8708. [FREE Full text] [CrossRef] [Medline]
- Sultan LR, Cary TW, Al-Hasani M, Karmacharya MB, Venkatesh SS, Assenmacher C, et al. Can sequential images from the same object be used for training machine learning models? a case study for detecting liver disease by ultrasound radiomics. AI (Basel). 2022;3(3):739-750. [FREE Full text] [CrossRef] [Medline]
- Al-Hasani M, Sultan LR, Sagreiya H, Cary TW, Karmacharya MB, Sehgal CM. Ultrasound radiomics for the detection of early-stage liver fibrosis. Diagnostics (Basel). 2022;12(11):2737. [FREE Full text] [CrossRef] [Medline]
- Liao YY, Yang KC, Lee MJ, Huang KC, Chen JD, Yeh CK. Multifeature analysis of an ultrasound quantitative diagnostic index for classifying nonalcoholic fatty liver disease. Sci Rep. 2016;6:35083. [FREE Full text] [CrossRef] [Medline]
- Suganya R, Rajaram S. Feature extraction and classification of ultrasound liver images using haralick texture-primitive features: application of SVM classifier. 2013. Presented at: 2013 International Conference on Recent Trends in Information Technology (ICRTIT); 2013 July 25:596-602; Chennai, India. [CrossRef]
- Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. In: Data Mining, Inference, and Prediction. New York. Springer; 2009.
- Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. USA. John Wiley & Sons; 2013.
- Sultan LR, Chen YT, Cary TW, Ashi K, Sehgal CM. Quantitative pleural line characterization outperforms traditional lung texture ultrasound features in detection of COVID-19. J Am Coll Emerg Physicians Open. 2021;2(2):e12418. [FREE Full text] [CrossRef] [Medline]
- Mese I, Taslicay CA, Sivrioglu AK. Improving radiology workflow using ChatGPT and artificial intelligence. Clin Imaging. 2023;103:109993. [FREE Full text] [CrossRef] [Medline]
- Lyu Q, Tan J, Zapadka ME, Ponnatapura J, Niu C, Myers KJ, et al. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Vis Comput Ind Biomed Art. 2023;6(1):9. [FREE Full text] [CrossRef] [Medline]
- Elkassem AA, Smith AD. Potential use cases for ChatGPT in radiology reporting. Am J Roentgenol. 2023;221(3):373-376. [FREE Full text] [CrossRef]
- Byra M, Styczynski G, Szmigielski C, Kalinowski P, Michałowski Ł, Paluszkiewicz R, et al. Transfer learning with deep convolutional neural network for liver steatosis assessment in ultrasound images. Int J Comput Assist Radiol Surg. Dec 2018;13(12):1895-1903. [FREE Full text] [CrossRef] [Medline]
- Park J, Lee JM, Lee G, Jeon SK, Joo I. Quantitative evaluation of hepatic steatosis using advanced imaging techniques: focusing on new quantitative ultrasound techniques. Korean J Radiol. Jan 2022;23(1):13-29. [FREE Full text] [CrossRef] [Medline]
- Wang Y, Zhang X-J. Role of radiomics in staging liver fibrosis: a meta-analysis. BMC Med Imaging. Apr 12, 2024;24(1):87-554. [FREE Full text] [CrossRef] [Medline]
- Traverso A, Wee L, Dekker A, Gillies R. Repeatability and reproducibility of radiomic features: a systematic review. Int J Radiat Oncol Biol Phys. 2018;102(4):1143-1158. [FREE Full text] [CrossRef] [Medline]
- Thomas HMT, Wang HYC, Varghese AJ, Donovan EM, South CP, Saxby H, et al. Reproducibility in radiomics: a comparison of feature extraction methods and two independent datasets. Appl Sci (Basel). 2023;13(12):7291. [FREE Full text] [CrossRef] [Medline]
- Radford A, Wu J, Child R, Luan D, Amodei D. Language models are unsupervised multitask learners. OpenAI Blog. 2019. [FREE Full text]
- OpenAI. GPT-4 technical report. arXiv. 2023. URL: https://arxiv.org/abs/2303.08774 [accessed 2025-05-21]
- Smith SM, Beckmann CF, Ramnani N, Woolrich MW, Bannister PR, Jenkinson M, et al. Variability in fMRI: a re-examination of inter-session differences. Hum Brain Mapp. 2005;24(3):248-257. [FREE Full text] [CrossRef] [Medline]
- Zhu Z, Albadawy E, Saha A, Zhang J, Harowicz MR, Mazurowski MA. Deep learning for identifying radiogenomic associations in breast cancer. Comput Biol Med. Jun 2019;109(3):85-90. [FREE Full text] [CrossRef] [Medline]
- Mazurowski MA, Buda M, Saha A, Bashir MR. Deep learning in radiology: an overview of the concepts and a survey of the state of the art with focus on MRI. J Magn Reson Imaging. 2019;49(4):939-954. [FREE Full text] [CrossRef] [Medline]
- Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med. 2019;25(6):954-961. [CrossRef] [Medline]
- McBee MP, Awan OA, Colucci AT, Ghobadi CW, Kadom N, Kansagra AP, et al. Deep learning in radiology: a review of current applications and future directions. Acad Radiol. 2018;25(11):1472-1480. [CrossRef] [Medline]
- Wang Y. Uncertainty-aware AI in medical decision support: a review of bayesian and probabilistic approaches. J Med AI. 2023;5(2):120-135.
- Gal Y, Ghahramani Z. Dropout as a bayesian approximation: representing model uncertainty in deep learning. 2016. Presented at: Proceedings of the 33rd International Conference on Machine Learning (ICML); 2016 June 19:1050-1059; New York, NY.
- Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng. Dec 2022;6(12):1399-1406. [FREE Full text] [CrossRef] [Medline]
- Holzinger A, Carrington A, Müller H. Measuring the quality of explanations: the system causability scale (SCS): comparing human and machine explanations. Kunstliche Intell (Oldenbourg). May 31, 2020;34(2):193-198. [FREE Full text] [CrossRef] [Medline]
- Yin C, Zhang H, Du J, Zhu Y, Zhu H, Yue H. Artificial intelligence in imaging for liver disease diagnosis. Front Med (Lausanne). 2025;12:1591523. [FREE Full text] [CrossRef] [Medline]
- Idrees H. Transfer learning in natural language processing: a game-changer for AI models. Medium. 2023. URL: https://tinyurl.com/36d7se5y [accessed 2025-02-01]
- Rajpurkar P, Chen E, Banerjee O, Topol E. AI in health and medicine. Nat Med. Jan 20, 2022;28(1):31-38. [CrossRef]
- Niu S, Ma J, Yin Q, Wang Z, Bai L, Yang X. Modelling patient longitudinal data for clinical decision support: a case study on emerging ai healthcare technologies. Inf Syst Front. Jul 18, 2024;27(2):409-427. [CrossRef]
Abbreviations
AI: artificial intelligence
ASM: angular second momentum
AUC: area under the receiver operating characteristic curve
GDPR: General Data Protection Regulation
GLCM: gray-level co-occurrence matrix
HIPAA: Health Insurance Portability and Accountability Act
ICC: intraclass correlation coefficient
IDL: interactive data language
ROC: receiver operating characteristic
ROI: region of interest
Edited by K El Emam; submitted 29.10.24; peer-reviewed by E Montin, N Nanthasamroeng; comments to author 15.12.24; revised version received 28.02.25; accepted 18.05.25; published 30.06.25.
Copyright©Laith R Sultan, Shyam Sunder B Venkatakrishna, Sudha A Anupindi, Savvas Andronikou, Michael R Acord, Hansel J Otero, Kassa Darge, Chandra M Sehgal, John H Holmes. Originally published in JMIR AI (https://ai.jmir.org), 30.06.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.