JMIR AI

2817-1705

JMIR Publications

Toronto, Canada

v5i1e91981

10.2196/91981

Letter to the Editor

Authors’ Reply: Toward Retrieval-Grounded Evaluation for Conversational Large Language Model–Based Risk Assessment

Mohammad Amin Roshani, MS (1)
Xiangyu Zhou, MS (1)
Yao Qiang, PhD (2)
Srinivasan Suresh, MD (3)
Steve Hicks, MD (4)
Usha Sethuraman, MD (5)
Dongxiao Zhu, PhD (1)

(1) Department of Computer Science, Wayne State University, 5057 Woodward Ave, Detroit, United States
(2) Oakland University, Rochester, United States
(3) UPMC Children's Hospital of Pittsburgh, Pittsburgh, United States
(4) Department of Pediatrics, Pennsylvania State University, Hershey, United States
(5) Children's Hospital of Michigan, Detroit, United States

Andrew Coristine

Correspondence to Dongxiao Zhu, PhD, Department of Computer Science, Wayne State University, 5057 Woodward Ave, Detroit, MI, 48202, United States, 1 3135773104; dzhu@wayne.edu

2026

12 March 2026

e91981

30 January 2026; 17 February 2026

2026

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.

https://ai.jmir.org/2025/1/e67363

https://ai.jmir.org/2026/1/e90759

personalized risk assessment; large language model; artificial intelligence; conversational AI; COVID-19

We thank the authors of the letter to the editor [1] for their thoughtful perspective and for highlighting the potential value of retrieval-grounded sensitivity analyses alongside conventional area under the receiver operating characteristic curve (AUC) reporting. We welcome this opportunity to further clarify the design choices and evaluation scope of our published work.

As described in Roshani et al [2], the system provides 2 distinct user-facing interfaces, a clinician-facing interface and a patient-facing interface (both shown in Figure 3 in Roshani et al [2]), each designed for a different purpose. We agree that subgroup-level audits are essential; accordingly, our clinician-facing interface already supports stratified performance review (eg, AUC by demographic variables), laying the groundwork for more systematic fairness analyses in future versions.
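To make the idea of a stratified performance review concrete, the following is a minimal Python sketch of computing AUC separately for each demographic subgroup. It is an illustration only and not the implementation used in Roshani et al [2]: the function names, the age-band labels, and the toy data are all hypothetical, and AUC is computed here by direct pairwise comparison, which is adequate for small samples.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a subgroup contains one class only
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Return a dict mapping each subgroup value to its AUC."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Hypothetical toy data: 1 = severe, 0 = nonsevere; groups are made-up age bands.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1, 0.4, 0.5]
groups = ["<5y", "<5y", "<5y", "<5y", "5-17y", "5-17y", "5-17y", "5-17y"]

result = auc_by_group(labels, scores, groups)
print(result)
```

Returning None for single-class subgroups, rather than raising an error, lets a review table flag strata that are too small or too homogeneous to evaluate.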

In contrast, the patient-facing interface prioritizes interpretability and usability rather than aggregate performance metrics. It presents both a binary classification (severe vs nonsevere) and an individualized continuous risk score derived from the model’s logit output, where higher values indicate greater severity. This output is complemented by attention-based feature importance to support transparent, conversational risk assessment. Notably, the patient-facing interface does not generate free-text clinical statements, so the risk of hallucinated clinical content does not arise in the system’s current design.
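As a rough illustration of how a single logit can yield both outputs described above, the Python sketch below maps a logit to a score in the range 0 to 1 and a binary label. The sigmoid mapping and the 0.5 threshold are assumptions made for this example; the published study states only that the score is derived from the logit and that higher values indicate greater severity. The function name is hypothetical.

```python
import math


def risk_from_logit(logit, threshold=0.5):
    """Map a raw logit to (score, label).

    Assumption: the score is the logistic sigmoid of the logit, and the
    binary label is obtained by thresholding the score at `threshold`.
    """
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if logit >= 0:
        score = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        score = e / (1.0 + e)
    label = "severe" if score >= threshold else "nonsevere"
    return score, label


for z in (-2.0, 0.0, 2.0):
    print(z, risk_from_logit(z))
```

Because the sigmoid is monotonic, ranking patients by this score is equivalent to ranking them by the raw logit; the transformation changes only how the value is presented.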

Regarding retrieval grounding, the letter raises an important direction for future evaluation. We note that retrieval-based augmentation requires the availability of stable, domain-specific corpora. In the published study, our primary objective was to investigate the feasibility and performance of generative white-box large language models (LLMs) in low-data, emergent-disease settings. As such, we primarily considered a novel infection scenario, in which trusted external knowledge sources may be sparse, incomplete, or evolving. This was the case for pediatric COVID-19 during its early emergence, when authoritative, age-specific guidelines were limited, making LLM-only approaches more practical at the time. In this context, the system is intended to provide an initial, clinician-informed risk assessment, leveraging a small number of curated cases incorporated during fine-tuning, rather than relying on external retrieval. We therefore presented the reported system as an initial in-house baseline, designed to function in early-stage or data-limited settings.

As clinical knowledge bases mature, the same conversational pipeline can naturally be extended to incorporate explicit evidence grounding and source citation, enabling more personalized and evidence-supported risk assessment. We appreciate the authors’ comments highlighting this direction and view it as complementary to, rather than in conflict with, the scope and objectives of the published work.

Funding

Research reported in this publication was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health under awards R61HD105610 and R33HD105610.

Conflicts of Interest

SH is named as a co-inventor on a patent for the diagnostic use of salivary RNA in neurologic disorders. He previously served as a scientific advisory board member for Quadrant Biosciences and Spectrum Solutions. All other authors declare no conflicts of interest.

Abbreviations

AUC: area under the receiver operating characteristic curve
LLM: large language model

References

1. Toward retrieval-grounded evaluation for conversational large language model–based risk assessment. JMIR AI. 2026;5:e90759. doi: 10.2196/90759
2. Roshani, Zhou, Qiang. Generative large language model-powered conversational AI app for personalized risk assessment: case study in COVID-19. JMIR AI. 2025 Mar 27;4:e67363. doi: 10.2196/67363. PMID: 40146990