<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="letter"><front><journal-meta><journal-id journal-id-type="nlm-ta">JMIR AI</journal-id><journal-id journal-id-type="publisher-id">ai</journal-id><journal-id journal-id-type="index">41</journal-id><journal-title>JMIR AI</journal-title><abbrev-journal-title>JMIR AI</abbrev-journal-title><issn pub-type="epub">2817-1705</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v5i1e90759</article-id><article-id pub-id-type="doi">10.2196/90759</article-id><article-categories><subj-group subj-group-type="heading"><subject>Letter to the Editor</subject></subj-group></article-categories><title-group><article-title>Toward Retrieval-Grounded Evaluation for Conversational Large Language Model&#x2013;Based Risk Assessment</article-title></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>Hu</surname><given-names>Yihan</given-names></name><degrees>BSc, MPhil</degrees><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff id="aff1"><institution>MRC Epidemiology Unit, University of Cambridge</institution><addr-line>The Old Schools, Trinity Ln</addr-line><addr-line>Cambridge</addr-line><addr-line>England</addr-line><country>United Kingdom</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Coristine</surname><given-names>Andrew</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Yihan Hu, BSc, MPhil, MRC Epidemiology Unit, University of Cambridge, The Old Schools, Trinity Ln, Cambridge, England, CB2 1TN, United Kingdom, 44 
07526543793; <email>yh623@cam.ac.uk</email></corresp></author-notes><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>12</day><month>3</month><year>2026</year></pub-date><volume>5</volume><elocation-id>e90759</elocation-id><history><date date-type="received"><day>03</day><month>01</month><year>2026</year></date><date date-type="rev-recd"><day>07</day><month>01</month><year>2026</year></date><date date-type="accepted"><day>17</day><month>02</month><year>2026</year></date></history><copyright-statement>&#x00A9; Yihan Hu. Originally published in JMIR AI (<ext-link ext-link-type="uri" xlink:href="https://ai.jmir.org">https://ai.jmir.org</ext-link>), 12.3.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. 
The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://www.ai.jmir.org/">https://www.ai.jmir.org/</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://ai.jmir.org/2026/1/e90759"/><related-article related-article-type="commentary" ext-link-type="doi" xlink:href="10.2196/91981" xlink:title="Comment in" xlink:type="simple">https://ai.jmir.org/2026/1/e91981</related-article><related-article related-article-type="commentary-article" ext-link-type="doi" xlink:href="10.2196/67363" xlink:title="Comment on" xlink:type="simple">https://ai.jmir.org/2025/1/e67363</related-article><kwd-group><kwd>personalized risk assessment</kwd><kwd>large language model</kwd><kwd>artificial intelligence</kwd><kwd>conversational AI</kwd><kwd>COVID-19</kwd></kwd-group></article-meta></front><body><p>We read with great interest the paper by Roshani et al [<xref ref-type="bibr" rid="ref1">1</xref>] on a generative large language model (LLM)&#x2013;powered conversational app for pediatric COVID-19 severity risk assessment. The study makes a timely and valuable contribution by demonstrating how white-box LLMs (LLaMA2 [Meta], T0 [BigScience], and Flan-T5 [Google]) can be fine-tuned in low-data settings and deployed within an end-to-end mobile workflow. We also commend the authors for emphasizing local deployment and interpretability through attention-based feature importance.</p><p>We would like to highlight one methodological consideration that may affect how readers interpret the reported accuracy and the system&#x2019;s readiness for clinical or public-facing use. 
In the study, model outputs are operationalized as a binary risk score derived from the token probabilities of &#x201C;yes&#x201D; versus &#x201C;no,&#x201D; and performance is primarily summarized using the area under the receiver operating characteristic curve (AUC) [<xref ref-type="bibr" rid="ref1">1</xref>]. While this is an appropriate discriminative metric for outcome prediction within the dataset, the system is presented as a conversational interface intended to support real-time risk assessment. In such settings, a key safety concern extends beyond misclassification to the generation of fluent but unverifiable or incorrect clinical statements. Recent empirical evaluations of retrieval-augmented LLMs on guideline-based clinical tasks underscore the importance of source grounding and safety-focused assessment in medical settings [<xref ref-type="bibr" rid="ref2">2</xref>].</p><p>Advances in retrieval-augmented generation (RAG) challenge the implicit assumption that response fluency and stable token-level scoring sufficiently capture clinical reliability. Under an LLM-only paradigm, outputs may appear consistent and plausible, yet factual correctness is difficult to audit in the absence of explicit source grounding. By contrast, RAG enables models to condition responses on external authoritative resources, such as clinical guidelines or curated medical databases, thereby supporting source-linked verification and more stringent evaluation [<xref ref-type="bibr" rid="ref3">3</xref>]. Importantly, this represents not merely an architectural refinement but a shift in what can be meaningfully evaluated.</p><p>We therefore suggest that future work in this area would benefit from a retrieval-grounded sensitivity analysis alongside conventional AUC reporting. 
For example, the same conversational pipeline could be evaluated under two conditions: (1) the current LLM-only setting and (2) an evidence-grounded setting in which responses are generated with explicit citations to a prespecified clinical corpus, such as pediatric COVID-19 guidance from the Centers for Disease Control and Prevention or equivalent institutional protocols. Evaluation could then incorporate evidence-grounded correctness, citation faithfulness, and robustness to retrieval constraints. In addition, subgroup-level audits&#x2014;using metrics such as recall or <italic>F</italic><sub>1</sub>-score stratified by demographic or social determinant variables&#x2014;could help identify whether aggregate performance masks safety-relevant disparities across populations.</p><p>We acknowledge that retrieval-grounded approaches introduce practical challenges, including retrieval latency and the need for ongoing corpus curation. Nonetheless, in high-stakes pediatric and public health applications, these trade-offs may be justified by gains in verifiability and trustworthiness. 
We commend the authors for advancing conversational LLM&#x2013;based risk assessment and hope these considerations help further strengthen evaluation rigor and clinical interpretability in future deployments.</p></body><back><fn-group><fn fn-type="conflict"><p>None declared.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">AUC</term><def><p>area under the receiver operating characteristic curve</p></def></def-item><def-item><term id="abb2">LLM</term><def><p>large language model</p></def></def-item><def-item><term id="abb3">RAG</term><def><p>retrieval-augmented generation</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Roshani</surname><given-names>MA</given-names> </name><name name-style="western"><surname>Zhou</surname><given-names>X</given-names> </name><name name-style="western"><surname>Qiang</surname><given-names>Y</given-names> </name><etal/></person-group><article-title>Generative large language model-powered conversational AI app for personalized risk assessment: case study in COVID-19</article-title><source>JMIR AI</source><year>2025</year><month>03</month><day>27</day><volume>4</volume><fpage>e67363</fpage><pub-id pub-id-type="doi">10.2196/67363</pub-id><pub-id pub-id-type="medline">40146990</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Ke</surname><given-names>YH</given-names> </name><name name-style="western"><surname>Jin</surname><given-names>L</given-names> </name><name name-style="western"><surname>Elangovan</surname><given-names>K</given-names> </name><etal/></person-group><article-title>Retrieval augmented generation for 10 large language models and its generalizability in assessing medical 
fitness</article-title><source>NPJ Digit Med</source><year>2025</year><month>04</month><day>05</day><volume>8</volume><issue>1</issue><fpage>187</fpage><pub-id pub-id-type="doi">10.1038/s41746-025-01519-z</pub-id><pub-id pub-id-type="medline">40185842</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="other"><person-group person-group-type="author"><name name-style="western"><surname>Lewis</surname><given-names>P</given-names> </name><name name-style="western"><surname>Perez</surname><given-names>E</given-names> </name><name name-style="western"><surname>Piktus</surname><given-names>A</given-names> </name><etal/></person-group><article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title><source>arXiv</source><comment>Preprint posted online on May 22, 2020</comment><pub-id pub-id-type="doi">10.48550/arXiv.2005.11401</pub-id></nlm-citation></ref></ref-list></back></article>