Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation

doi:10.2196/83640

Published on 20.Feb.2026 in Vol 5 (2026)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/83640, first published 05.Sep.2025.

Saubhagya Joshi¹; Monjil Mehta²; Sarjak Maniar²; Mengqian Wang³; Vivek Kumar Singh¹

Citation

Please cite as:

Joshi S, Mehta M, Maniar S, Wang M, Singh VK
Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation
JMIR AI 2026;5:e83640
doi: 10.2196/83640 PMID: 41719488 PMCID: 12923095

This paper is in the following e-collection/theme issue:

Responsible Health AI; Clinical Informatics; mHealth in a Clinical Setting; Decision Support for Health Professionals; Generative Language Models Including ChatGPT
