Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation

doi:10.2196/72153

Published on 04.Dec.2025 in Vol 4 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/72153, first published 04.Feb.2025.

CLEVER: New evaluation methodology for Large Language Models in Healthcare

Veysel Kocaman¹; Mustafa Aytuğ Kaya²; Andrei Marian Feier¹; David Talby¹

Journals

Tan C, Gunasekeran D, Low C, Sim G, Foo D, Morris R, Wong T. Regulation of clinical Artificial Intelligence (AI) in the Age of Agents: Unconfined Non-Deterministic Clinical Software (UNDCS) systems for healthcare. npj Digital Medicine 2026;9(1)
Yeh Y, Shih M, De Backer D, Celi L, See K, Fujii T, Ling L, Mongkolpun W, Hu H, Chen H, Chen W, Cholley B, Fong K, Ryu H, Na S, Egi M, Chan W, Chen K, Kamaleswaran R, Chuang Y, Yang C, Hsiao W, Lai S, Ku D, Jahan A, Martin G. The IMPACT framework for evaluating generative AI in critical care: development and multinational consensus validation. Annals of Intensive Care 2026;16:100078
Cheng A, Elkhadrawy A, Setzen S, Li A, Biskaduros A, Kostas J, Rameau A. Evaluating Injection Laryngoplasty Skills Using a Foundation Model: A Feasibility Study. The Laryngoscope 2026
Sulaimanov U, Sanlier N, Moniri A, Demir B, Serikkanov Y, Bayramoglu A, Al-Jebur M, Sanlier M, Erginoglu U, Otles E, Ammanuel S, Keles A, Erginoglu U, Baskaya M. Evaluating Large Language Models for Automated Evidence Synthesis in Neuroimaging AI: A Multi-Model Benchmark. Journal of Clinical Medicine 2026;15(11):4230
Sha Y, Yu L, Lin Z, Kaur A, Lou Y, Gornale S, Zhang T, Wong L, Wang Z, Yan Y, Zhang X, Hong R, Li K, Im S, de Carvalho P, Tan T, Li K. A roadmap for medical large language models: a review of foundations, applications, and challenges. Military Medical Research 2026;13(1):100050
Eryılmaz B, Arzideh K, Bahn M, Damm H, Warmer S, Schäfer H, Idrissi-Yaghir A, Pakull T, Albrecht L, Kleesiek J, Lodde G, Friedrich C, Livingstone E, Schadendorf D, Borys K, Nensa F, Hosch R. Extracting Medical Information From Unstructured Clinical Text Using Large Language Models to Enhance Health Care Interoperability: Proof-of-Concept Study. Journal of Medical Internet Research 2026;28:e92413
Çögenli M. Visible AI assistant outputs in psychosocial risk management: an OHP/OHS-grounded benchmark for governance and responsible use. Frontiers in Public Health 2026;14

Conference Proceedings

Deva R, Thapa A, Mehta Z, Kapile S, Jalota S, Karusala N, Ismail A. Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency. LLM Evaluation as Sociotechnical Practice: Experiences from a Research-Practice Partnership in Community Health
Ramadan U, Beladinna Arifa A, Adhitama R. 2026 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT). Impact of Knowledge Source Type on RAG-Based LLMs in Specialized Medical Domains: A Case Study on Systemic Lupus Erythematosus
