AI Chatbot Answers for Drug Dosing Adjustments According to Renal Function in Geriatric Patients Using the New Scoring System (AI Quality Output Score): Cross-Sectional Study

doi:10.2196/87803

¹Department of Clinical Pharmacy, Institute of Pharmacy, Faculty of Medicine, Leipzig University, Brüderstraße 32, Leipzig, Saxony, Germany

²Drug Safety Center, Leipzig University and University of Leipzig Medical Center, Leipzig, Saxony, Germany

³Sana Geriatric Centre Zwenkau, Zwenkau, Saxony, Germany

Corresponding Author:

Thilo Bertsche, Prof Dr

Background: Preventable adverse drug reactions in geriatric patients are caused by overdosing, especially in cases of impaired renal function. Artificial intelligence (AI) chatbots are being discussed as tools to generate drug information, which can adjust drug dosing and prevent subsequent adverse drug reactions based on individualized patient data. However, the question arises as to the extent to which such AI chatbots can withstand scientific evaluation in this task.

Objective: We newly developed and validated the AI quality output score (AQUOS, ranging from 0% to 100%) to assess the quality of AI chatbot answers. We investigated whether AQUOS depends on (1) renal function, (2) medication complexity, (3) prompting language (English and German), and (4) whether the answers are reproducible (assessed at 2 independent times). Additionally, we assessed the potential for harm.

Methods: In a standardized prompt, we asked 4 AI chatbots (ChatGPT, Copilot, Gemini, and Scite) whether the medication of 100 geriatric patients with polymedication at discharge should be adjusted according to their renal function. We prompted drug-related queries in 2 languages and at 2 times to assess AI chatbot answers, and we scored the generated outputs based on AQUOS. Additionally, we assessed possible harm from the AI chatbot answers using the World Health Organization definition “The conceptual framework for the international classification for patient safety.”

Results: We analyzed 1600 AI chatbot answers, with AQUOS values ranging from −19.0% to 95.2%, depending on the chatbot. We found that AQUOS declined with decreasing renal function (ChatGPT: −0.215; P=.03) and increasing medication complexity (Scite: −0.239; P=.02). Possible harm also correlated with more complicated patient statuses (lower kidney function and higher medication complexity) across all chatbots. Overall scores were up to 4.8% higher in English than in German prompting. The AI chatbot answers were highly reproducible.

Conclusions: In renal drug dosing, the quality of AI chatbot answers declined as renal function decreased and medication complexity increased. Even the highest AQUOS achieved is insufficient for deploying AI chatbots in the high-risk health care sector.

JMIR AI 2026;5:e87803

Keywords

artificial intelligence; AI; large language models; LLMs; pharmaceutical; score; decision-making

Artificial intelligence (AI), and especially AI chatbots, has become important in medical practice, supporting clinical decision-making [1,2]. AI chatbots, AI-based search engines, and other large language models (LLMs) are increasingly studied for their potential to make therapy processes—and, in particular, the gathering of drug information—faster and more efficient. In contrast to the high number of publications dealing with AI in health care, only a few involve real patient data [3]. Earlier studies often focus on limited aspects and lack practical relevance to specific clinical needs, such as renal dosage adjustment [4-6].

Adjusting medication based on renal function is challenging, especially for geriatric patients. Adverse drug reactions often result from overdosing, frequently caused by impaired renal function [7]. Age-related changes in pharmacokinetic parameters, comorbidities, and polypharmacy require careful, individualized drug therapy, especially in terms of patient safety and the safety of drug therapy [8,9]. However, tailored medication dosages adjusted to renal function should also be readily available under routine conditions to prevent avoidable adverse drug reactions. AI chatbots could have great potential in this context, although it should be noted from a legal perspective that they are not classified as medical devices.

Until now, no standardized score has existed for a quantitative quality assessment of AI chatbot answers in drug-related queries. To close this gap, we developed a new score to evaluate the quality (AI quality output score [AQUOS]) of AI chatbot answers based on the literature, adapted for drug-related queries [10,11]. Rather than relying on casual, everyday phrasing, prompts were structured using prompt engineering to provide consistent input and optimize AI potential [12,13]. Because medication queries are not only applicable to geriatric patients and renal function, AQUOS applies to a broader range of drug-related clinical scenarios.

Our study aimed to evaluate the quality and potential harm of drug-related queries addressing tailored renal dosing. We developed a score assessing quality (AQUOS) to see how it varies with renal function and medication complexity (ie, the number of drugs prescribed). Furthermore, we evaluated 2 prompting languages, as in routine practice, many questions are asked in the native language, and no previous studies have compared multilingual performance in this context. We also tested whether the AI chatbot outputs are reproducible over time. In addition to the score, we assessed potential harm according to the World Health Organization (WHO) definition.

The cross-sectional, observational study is based on the CHART (Chatbot Assessment Reporting Tool) checklist (Checklist 1) [14].

Ethical Considerations

The ethics committee of the Medical Faculty of Leipzig University (231/24-ek) approved the procedure on July 29, 2024. Because patient data were collected retrospectively, written informed consent was not required and was not obtained. Anonymized prompts were used to ensure patient privacy.

Setting

In 2024, we used GeriDoc from the Geriatrics in Bavaria database to collect retrospective patient data from a geriatric hospital.

Patient Data

We included data from geriatric patients in the rehabilitation ward who were hospitalized in 2023.

The main analysis was planned to include 100 patients who met the inclusion criteria. An additional 10 patients were used for pretesting AQUOS and were excluded from the main analysis.

Inclusion Criteria

Eleven patients per month were randomly selected from January 2023 to October 2023. To be included, a patient had to have documented renal function (glomerular filtration rate [GFR]) and polymedication at discharge. Polymedication was defined as taking at least 5 drugs, with the components of combined preparations counted separately; for example, a combined preparation of 2 drugs counts as 2, not 1.

Study Design

In this study, we compared 4 different AI chatbots using structured inputs (prompts). After receiving ethical approval, we generated the AI outputs from October 18 to October 30, 2024, in Leipzig, Germany. The GFR was categorized into the 5 typical stages of renal disease: category 1 as normal renal function, category 2 as slightly reduced, category 3 as moderately reduced, category 4 as severely reduced, and category 5 as renal failure. The inputs were designed in both German and English, and outputs were evaluated at 2 time points, t0 and t1 (8 days later), to test reproducibility (not the learning effect after updates). Thus, t1 serves as the control for t0. We used AQUOS (Multimedia Appendix 1) for evaluation. Additionally, discharge medications were categorized by complexity, defined exclusively by the number of medications: low (5‐9 drugs), medium (10‐14 drugs), and high (≥15 drugs). This measure reflects structural complexity only and does not account for pharmacological risk, therapeutic drug classification, or potential drug-drug interactions.
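The two categorizations described above can be sketched in a few lines of Python. The medication thresholds are taken from the text. The GFR cutoffs are not stated in this article; the conventional chronic kidney disease staging boundaries (≥90, 60-89, 30-59, 15-29, and <15) are assumed here, and the function names are hypothetical.

```python
def gfr_category(gfr: float) -> int:
    """Return the renal function category (1-5) for a GFR value.

    ASSUMPTION: the cutoffs below are the conventional staging boundaries;
    the article names the 5 categories but does not list the thresholds.
    """
    if gfr < 0:
        raise ValueError("GFR must be non-negative")
    if gfr >= 90:
        return 1  # normal renal function
    if gfr >= 60:
        return 2  # slightly reduced
    if gfr >= 30:
        return 3  # moderately reduced
    if gfr >= 15:
        return 4  # severely reduced
    return 5      # renal failure


def medication_complexity(drug_count: int) -> str:
    """Return the count-based complexity class used in the study."""
    if drug_count < 5:
        # Fewer than 5 drugs does not meet the polymedication criterion.
        raise ValueError("polymedication requires at least 5 drugs")
    if drug_count <= 9:
        return "low"
    if drug_count <= 14:
        return "medium"
    return "high"
```

Under these assumed cutoffs, the example patient with a GFR of 62 ml/min and 5 drugs would fall into category 2 and low complexity.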

Prompting

We created a prompt based on prompt engineering, assigning roles to both the AI chatbot and the requester. We specified the renal function as GFR and listed the patient’s discharge medications and dosages. As a concrete task, we then instructed the chatbot to state precisely whether and how the dosage should be adjusted and to cite sources. The prompt was written in both German and English. The prompting structure was developed based on the literature [12,13] by 2 authors, CB and TB, who are both pharmacists. We adjusted the prompt based on the quality of the AI chatbot’s responses. The aim was to frame the query as a standardized zero-shot prompt without any follow-up prompts. Neither patients nor members of the public were involved in the development process.

An example prompt in English:

I am a physician in a hospital. Give your answers from a pharmacist’s point of view. It’s about a geriatric patient with a GFR of 62 ml/min who is taking the following medication: Acetylsalicylic acid 100 mg once a day, Ramipril 10 mg once a day, Atorvastatin 10 mg once a day, Pantoprazole 20 mg once a day, Metamizole 500 mg if required up to four times a day. Give a precise, short answer whether and how the dosage should be adjusted for the current GFR. Give reliable sources, including links, to your answer.

The prompt was deliberately restricted to GFR, drug name, dose, and frequency. This standardized, minimal input format was chosen to ensure comparability across all 100 patient cases and all 4 AI chatbots and to reflect a realistic scenario of brief, point-of-care queries as they might occur in routine clinical practice. We acknowledge that clinically valid renal dosage adjustment may additionally depend on variables such as the indication for each drug, route of administration, treatment duration, dialysis status, body weight, or the differentiation of acute and chronic renal impairment. While more complex prompt formats incorporating additional clinical variables could yield more nuanced AI outputs, such designs would compromise standardization and cross-case comparability, which was an essential aspect of this study. The omission of these variables limits the clinical interpretability of the findings and should be considered when applying results to real-world settings.

If multiple GFR values were included in the discharge letter, the median GFR was used. The drug dosing was provided as in the example, along with the drug name and dosage frequency. We extracted chatbot outputs from the platform and saved them locally, starting a new conversation for each prompt while ensuring previous conversations were cleared. We configured the software to prevent it from “remembering” previous conversations. If a network error occurred, the output was regenerated. After generating the outputs, we verified that the companies had made no relevant updates to the AI chatbots.
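As an illustration of the prompt construction described above, the following minimal sketch fills the English example prompt from a list of GFR values and medication strings, applying the median rule when several GFR values are available. The template wording is copied from the example prompt; the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the study itself entered prompts through each provider's web interface rather than through code.

```python
from statistics import median

# Wording copied from the English example prompt in the text.
PROMPT_TEMPLATE = (
    "I am a physician in a hospital. "
    "Give your answers from a pharmacist’s point of view. "
    "It’s about a geriatric patient with a GFR of {gfr} ml/min "
    "who is taking the following medication: {medication}. "
    "Give a precise, short answer whether and how the dosage should be "
    "adjusted for the current GFR. "
    "Give reliable sources, including links, to your answer."
)


def build_prompt(gfr_values, medications):
    """Build one standardized zero-shot prompt for a patient case.

    gfr_values: all GFR values found in the discharge letter; the median
                is used when there is more than one.
    medications: strings such as "Ramipril 10 mg once a day".
    """
    if not gfr_values or not medications:
        raise ValueError("at least one GFR value and one drug are required")
    gfr = median(gfr_values)
    # ":g" drops a trailing ".0" so that 62.0 is written as "62".
    return PROMPT_TEMPLATE.format(gfr=f"{gfr:g}",
                                  medication=", ".join(medications))
```

Keeping the wording in a single template means every patient case differs only in the GFR value and the medication list, which is the comparability the standardized format is meant to provide.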

Considering that the prompts were entered in German and English at 2 different times (t0 and t1) in 4 different AI chatbots each, 16 outputs were generated per patient.
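The factorial structure behind the 16 outputs per patient can be made explicit with a short enumeration. This is only a sketch of the study design as described; the record layout and the function name are hypothetical.

```python
from itertools import product

CHATBOTS = ("ChatGPT", "Copilot", "Gemini", "Scite")
LANGUAGES = ("German", "English")
TIME_POINTS = ("t0", "t1")


def outputs_for_patient(patient_id):
    """List every chatbot x language x time point combination for one patient."""
    return [
        {"patient": patient_id, "chatbot": bot, "language": lang, "time": t}
        for bot, lang, t in product(CHATBOTS, LANGUAGES, TIME_POINTS)
    ]


# 100 patients in the main analysis, each with 4 x 2 x 2 combinations.
all_outputs = [row for pid in range(1, 101) for row in outputs_for_patient(pid)]
```

With 100 patients in the main analysis, the enumeration yields 100 × 16 = 1600 records.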

AI Chatbots

When selecting the AI chatbots, we focused on large, well-known ones. Moreover, appropriate settings had to be available to prevent entered data from being used for future AI model updates.

In terms of terminology, “AI chatbot” refers to the conversational interface accessed by users, “LLM” refers to the underlying language model architecture, and “model” is used as a general term for naming the different versions.

We used OpenAI GPT-4 (gpt-4o-2024-11-20, knowledge cutoff at October 01, 2023), Microsoft Copilot Business, Google Gemini 1.5 Flash (gemini-1.5-flash-002), and Research Solutions Scite, all of which are closed source [15,16]. The exact models of the AI chatbots were not consistently accessible at the time of data collection, but we verified post hoc that no relevant model updates were released by the respective providers during the data collection window. We accessed the AI chatbots using the web interface of each provider. Furthermore, the standard settings (eg, temperature) were applied to all AI chatbots, which were used as base models as provided by their respective companies. For each patient, a new conversation was initiated by entering the prompt, the AI chatbot’s response was saved locally, and the conversation was deleted before the next case was processed. This procedure prevented carry-over effects between patient cases.

Quality Score (AQUOS)

Based on previous research [10,17,18], we implemented an AQUOS (Multimedia Appendix 1) to evaluate the outputs of the AI chatbots. The score consists of 9 items. The first 5 items, rated from 0 to 4, focus on completeness, referencing, reference suitability, correctness of dosage recommendations, and dosing accuracy. The sixth item, rated 0 or 1 point, checks for disclaimers and references to health care professionals, patient monitoring, and the individual patient case. The last 3 items allow point deductions for unnecessary additional information, incorrect use of medical terms, and inappropriate language or phrasing, with greater issues leading to greater point reductions. Thus, the highest achievable score was 21, which equals 100%.
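The arithmetic of the score can be sketched as follows: 5 items with up to 4 points each plus 1 point for item 6 give the maximum of 21, and deductions for items 7 to 9 are subtracted before converting to a percentage. The size of each deduction is defined in Multimedia Appendix 1 and is not reproduced here, so the function simply accepts non-negative deduction values as an assumption; the function name is hypothetical.

```python
MAX_POINTS = 21  # 5 items x 4 points + 1 point for item 6


def aquos_percent(graded_items, disclaimer_point, deductions):
    """Convert item ratings into an AQUOS percentage.

    graded_items: 5 ratings (0-4) for items 1-5.
    disclaimer_point: 0 or 1 for item 6.
    deductions: 3 non-negative point deductions for items 7-9.
                ASSUMPTION: the allowed deduction sizes are not given in
                the text, so no upper bound is enforced here.
    """
    if len(graded_items) != 5 or any(not 0 <= p <= 4 for p in graded_items):
        raise ValueError("items 1-5 need exactly 5 ratings between 0 and 4")
    if disclaimer_point not in (0, 1):
        raise ValueError("item 6 is rated 0 or 1")
    if len(deductions) != 3 or any(d < 0 for d in deductions):
        raise ValueError("items 7-9 need exactly 3 non-negative deductions")
    points = sum(graded_items) + disclaimer_point - sum(deductions)
    return round(100 * points / MAX_POINTS, 1)
```

Because deductions are subtracted from the total, the percentage can become negative. For example, 20 of 21 points corresponds to 95.2%, and a net total of −4 points corresponds to −19.0%; the item combinations that produce such totals in this sketch are illustrative and not taken from the scored outputs.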

As a reference standard for evaluating the correctness of the AI chatbot outputs, the German database “Dosing” [19], combined with the corresponding summary of product characteristics, was used. In cases of discrepancies, the more recent source was given preference. If discrepancies persisted, clinical guidelines, original publications, or information from the European Medicines Agency were consulted. Multimedia Appendix 2 provides examples of higher-quality and lower-quality AI chatbot outputs.

Assessment of Harm

In addition to AQUOS, the possible harm the chatbot could have caused with its response was assessed. The outputs were evaluated to determine whether the specific output could cause harm if provided to a physician as pharmaceutical advice. The possible harm was ranked using the Conceptual Framework for the International Classification for Patient Safety by the WHO Patient Safety [20]. The ranking of the harm ranges from “none” to “death,” according to the WHO categories: none, mild, moderate, severe, and death (0-5). Since the scaling differs from AQUOS, possible harm was considered separately.

Validation Procedure of AQUOS

The validation of the AQUOS scoring system was conducted in 2 sequential phases, following the principles of internal and external validation as commonly applied in laboratory and clinical assay development.

The validation design used in this study broadly follows the principles used in the development of clinical scoring systems, where initial reliability testing is followed by a comparison with expert consensus, including interrater reliability and the intraclass correlation coefficient [21-25]. The detailed validation procedure is provided in Multimedia Appendix 3.

Outcomes

We analyzed the points achieved in AQUOS to evaluate the quality of the AI chatbot outputs. Medication count–based complexity (categorized into 3 groups depending on the number of drugs, not the type of drugs or, eg, possible drug-drug interactions) and GFR (categorized according to the typical 5 stages, ranging from normal renal function to renal failure [26]) were analyzed in correlation with the score. Additionally, we investigated whether the AQUOS differs between English and German and whether the AI chatbot answers are reproducible over time (t0 and t1).

Statistical Methods

To analyze the data, we conducted descriptive analyses (mean, median, relative difference, and correlations) and a paired, 2-tailed t test to assess statistical significance, with P<.05 considered statistically significant. To evaluate the first validation phase, we calculated Cohen κ to examine whether the score was objective enough to continue with just 1 rater in the study. For the second validation phase of AQUOS, we conducted an intraclass correlation coefficient analysis among the expert opinions, and AQUOS was then correlated with the median of the expert ratings using the Spearman correlation.
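For readers unfamiliar with Cohen κ, the following minimal sketch shows how agreement between 2 raters is corrected for chance agreement. It implements the unweighted form of the statistic, which is an assumption, since the text does not state whether a weighted variant was used. The study itself performed the analysis in SPSS and Excel, not with this code.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen kappa for 2 raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items with identical ratings.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories (Counter returns 0 for unseen categories).
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement, and a value of 0 indicates agreement no better than chance.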

The statistical analysis was performed using IBM SPSS Statistics version 29 and Excel version 2408 from Microsoft 365.

Overview

The main analysis included 100 geriatric patients, while the pilot study involved 10 geriatric patients. The mean (SD) number of discharge medications was 11.4 (4.2) for the primary group and 11.1 (2.8) for the pilot group. Patients’ characteristics and the medication count from the main analysis are summarized in Table 1.

In total, we generated 1600 outputs using the AI chatbots. More precisely, this means that 16 outputs were generated per patient using 4 chatbots (ChatGPT [OpenAI], Copilot [Microsoft], Gemini [Google], and Scite [Research Solutions]), each at 2 different times for the investigation of reproducibility (t0 and t1) and in 2 different languages (German and English). Multimedia Appendix 2 provides examples of how the AI chatbots’ outputs were evaluated. A study flow diagram illustrating patient selection, the pilot sample, the derivation of the final 100 included patient cases, and the generation of 1600 AI chatbot outputs is provided in Figure 1.

Table 1. Patient characteristics (N=100; all geriatric patients) and medication count from the main analysis^a.

Characteristics	Low complexity	Medium complexity	High complexity
Number of patients	33	46	21
GFR^b, mean (SD; min-max)	67.9 (18.3; 28.0-109.0)	64.6 (18.1; 19.5-100.5)	61.2 (25.1; 15.0-97.0)

^aNumber of drugs taken in different complexity intervals: low complexity=5‐9 drugs, medium complexity=10‐14 drugs, and high complexity=at least 15 drugs. The renal function was measured as GFR.

^bGFR: glomerular filtration rate.

**Figure 1.** Study flow diagram. From 110 randomly selected geriatric patients (11 patients/mo, January 2023-October 2023), 10 were allocated to the pilot sample for artificial intelligence quality output score (AQUOS) validation, and 100 to the main analysis. The 100 patients included in the main analysis were each prompted across 4 AI chatbots, 2 languages, and 2 time points, resulting in 1600 outputs evaluated using AQUOS and World Health Organization (WHO) harm classification. GFR: glomerular filtration rate.

Validation of AQUOS

As a result of the first validation phase, the score was considered objective with a Cohen κ of 0.971, so a single rater, a pharmacist, was responsible for scoring. No patients or members of the public were included in the scoring process. For the second score validation phase, the intraclass correlation coefficient of 0.906 (95% CI 0.795‐0.974; P<.001) shows excellent agreement between the raters of the expert panel. In addition, the Spearman correlation between AQUOS and the median of the expert panel (ρ=0.650; 95% CI 0.012‐0.912; P=.04) validates AQUOS externally against expert ratings.

Renal Function

There were 5 patients in the normal renal function category, 62 in category 2 (slightly reduced), 25 in category 3 (moderately reduced), 8 in category 4 (severely reduced), and no patients in category 5 (renal failure) based on GFR classification.

The trends of overall AQUOS in German and English, as shown in Figure 2 and Multimedia Appendix 4, decline with worsening renal function (ChatGPT: –0.215, P=.03; Copilot: –0.258, P=.01; Scite: –0.357, P<.01; Multimedia Appendix 5). Regarding the English outputs, Copilot and Gemini were exceptions, with higher overall mean scores in category 4 (Copilot: 71.4%, SD 9.4%; Gemini: 45.8%, SD 31.4%) than in category 3 (Copilot: 69.3%, SD 16.0%; Gemini: 41.9%, SD 60.0%). The maximum overall mean AQUOS was reached by ChatGPT (81.0%, SD 18.6%). Gemini had the lowest overall mean AQUOS of 41.9% (SD 60.0%). ChatGPT reached the highest single overall AQUOS with 95.2%, while Gemini achieved the lowest overall score in English (−19.0%).

Gemini caused possible mild harm in category 2 (mean 0.6, SD 0.9). In category 3, mild harm was noted in Copilot (mean 0.5, SD 0.7), Gemini (mean 0.7, SD 0.7), and Scite (mean 0.5, SD 0.7). Category 4 indicated overall possible mild harm across all chatbots. Significant correlations between possible harm and GFR categories were found: ChatGPT (0.396; P<.001), Copilot (0.476; P<.001), and Scite (0.443; P<.001).

**Figure 2.** Mean overall scores (artificial intelligence quality output score [AQUOS]) of 4 AI chatbots based on glomerular filtration rate (GFR) categories (A, B) and the complexity of medication (C, D) in both German and English. AI: artificial intelligence.

Medication Count–Based Complexity

In the low-complexity category, 33 patients were included; in the medium-complexity category, 46 patients were included; and in the high-complexity category, 21 patients were included.

The trends of overall AQUOS in German and English are shown in Figure 2, with specific values provided in Multimedia Appendix 6. The overall AQUOS declines as the complexity increases (Scite: −0.239, P=.02; Multimedia Appendix 5). Regarding the English outputs, the only exception is Copilot, where AQUOS increases from medium to high complexity but does not exceed the score in low complexity. The maximum overall mean AQUOS was achieved by ChatGPT (79.1%, SD 9.0%). The lowest overall mean AQUOS was observed for Gemini (43.3%, SD 25.5%). In low complexity, ChatGPT reached the highest single overall AQUOS (95.2%), while Gemini reached the lowest single overall score (−19.0%).

Gemini caused possible harm in medium complexity, with a mean of 0.9 (SD 1.0), corresponding to mild harm. In high complexity, possible harm was mild and was caused by Gemini (mean 0.6, SD 1.0). Significant correlations between possible harm and complexity were found only in German: ChatGPT (0.219; P=.03), Copilot (0.226; P=.02), and Scite (0.237; P=.02).

Prompting Language

For every chatbot except Copilot, the overall AQUOS was higher when the prompt was in English than in German (Figure 3). Copilot’s overall AQUOS was the same in both languages (median 71.4% [IQR 13.3%] in German and 71.4% [IQR 6.7%] in English). ChatGPT achieved the highest scores in English (median 76.2% [IQR 12.5%]), while Gemini reached the lowest scores in German (median 42.9% [IQR 33.3%]). The median indicated no possible harm; however, the IQR for Gemini extended to mild harm in both languages.

Reproducibility

There was no significant difference between t0 and t1 regarding the overall AQUOS (Table 2). However, some single-score criteria showed significant differences.

In low complexity, there was a significant difference for Gemini in English regarding AQUOS criterion 6 (referral to medical or pharmaceutical professionals or follow-up checks), with a relative difference of 15.2% (P=.02). In medium complexity, there was a significant difference for ChatGPT in German for criterion 3 (suitability of the references given), with a relative difference of 5.4% (P=.01). Other significant differences were observed for Gemini, all in English, for criterion 4 (correct dose recommendation) and possible harm, with corresponding relative differences of 16.9% (P=.004) and 12.5% (P=.001).

Table 2. Reproducibility for all overall artificial intelligence quality output scores (AQUOS; t0 vs t1) and relative differences (P value)^a.

Complexity	ChatGPT		Copilot		Gemini		Scite
Complexity	German	English	German	English	German	English	German	English
Low complexity (%)	2.50 (P=.94)	1.89 (P=.17)	2.95 (P=.08)	2.42 (P=.51)	2.58 (P=.12)	5.33 (P=.72)	2.42 (P=.16)	1.52 (P=.34)
Medium complexity (%)	2.66 (P=.08)	2.23 (P=.66)	1.47 (P=.41)	2.07 (P=.94)	1.20 (P=.52)	6.03 (P=.42)	1.74 (P=.25)	0.82 (P=.66)
High complexity (%)	4.40 (P=.41)	2.98 (P=.23)	4.29 (P=.86)	1.43 (P=.43)	2.74 (P=.69)	4.29 (P=.68)	2.62 (P=.75)	2.14 (P=.37)

^aNumber of drugs taken in different complexity intervals: low complexity=5‐9 drugs; medium complexity=10‐14 drugs; high complexity=at least 15 drugs, and the 4 different chatbots (ChatGPT, Copilot, Gemini, and Scite).

Key Findings