Published on 23.Jan.2026 in Vol 5 (2026)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/77149.
Leveraging Large Language Models to Improve the Readability of German Online Medical Texts: Evaluation Study


1Faculty of Informatics, Heilbronn University, Max-Planck-Str. 39, Heilbronn, Germany

2Research and Innovation Center for Cognitive Service Systems (KODIS), Fraunhofer Institute for Industrial Engineering, Stuttgart, Germany

3Consumer Health Informatics Special Interest Group, German Association for Medical Informatics, Biometry and Epidemiology (GMDS) e.V., Cologne, Germany

Corresponding Author:

Monika Pobiruchin, Dr sc hum


Background: Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.

Objective: The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.

Methods: A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, 4 LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared to those of the rephrased LLM versions. Student t tests for paired samples were used to test whether the rephrasing reduced the readability scores, ideally to the eighth grade level or below.

Results: Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). The LLMs achieved the following average scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-4o, and Microsoft Copilot showed a statistically significant improvement in readability. However, the t tests showed no statistically significant reduction of scores to the eighth grade level or below.

Conclusions: LLMs can moderately improve the readability of German-language PEMs, which can support patients reading such materials online. By increasing readability, LLMs demonstrated their potential to make complex online medical texts more accessible to a broader audience. This is the first study to evaluate this for German online medical texts.

JMIR AI 2026;5:e77149

doi:10.2196/77149


Overview

In the digital era, health information is widely available [1] and exists in many different forms, for example, Wikipedia articles, health-related websites, leaflets, and brochures [2], categorized as patient education materials (PEMs). Such materials support medical laypersons in learning about diseases, diagnoses, therapies, etc [3]. Health information should be easy to understand for the general population in order to promote health literacy [4]. In this context, the COVID-19 pandemic confirmed the need to improve general scientific and health literacy [5-7].

However, Zowalla et al [3,8], Rooney et al [9], Yeung et al [5], Gordejeva et al [10], and others have shown that the readability of health information is often unsatisfactory regardless of its source (online material, booklets), authors (official bodies and institutions, individuals), or language. Many of these resources pose a challenge due to their sentence complexity and use of expert language, making it difficult for laypersons to effectively read and understand such material.

Artificial intelligence (AI) offers potential for substantial improvements in health care applications and has become omnipresent in recent years [11]. In particular, large language models (LLMs) represent promising opportunities [12,13]. In this context, LLMs can be leveraged to improve the readability of existing PEMs intended for citizens.

Because LLMs are easily accessible to everyone [14], citizens and patients alike will use them to seek health information online, answers to specific questions, or even therapy advice, much as they use internet search engines [15].

For these reasons, citizens will use LLMs to translate complex medical terminology and to simplify text material, aiming for improved comprehensibility [16,17]. Moreover, an increased integration of AI in the curation of health information offers several benefits for institutions [18], primarily time and cost savings.

Prior Work

There is a decades-long research tradition on how to use and implement decision support systems, machine learning, and AI solutions in health care. Since 2023, with the widespread availability of LLMs [19], these technologies have been explored for beneficial health care use cases [15] in several medical domains [20-28].

Researchers have investigated how publicly available LLMs affect patients’ information-seeking behavior. Eng et al [29] entered questions about rotator cuff repair surgery into ChatGPT-3.5 and had 2 orthopedic surgeons analyze the answers. The questions were derived from frequently asked questions (FAQ) pages of various patient information websites. The answers by the LLM were evaluated in terms of readability (Flesch-Kincaid grade level); the Journal of the American Medical Association Benchmark criteria and the DISCERN score [30] were also used to evaluate the reliability and quality of health information on the internet. The average readability level of the generated answers corresponded to that of a college freshman (Flesch-Kincaid grade of 13.4).

Similar work was conducted by Mika et al [31], who fed ChatGPT with “ten frequently asked questions regarding total hip arthroplasty.” They found that the generated answers were fairly accurate and would be easily understood by patients. Another commonly used readability metric is the Flesch reading ease (FRE) score, which ranges from 0 to 100; lower values indicate lower readability, that is, text that is more difficult to read.

Li et al [32] let ChatGPT process 400 English radiology reports (mean length 164 words, SD 117). The FRE improved from “38.0±11.8” to “83.5±5.6”.

Similar results were reported by Moons et al [33] who used ChatGPT and Google Bard to reformulate 3 “selected patient information sections from scientific journals.” Google Bard was able to reduce the reading level of the texts to that of sixth graders. However, this was achieved by omission of “significant information” [33].

In an analysis of PEMs for bariatric surgery, ChatGPT (versions 3.5 and 4) and Google Bard were able to improve the readability levels of 66 frequently asked questions pages of US-based hospitals. The mean FRE scores of the original texts were “48.1 (SD 19.0), which corresponded to ‘difficult to read,’ while initial responses from GPT-3.5, GPT-4.0 and Bard achieved mean scores of 31.4 (SD 11.4), 42.7 (SD 9.7), and 56.3 (SD 11.6), which corresponded to ‘difficult to read,’ ‘difficult to read,’ and ‘fairly difficult to read,’ respectively” [34]. The authors also evaluated the accuracy of the simplified texts. The majority of the LLM responses were equal in accuracy to the original texts, but quality varied among LLMs. Srinivasan et al [34] stress the “importance of evaluating both the readability and quality” of rephrased texts for patient information.

This is also in line with the conclusion by Pal et al [35], who recommend training more specialized LLMs for tasks in the medical domain. They propose that this will add credit and reliability to the answers produced by LLMs in the clinical setting.

Focusing on non-English evaluations, some research has been published for expert-centric scenarios: a multilingual benchmark set for answering medical exam questions was developed in the “MedExpQA” study [36]. It contains standardized answers from health experts. To assess the accuracy of answers to medical questions, the study analyzed LLMs with and without retrieval-augmented generation methods. The models performed worse in French, Italian, and Spanish than in English. In addition, difficulties such as the tendency to generate incorrect answers and the use of outdated information were identified.

Heilmeyer et al [18] focused on German medical text: they “assessed the feasibility of using nonproprietary LLMs of the GPT variety as writing assistance for medical professionals.” Pretrained LLMs were further trained on electronic health records from more than 82,000 patient encounters at a German outpatient clinic. The AI tools proved to be “helpful writing assistance” for medical experts when creating patient reports.

As of today, no readability evaluation has been conducted for LLM-rephrased German health texts from the citizens’ perspective. By citizens’ perspective, this study refers to evaluating LLM-rephrased health texts as they would be experienced by an average layperson without specialized knowledge or expertise in prompt engineering. This approach reflects the realistic scenario of laypersons seeking health information online, using freely accessible tools without systematically optimizing prompts or using application programming interfaces (APIs) to tune LLM model parameters.

Aims of the Study

The aim of the study is to investigate, from a layperson’s perspective, whether LLMs can simplify German online medical texts and increase their readability to a recommended level, that is, the eighth grade [37,38].

In this context, 3 specific aims were defined as:

  1. Rephrase German medical texts with LLMs,
  2. Compute their readability, and
  3. Evaluate if the AI-rephrased texts showed a higher level of readability.

Medical Text Corpus

Previous research and a prior sample size calculation (see Statistical Analysis) indicated that the desired reduction in the Wiener Sachtextformel (4th Vienna Formula; WSTF) score to meet the recommended grade level could be shown with a rather small sample (n<10). On this basis, a medical text corpus was compiled to represent high-prevalence diseases (eg, cancer and diabetes mellitus) and major public health topics (eg, vaccinations), as well as national bodies and institutions, such as the national health portals of Germany and Austria, and popular online health websites.

For a full list of included content providers and websites, see Multimedia Appendix 1.

First, all 60 websites were visited with a Chrome web browser. Next, the corresponding texts were manually extracted without any HTML-related markup. The resulting 60 plain text files of the corpus were then used for further processing.

Readability Analysis

The term readability “refers to the properties of written text […] it reflects the (1) complexity of a text’s structure, (2) sentence structure, and (3) chosen vocabulary” [10]. For the German language, 2 well-known readability metrics are established: (1) an adapted version of the (original English) FRE [39] for the German language by Amstad [40] and (2) the WSTF by Bamberger and Vanecek [41]. Both metrics use text parameters like average sentence length and average number of syllables per word.

The FRE score ranges from 0 to 100; lower values indicate lower readability, that is, the text is more difficult to read. The WSTF metric ranges from 4 to 15, which roughly corresponds to school grades. For instance, if a text scores a value of 10, at least 10 years of schooling are necessary for readers to understand this text.
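For illustration, both metrics can be approximated directly from raw text. The following minimal Python sketch uses the commonly cited coefficients for Amstad’s German FRE adaptation and the 4th Vienna formula; the vowel-group syllable counter is a simplification, and the actual computations in this study were performed with the dedicated framework described in the next paragraph.

```python
# Minimal sketch of the two German readability metrics; the syllable counter
# (vowel groups) is a rough approximation, so scores will deviate slightly
# from those of dedicated tools such as the framework used in this study.
import re

def count_syllables(word: str) -> int:
    # Approximate German syllable count as the number of vowel groups.
    return max(1, len(re.findall(r"[aeiouyäöü]+", word.lower())))

def german_readability(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    asl = len(words) / len(sentences)                           # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)   # syllables per word
    ms = 100 * sum(count_syllables(w) >= 3 for w in words) / len(words)
    fre_german = 180 - asl - 58.5 * asw                         # Amstad's German FRE
    wstf4 = 0.2744 * ms + 0.2656 * asl - 1.693                  # 4th Vienna formula
    return fre_german, wstf4
```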

For the readability computations of all texts and to address research aim (2), the analysis framework by Wiesner et al [42] was used. The analysis was conducted on a Windows 10 64-bit computer with Java Runtime Environment (version 18; Oracle Corporation).

Selection of Large Language Models

A scoping review of well-known code platforms such as GitHub [43] or Hugging Face [44] was conducted to search for available LLMs. In addition, online literature databases such as the Association for Computing Machinery Digital Library and Institute of Electrical and Electronics Engineers Xplore were searched to scan publications that already used LLMs (see Table 1).

Table 1. Overview of various large language models available as of April 2024.
Name | Year | Domain | Developer | Availability | Open source | Free to use | Language
ChatGPT 3.5 [45] | 2022 | General | OpenAI | Web | No | Yes | ENa
GPT4 [45] | 2023 | General | OpenAI | Web | No | No | EN
Google Gemini [46] | 2023 | General | Google | Web | No | Yes | EN
BERT [47] | 2018 | General | Google | Local | Yes | Yes | EN
Llama 2 [48] | 2023 | General | Meta | Local | Yes | Yes | EN
Claude 2 [49] | 2023 | General | Anthropic | Web and Local | No | Yes | EN
T5 [50] | 2019 | General | Google | Local | Yes | Yes | EN
BLOOM [51] | 2022 | General | Big Science | Local | Yes | Yes | EN
Microsoft Copilot [52] | 2021 | General | Microsoft | Web | No | Yes | EN
Falcon LLM [53] | 2023 | General | Technology Innovation Institute | Local | Yes | Yes | EN
Le Chat [54] | 2024 | General | Mistral AI | Web | No | Yes | EN
Phönix [55] | 2023 | General | Matthias Uhlig | Local | Yes | Yes | GERb
LeoLM 7b/13b [56] | 2023 | General | LAION and HessianAI | Web and Local | Yes | Yes | GER
MedAlpaca [57] | 2023 | Medical | Tianyu Han et al | Local | Yes | Yes | EN
BioMedLM [58] | 2024 | Biomedical | Stanford CRFM | Local | Yes | Yes | EN

aEN: English.

bGER: German.

Some important aspects and criteria influenced the final selection: the language of the LLM (preferably, a German-trained model should be used) and the specific field or domain of the LLM (general or medical).

Some LLMs can only be executed locally, while others can be used via a web front end. The latter would be preferable because, in our use case, LLMs should be usable by laypeople, who have neither the hardware capabilities at home nor the technical knowledge to host and operate LLMs. Preferably, the use of the LLM should be free of charge and open to use, that is, not in a beta test phase or available only to persons with a technical background (ie, programming skills).

Of 15 candidate LLMs, only 3 met the previously outlined criteria and were selected: (1) ChatGPT 3.5, (2) Microsoft Copilot, and (3) Le Chat. In May 2024 (after the LLM scoping review phase), OpenAI presented and launched their new release: GPT-4o. The authors decided to include this model as well. LeoLM (13b) was initially considered but later excluded due to frequent inaccuracies, very short or context-inadequate outputs, occasional language mismatches (answer in English instead of German), and overall unreliability in handling the text material.

Accuracy of Rephrased Health Information Texts

The answers generated by each LLM were independently assessed by 3 reviewers (AM, RZ, MP) with a background in medical informatics. Assessments focused on the medical accuracy, clarity, and plausibility of the information provided, ensuring that each response was evaluated not only for linguistic quality but also for its alignment with established medical knowledge and standards. Interrater agreement was measured by calculating Fleiss κ and percent agreement.
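As a minimal sketch, the agreement statistics reported in the Results section could be computed as follows, assuming the 3 ratings per answer are coded numerically (the data shown are hypothetical, and the exact percent-agreement definition used in the study may differ):

```python
# Minimal sketch of the interrater agreement computation: one row per LLM
# answer, one column per reviewer, with numerically coded ratings
# (hypothetical data; the study's rating scheme is not reproduced here).
from itertools import combinations

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1, 1, 0],   # answer 1: ratings by the 3 reviewers
    [1, 1, 1],   # answer 2
    [0, 1, 0],   # answer 3
    [2, 2, 1],   # answer 4
])

table, _ = aggregate_raters(ratings)   # category counts per answer
kappa = fleiss_kappa(table)

# Average pairwise percent agreement over the three reviewer pairs.
pairs = combinations(range(ratings.shape[1]), 2)
agreement = np.mean([np.mean(ratings[:, i] == ratings[:, j]) for i, j in pairs])
print(f"Fleiss kappa = {kappa:.3f}, percent agreement = {agreement:.1%}")
```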

Prompt Engineering

Prompt engineering refers to the process of designing and optimizing the input prompts for an LLM. The structure and content of a prompt can greatly influence the quality of the output generated by the LLM. Today, several techniques have evolved to obtain better results from LLMs:

  • Few-shot prompting provides examples within a prompt to clarify instructions [59]. This approach improves the model’s ability to interpret and respond accurately to the task, as the examples provided serve to establish patterns and context. The term ‘few’ indicates that a limited number of examples are provided as opposed to zero-shot prompting, where no examples are given.
  • Chain of thought prompting breaks down complex tasks into steps within a prompt [60]. This approach mimics human problem solving, guiding the LLM through structured reasoning to generate more accurate responses, especially for tasks that require multiple levels of reasoning.
  • Clue and reasoning prompting introduces a structured reasoning approach [61]. Unlike the other methods, clue and reasoning prompting targets complex linguistic features (eg, irony, contrast, and intensification) and involves 2 stages: the LLM (1) identifies clues within the input (eg, keywords, tone, and references) and (2) uses these clues to perform a reasoning process. This step-by-step approach aims to fill gaps in the LLM’s reasoning capabilities, enabling it to deal more effectively with complex linguistic variations.

For the average person seeking health information online, advanced prompting techniques may be too complex to apply. These techniques require understanding how to structure input for LLMs. Few-shot prompting, for instance, involves providing examples within a prompt, requiring users to explain their needs clearly for effective interpretation. Most people would find designing such prompts difficult and time-consuming, especially when simply needing help understanding the provided health information.
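To illustrate the difference, the sketch below contrasts a zero-shot prompt with a hypothetical few-shot variant for the simplification task; the wording is illustrative only, and the prompts actually used in this study were formulated in German and are quoted in translation below.

```python
# Illustrative contrast between a zero-shot and a few-shot simplification
# prompt (hypothetical wording, not taken from the study).
medical_text = "..."  # placeholder for the original PEM text

zero_shot_prompt = (
    "Simplify the following text in plain language without reducing its "
    f"content: {medical_text}"
)

few_shot_prompt = (
    "Simplify medical texts in plain language.\n"
    "Example 1: 'arterial hypertension' -> 'high blood pressure'\n"
    "Example 2: 'The fracture was treated with osteosynthesis.' -> "
    "'The broken bone was fixed with plates and screws.'\n"
    f"Now simplify this text: {medical_text}"
)
```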

For this reason, the authors decided to use a zero-shot prompting approach. First, an extensive role-prompt approach was evaluated with a subset of the medical text corpus (6‐12 texts) and the 3 web-based LLMs. This prompt contained context information and provided a detailed task description for the LLM. However, with this prompt, the readability of the generated texts decreased:

I, a person with no specialist medical knowledge, would like to understand as simply as possible how a stroke is treated by medical staff. I have read a text from gesund.bund.de, which I did not understand because of the medical terminology. Your task as AI ChatBot is to rewrite the following text so that I can understand it completely at the end. Here is the text: {TEXT}

Iteratively, other approaches were tested, eg, an English prompt versus a German prompt, or prompts with specific instructions targeting a given readability score. Finally, a reduced German role prompt yielded the best results:

A person with no medical knowledge wants to understand a text. You, as a large language model, must simplify the following text for this person using simple language without reducing the content. Here is the text: {TEXT}

Every text was submitted to every LLM using this prompt, combined with the actual medical text. Due to the limit of 4000 characters for Microsoft Copilot, the texts were split, and several requests were made. Eventually, a total of 240 LLM conversations were conducted between May and July 2024.
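As a minimal sketch, such a character limit can be respected by splitting a text at sentence boundaries before submitting the parts; in this study, the splitting and pasting into the web front ends was done manually.

```python
# Minimal sketch of splitting a long text at sentence boundaries so that each
# chunk stays below a chat interface's character limit (4000 characters for
# Microsoft Copilot).
import re

def split_for_chat(text: str, limit: int = 4000) -> list[str]:
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if current and len(current) + len(sentence) + 1 > limit:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```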

Statistical Analysis

Readability scores for the original and rephrased texts are presented as mean and SD. The Student t test for paired samples was used to test the reduction of readability scores from before to after the rephrasing. Prior research on German medical texts [3,10,42] yielded a mean readability of 12.46 (SD 1.84) for the WSTF. This means that a reduction of 4.5 grade levels would result in the recommended reading level of 8, that is, a score ≤8.0. Given these parameters, a sample size of 4 would be needed to show such an improvement with an alpha error of 0.05 (adjusted for multiple testing according to the Holm-Bonferroni method [62]) and a power of 95%. The sample size was calculated with G*Power 3.1 [63].
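The reported sample size can be checked approximately with standard statistics libraries instead of G*Power; the sketch below assumes a one-sided paired t test with an effect size of 4.5/1.84.

```python
# Approximate check of the sample size calculation: a 4.5-grade WSTF reduction
# with SD 1.84 corresponds to a paired-samples effect size of about 2.4.
from statsmodels.stats.power import TTestPower

effect_size = 4.5 / 1.84
n = TTestPower().solve_power(effect_size=effect_size, alpha=0.05,
                             power=0.95, alternative="larger")
print(n)  # roughly 4 paired texts, in line with the value reported above
```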

After the analysis of the readability scores of the text corpus, the average readability was calculated as WSTF 11.24 (SD 1.29) and FRE 35.92 (SD 7.64). Thus, only a reduction of 3.5 grade levels (for the WSTF) would be needed. For the FRE metric, an increase of 25 score points is needed to reach an eighth grade readability level, that is, an FRE score between 60 and 70.

The hypotheses were formulated as follows:

H_WSTF|0: μ_orig − μ_LLM ≤ 3.5
H_WSTF|1: μ_orig − μ_LLM > 3.5

The tests for the FRE metrics were constructed in a similar manner:

H_FRE|0: μ_LLM − μ_orig ≤ 25
H_FRE|1: μ_LLM − μ_orig > 25

In addition, to show if LLMs improved the readability at all, paired t tests were conducted. The tests were constructed as follows:

H_WSTF|0: μ_orig ≤ μ_LLM
H_WSTF|1: μ_orig > μ_LLM

For the FRE metrics, the hypotheses were:

H_FRE|0: μ_LLM ≤ μ_orig
H_FRE|1: μ_LLM > μ_orig
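A minimal sketch of how these one-sided paired tests against a margin, together with the Holm-Bonferroni adjustment, could be computed is shown below; the variable names are hypothetical, the WSTF case is shown, and the FRE tests follow analogously with the difference reversed and a margin of 25.

```python
# Minimal sketch of the one-sided paired tests against the reduction margin,
# with Holm-Bonferroni adjustment across the 4 LLMs (hypothetical variable
# names; wstf_orig and each element of rephrased hold one score per text).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def margin_test(orig: np.ndarray, llm: np.ndarray, margin: float) -> float:
    # H1: mean(orig - llm) > margin, ie, the rephrased texts are more than
    # `margin` WSTF grade levels easier to read than the originals.
    return stats.ttest_1samp(orig - llm, popmean=margin,
                             alternative="greater").pvalue

# p_values = [margin_test(wstf_orig, wstf_llm, 3.5) for wstf_llm in rephrased]
# reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
```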

Readability of the Original Health Information Texts

Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64); this corresponds to 12 years of schooling. The text from website W39 was the most difficult to read (WSTF 13.97, FRE 16.74); the text from website W7 had the best readability (WSTF 8.70, FRE 51.02). Only 2 websites achieved a grade level of 8 (W7, W9). Table 2 presents the calculated WSTF and FRE scores for the original health information texts with their means and SDs.

Table 2. Computed readability scores and number of words for 60 medical information texts.
Website | WSTFa,b | FREc,d | Number of wordse
W1 | 9.36 | 43.93 | 950
W2 | 10.63 | 41.92 | 1021
W3 | 10.70 | 44.26 | 1007
W4 | 9.83 | 41.46 | 784
W5 | 10.80 | 36.40 | 1909
W6 | 11.01 | 41.23 | 1131
W7 | 8.70 | 51.02 | 907
W8 | 10.84 | 34.30 | 1017
W9 | 8.90 | 47.80 | 1279
W10 | 10.65 | 38.82 | 1434
W11 | 10.01 | 43.06 | 898
W12 | 12.00 | 28.65 | 1214
W13 | 11.68 | 31.91 | 780
W14 | 10.77 | 43.18 | 597
W15 | 12.28 | 33.36 | 1205
W16 | 9.35 | 46.21 | 661
W17 | 10.32 | 41.83 | 780
W18 | 10.30 | 44.75 | 832
W19 | 10.85 | 39.14 | 1321
W20 | 11.96 | 29.32 | 839
W21 | 11.36 | 34.43 | 4225
W22 | 11.11 | 34.62 | 2999
W23 | 11.93 | 29.02 | 114
W24 | 11.43 | 34.48 | 2192
W25 | 11.55 | 38.69 | 1058
W26 | 9.65 | 45.50 | 660
W27 | 10.93 | 38.60 | 425
W28 | 11.35 | 29.45 | 706
W29 | 11.27 | 27.70 | 648
W30 | 11.85 | 27.67 | 562
W31 | 9.27 | 46.62 | 1266
W32 | 9.17 | 46.23 | 2657
W33 | 10.33 | 43.09 | 1306
W34 | 11.50 | 35.65 | 760
W35 | 9.20 | 46.04 | 2672
W36 | 10.82 | 36.20 | 1472
W37 | 9.57 | 44.36 | 1370
W38 | 11.60 | 32.86 | 1173
W39 | 13.97 | 16.74 | 1343
W40 | 11.90 | 30.39 | 1948
W41 | 11.13 | 36.13 | 1678
W42 | 11.08 | 37.84 | 3960
W43 | 11.35 | 40.01 | 794
W44 | 10.97 | 37.84 | 2232
W45 | 11.87 | 30.18 | 1236
W46 | 13.36 | 21.97 | 1527
W47 | 12.49 | 27.99 | 2072
W48 | 13.66 | 24.65 | 2063
W49 | 12.01 | 32.93 | 1117
W50 | 13.86 | 22.18 | 1838
W51 | 11.62 | 37.14 | 762
W52 | 12.58 | 31.80 | 1642
W53 | 10.22 | 40.70 | 516
W54 | 14.28 | 19.45 | 1199
W55 | 13.88 | 22.92 | 1197
W56 | 12.69 | 30.66 | 3383
W57 | 12.02 | 32.44 | 2541
W58 | 11.41 | 39.02 | 746
W59 | 10.90 | 36.29 | 1530
W60 | 12.03 | 31.86 | 2411

aWSTF: Wiener Sachtextformel (4th Vienna Formula).

bWSTF mean 11.24 (SD 1.29).

cFRE: Flesch reading ease.

dFRE mean 35.92 (SD 7.64).

eNumber of words, mean 1409 (SD 840).

Readability of the Rephrased Health Information Texts

Overall, the texts rephrased by the LLMs show an improved readability compared to the original texts. However, the degree of the readability improvements varies greatly.

ChatGPT-3.5 had, on average, a score of 9.96 (SD 1.52) for WSTF, ChatGPT-4o had a mean score of 10.6 (SD 1.37), Microsoft Copilot had a mean score of 8.99 (SD 1.10), and Le Chat had a mean score of 11.7 (SD 1.47). Microsoft Copilot achieved the best readability values compared to the other LLMs (see Table 3).

Table 3. Computed readability scores and number of words with mean readability score and SDs, and average differences of original and large language model texts.
Website | ChatGPT-3.5: WSTFa / FREb / Words | ChatGPT-4o: WSTF / FRE / Words | Microsoft Copilot: WSTF / FRE / Words | Le Chat: WSTF / FRE / Words
W1 | 9.81 / 46.17 / 242 | 10.25 / 39.87 / 496 | 8.35 / 51.72 / 845 | 10.13 / 39.35 / 446
W2 | 8.38 / 56.14 / 286 | 9.72 / 43.59 / 281 | 8.72 / 55.38 / 710 | 11.27 / 40.76 / 798
W3 | 10.85 / 41.58 / 305 | 11.36 / 34.35 / 501 | 8.60 / 53.71 / 817 | 12.56 / 34.88 / 471
W4 | 9.52 / 41.80 / 364 | 11.78 / 35.22 / 370 | 7.69c / 54.29 / 610 | 11.15 / 36.23 / 456
W5 | 9.59 / 43.00 / 189 | 13.17 / 23.16 / 273 | 8.45 / 49.52 / 1541 | 11.84 / 31.16 / 914
W6 | 10.63 / 45.58 / 182 | 12.51 / 29.02 / 368 | 9.27 / 52.57 / 841 | 12.06 / 36.92 / 518
W7 | 8.60 / 47.11 / 310 | 11.97 / 31.22 / 565 | 6.78 / 60.42 / 746 | 8.60 / 51.02 / 540
W8 | 11.50 / 31.89 / 247 | 10.94 / 35.65 / 548 | 9.51 / 39.05 / 839 | 10.75 / 33.19 / 898
W9 | 7.01 / 55.54 / 392 | 11.45 / 31.89 / 368 | 7.50 / 54.58 / 905 | 9.09 / 46.91 / 884
W10 | 9.46 / 45.68 / 375 | 11.10 / 36.70 / 404 | 8.10 / 54.06 / 1272 | 12.59 / 26.46 / 502
W11 | 10.97 / 38.87 / 246 | 12.41 / 32.79 / 289 | 7.80 / 57.70 / 711 | 13.62 / 25.75 / 359
W12 | 10.01 / 44.82 / 278 | 9.89 / 44.55 / 371 | 10.50 / 36.75 / 861 | 11.08 / 36.80 / 385
W13 | 11.04 / 39.63 / 281 | 9.93 / 42.85 / 316 | 7.49 / 60.94 / 529 | 12.85 / 27.51 / 588
W14 | 11.58 / 42.30 / 195 | 11.04 / 42.11 / 425 | 8.42 / 55.94 / 519 | 13.13 / 33.64 / 433
W15 | 11.98 / 38.45 / 422 | 11.51 / 30.80 / 335 | 11.16 / 42.22 / 1107 | 13.55 / 26.82 / 476
W16 | 7.90 / 56.36 / 240 | 10.24 / 43.86 / 403 | 7.58 / 54.60 / 518 | 9.74 / 47.35 / 304
W17 | 10.79 / 40.32 / 244 | 13.18 / 28.21 / 414 | 7.48 / 53.50 / 462 | 10.43 / 41.20 / 425
W18 | 6.86 / 60.56 / 328 | 12.98 / 29.82 / 425 | 8.93 / 52.24 / 670 | 11.42 / 40.27 / 694
W19 | 9.74 / 46.16 / 402 | 11.32 / 34.63 / 357 | 10.18 / 45.15 / 987 | 10.99 / 39.99 / 381
W20 | 10.15 / 34.17 / 179 | 10.23 / 41.84 / 371 | 9.99 / 41.26 / 658 | 11.93 / 26.95 / 483
W21 | 8.64 / 50.26 / 569 | 11.10 / 34.63 / 501 | 11.10 / 36.22 / 5170 | 11.67 / 33.53 / 635
W22 | 10.36 / 43.12 / 207 | 9.90 / 45.81 / 485 | 9.47 / 44.02 / 2140 | 11.67 / 30.06 / 358
W23 | 9.64 / 43.23 / 113 | 10.29 / 41.88 / 522 | 9.31 / 44.30 / 171 | 10.31 / 36.48 / 140
W24 | 13.17 / 21.91 / 614 | 10.28 / 40.50 / 527 | 10.27 / 40.57 / 1619 | 13.51 / 19.93 / 603
W25 | 10.07 / 50.33 / 304 | 9.87 / 43.50 / 488 | 8.97 / 51.33 / 633 | 11.07 / 41.37 / 408
W26 | 7.98 / 56.19 / 225 | 7.86 / 50.00 / 298 | 7.68 / 57.58 / 460 | 10.26 / 45.04 / 443
W27 | 8.76 / 49.90 / 336 | 8.84 / 41.82 / 342 | 7.04 / 58.11 / 388 | 9.66 / 45.03 / 370
W28 | 11.32 / 31.90 / 268 | 11.32 / 31.76 / 562 | 9.58 / 43.31 / 515 | 12.69 / 25.02 / 350
W29 | 12.08 / 27.99 / 211 | 11.28 / 36.53 / 462 | 9.35 / 44.20 / 456 | 12.55 / 22.17 / 356
W30 | 11.43 / 27.47 / 278 | 8.09 / 56.10 / 467 | 11.04 / 34.28 / 392 | 11.24 / 27.05 / 328
W31 | 10.45 / 39.31 / 305 | 8.79 / 46.06 / 446 | 8.29 / 53.77 / 1067 | 11.11 / 36.72 / 406
W32 | 7.94 / 55.03 / 248 | 9.19 / 40.50 / 694 | 8.13 / 52.45 / 1486 | 9.70 / 43.85 / 1543
W33 | 8.64 / 50.99 / 307 | 10.71 / 44.53 / 366 | 7.93 / 53.66 / 775 | 12.79 / 35.38 / 344
W34 | 9.20 / 48.86 / 202 | 10.06 / 46.79 / 455 | 9.45 / 49.53 / 743 | 13.17 / 36.34 / 199
W35 | 8.45 / 49.27 / 191 | 9.99 / 39.92 / 472 | 8.20 / 51.35 / 2175 | 11.12 / 37.73 / 1812
W36 | 7.45 / 58.25 / 266 | 6.69 / 59.17 / 388 | 8.41 / 50.82 / 899 | 9.77 / 43.78 / 351
W37 | 10.27 / 46.75 / 222 | 9.91 / 41.15 / 501 | 8.04 / 54.91 / 1002 | 11.28 / 39.32 / 621
W38 | 11.02 / 35.77 / 207 | 11.59 / 36.28 / 584 | 10.02 / 42.72 / 797 | 12.54 / 27.26 / 343
W39 | 11.44 / 38.79 / 269 | 7.86 / 54.24 / 550 | 9.84 / 45.85 / 805 | 13.95 / 19.90 / 331
W40 | 10.62 / 35.43 / 332 | 9.58 / 42.17 / 409 | 9.71 / 40.90 / 1104 | 14.88 / 11.25 / 221
W41 | 10.51 / 44.05 / 266 | 7.62 / 54.26 / 356 | 9.32 / 48.02 / 954 | 11.10 / 34.85 / 1022
W42 | 11.07 / 44.53 / 458 | 9.35 / 45.18 / 155 | 9.04 / 53.37 / 2291 | 11.35 / 37.16 / 2792
W43 | 10.61 / 43.49 / 353 | 7.74 / 55.85 / 335 | 8.78 / 53.93 / 625 | 12.40 / 34.65 / 378
W44 | 7.92 / 55.98 / 222 | 9.62 / 39.96 / 339 | 8.09 / 53.08 / 1405 | 14.43 / 16.29 / 365
W45 | 8.71 / 48.01 / 259 | 10.31 / 36.58 / 391 | 9.02 / 47.50 / 749 | 9.19 / 46.47 / 677
W46 | 10.94 / 29.05 / 214 | 10.72 / 35.64 / 350 | 11.52 / 37.48 / 1260 | 13.62 / 20.52 / 610
W47 | 9.82 / 47.68 / 314 | 11.44 / 33.51 / 436 | 8.46 / 52.97 / 1454 | 12.09 / 34.95 / 428
W48 | 9.44 / 42.08 / 240 | 10.15 / 42.42 / 385 | 9.34 / 50.15 / 1454 | 13.98 / 19.24 / 485
W49 | 6.34 / 61.81 / 210 | 9.54 / 51.02 / 450 | 9.28 / 45.54 / 744 | 11.16 / 35.22 / 662
W50 | 11.44 / 36.95 / 180 | 9.94 / 42.40 / 604 | 10.54 / 39.41 / 1294 | 11.28 / 36.13 / 474
W51 | 10.76 / 39.57 / 239 | 9.73 / 46.80 / 442 | 9.82 / 45.89 / 564 | 11.28 / 37.34 / 540
W52 | 8.66 / 52.05 / 253 | 11.07 / 39.83 / 418 | 10.68 / 40.98 / 110 | 14.48 / 24.01 / 387
W53 | 6.58 / 57.13 / 166 | 10.09 / 42.20 / 380 | 7.69 / 55.60 / 334 | 9.68 / 43.02 / 287
W54 | 9.87 / 47.54 / 224 | 8.67 / 51.23 / 333 | 9.65 / 46.25 / 2380 | 12.66 / 29.66 / 494
W55 | 7.92 / 52.57 / 261 | 9.88 / 36.97 / 382 | 9.84 / 45.22 / 867 | 14.37 / 18.19 / 582
W56 | 11.05 / 43.55 / 261 | 10.36 / 44.53 / 414 | 9.46 / 46.08 / 1770 | 11.45 / 38.48 / 797
W57 | 8.32 / 50.94 / 304 | 10.47 / 39.10 / 402 | 8.74 / 51.63 / 899 | 10.95 / 31.65 / 291
W58 | 7.89 / 56.28 / 340 | 10.70 / 41.63 / 491 | 8.01 / 58.11 / 566 | 10.83 / 40.04 / 348
W59 | 8.08 / 47.49 / 277 | 11.75 / 36.88 / 539 | 8.29 / 48.90 / 1085 | 10.81 / 33.20 / 733
W60 | 10.33 / 44.92 / 294 | 9.22 / 49.62 / 388 | 9.76 / 44.55 / 1659 | 11.69 / 31.57 / 742
Mean (SD) | 9.96 (1.52) / 45.04 (8.62) / 278 (88) | 10.60 (1.37) / 39.23 (7.45) / 749 (94) | 8.99 (1.10) / 49.00 (6.51) / 1040 (743) | 11.71 (1.47) / 33.72 (8.58) / 570 (406)
DIFFd (DIFF_SDe) | 1.54 (1.68) / 9.13 (8.90) / —f | 0.93 (2.06) / 4.94 (11.78) / — | 2.24 (0.98) / 13.09 (5.88) / — | −0.47 (1.33) / −2.20 (7.15) / —

aWSTF: Wiener Sachtextformel (4th Vienna Formula).

bFRE: Flesch reading ease.

cItalic font denotes that the target readability (WSTF≤8, FRE≥60) was reached.

dDIFF: difference.

eDIFF_SD: SD difference.

fNot applicable.

Microsoft Copilot achieved the highest average score of 49.0 (SD 6.51) on the readability metric FRE, while Le Chat came last with 33.72 (SD 8.58). ChatGPT-3.5 generated texts with, on average, the fewest words (278, SD 88 words), while Microsoft Copilot generated the texts with the most words (1040, SD 743 words), but still fewer than the original texts.

The GPT-based models (ChatGPT-3.5, ChatGPT-4o, and Microsoft Copilot) achieved an average improvement of 1.54 (SD 1.68), 0.93 (SD 2.06), and 2.24 (SD 0.98) grade levels, respectively, for the WSTF.

ChatGPT-3.5 reached the desired grade level of 8 for 20 texts; Microsoft Copilot reached this level for half of the texts (see Table 3 and Figure 1). Notably, the newer ChatGPT-4o achieved this for only 5 texts.

Figure 1. Distribution of calculated WSTF scores for GPT-3.5, GPT-4o, Microsoft Copilot, and Le Chat. The fifth column shows the distribution of the readability scores of the original texts. The dashed line indicates the recommended readability score of 8. WSTF: Wiener Sachtextformel (4th Vienna Formula).

Le Chat did not reach the eighth grade level (or lower) for any text. On the contrary, the average difference of −0.47 indicates that this LLM tends to decrease readability. This was also reflected in the statistical tests: for both the WSTF and FRE metrics, the hypotheses that the mean readability improved (H_WSTF|1 and H_FRE|1) could not be accepted.

The FRE scores of the rephrased texts improved for GPT-3.5, GPT-4o, and Microsoft Copilot by 9.13, 4.94, and 13.09, respectively (see Table 3 and Figure 2). However, the readability of most of the texts was still low, that is, scores below 60.

Figure 2. Distribution of calculated FRE scores for GPT-3.5, GPT-4o, Microsoft Copilot, and Le Chat. The fifth column shows the distribution of the readability scores of the original texts. The dashed line indicates the recommended readability score of 60. FRE: Flesch reading ease.

On average, Le Chat’s texts scored 2.2 FRE points lower than the original texts, in line with the evaluation of the WSTF metric.

The findings described above are also reflected in the results of the statistical tests: none of the tests for an improvement to the eighth grade level yielded a significant result, that is, the alternative hypotheses could not be accepted. However, except for the Le Chat model, it could be shown that the mean readability improved significantly, that is, the alternative hypotheses could be accepted. In a nutshell, 3 of the 4 LLMs achieved a statistically significant readability improvement, yet it was not large enough to reach the eighth grade level.

Accuracy of the Rephrased Health Information Texts

All LLM answers were screened independently by 3 reviewers. Fleiss κ was 0.264, and the percent agreement was 54.6%. This relatively low agreement reflects the difficulty of evaluating medical content without deep domain-level expertise; ideally, assessments would involve medical doctors, and the reliability of the evaluation is further complicated by uncertainty regarding the correctness of the original websites.

Although not a systematic assessment, several obvious mistakes and LLM hallucinations were discovered: Microsoft Copilot shortened the information about endometrial cancer (W29) to “endometrial cancer is the most common cancer among women in Germany” (all the following examples are translated versions of the original German health information texts and rephrased LLM answers). From an epidemiological perspective, this claim is incorrect, as breast cancer is the most prevalent type of cancer among women; this constitutes a nonnegligible change of meaning in the rephrased text.

The original text about myocarditis (W49) included the sentence “Myocarditis is also considered to be an important cause of sudden cardiac death in athletes,” which is difficult for readers to understand and may lead to misinterpretations. This kind of sudden cardiac death occurs during exercise, training, or a match; this information is not given in the sentence and is only hinted at by the word “athlete.” The rephrased sentence carries the same ambiguity and even increases it: “When athletes suddenly die, it is often due to inflammation of the heart muscle.” The context of the sudden death is omitted.

Missing context was also noticeable when verbatim speech and statements by medical experts were included in the original texts. The selected LLMs reduced these statements to plain text, thereby omitting the source of the information. For example, the article about myocarditis (W49) included an expert statement as follows: “You should always go to the doctor if you notice symptoms that you are not aware of, says Dr. Milan Dinic, a cardiologist in private practice from Munich.” “Particularly in women, any new complaint between the tip of the nose and the navel is usually heart related. You should therefore definitely think about your heart.”

ChatGPT-3.5 rephrased this to “You should always see a doctor if you notice any new symptoms. In women in particular, many symptoms can indicate heart disease.”


Principal Results

The original medical texts extracted from health information websites are, on average, difficult (for the FRE metric) or fairly difficult (for the WSTF) to read. This means that the original texts use complicated sentence structures and/or complex specialist terminology. Our study showed that LLMs can help improve the readability, especially for the models ChatGPT-3.5 and Microsoft Copilot.

ChatGPT-3.5 and Microsoft Copilot were able to simplify and shorten the texts. However, the accuracy of the content must be checked by medical experts to make sure that no ambiguous or false statements were introduced. It is well known that LLMs tend to hallucinate [36,64] or “escalate the minor biases that could occur in the data bank with which it gets trained” [35]. Nevertheless, the authors postulate that the process of “fact-checking” an automatically generated text is more time efficient than manually rewriting medical texts for laypersons. Specialized LLMs or LLMs fine-tuned for medical texts could also be a possible solution to increase the correctness and reliability of generated outputs [35] and thus make this text generation process even more time- and cost-efficient.

The authors found that LLMs moderately increased readability. This is in line with the research by Li et al [32]: for radiology reports, ChatGPT produced texts that improved the FRE by 45.5 points.

In our analyses, the FRE improvements were 9.13 (ChatGPT-3.5), 4.94 (ChatGPT-4o), and 13.09 (Microsoft Copilot). This might indicate that the rephrasing of texts works better for texts originally written in English. In addition, Srinivasan et al [34] report FRE improvements in a similar range for GPT-3.5 (16.07) and GPT-4 (5.4).

Limitations

As the aims of the study were to reflect the experience of an average layperson seeking health information online, no advanced prompt optimization techniques were investigated. While more robust prompts might yield different results, the authors consider it unlikely that nonexpert users would engage in systematic prompt tuning. In addition, reproducibility is hindered by the fact that laypersons will not experiment with LLM model parameters such as temperature. Moreover, tuning model parameters over the chat interfaces is not possible in all cases and requires API access. In this context, the authors assume that a high fraction of laypersons do not have the necessary technical background to experiment with LLM APIs and related programming languages.

Additionally, the exact model versions of the LLMs used in this study are no longer publicly available. Hence, as in most LLM-based studies, both the selected LLMs and the examined website texts are snapshots in time. The LLM field is evolving rapidly, and reproducibility of the results is difficult.

Another aspect is that the texts taken from the websites may also change over time. The appearance and formatting of the individual web pages were deliberately not considered in this work: Only raw text material was extracted. However, aesthetic and design features or educational multimedia can influence the understandability of information material.

No dedicated German LLM was used in this study. It would be interesting to replicate this study with a fine-tuned German LLM. In 2024, the LLM community had a strong focus on English training data and models, and performance is lower for other languages [36]. Heilmeyer et al [18] noted that specific, on-premise trained German models performed better. However, typical patients or citizens seeking health information will have neither the technical skills and knowledge nor the specialized hardware to do this on their own.

The systematic evaluation of the (medical) accuracy of rephrased PEMs was beyond the study’s scope, but future interdisciplinary research involving medical experts could address this. Moreover, a follow-up study could more deeply investigate the readability and correctness from a technical point of view by using APIs instead of relying on publicly available chat interfaces. In this context, more recent LLMs could be benchmarked with the same quality-controlled set of text material in an end-to-end evaluation pipeline.

Comparison With Prior Work

When LLMs were used to answer patient-centric questions about hip arthroplasty, Mika et al [31] reported that patients would be able to understand the answers. However, they did not calculate a readability metric for the given answers and instead relied on a “Response Rating System.” In contrast, the results of Eng et al [29] confirm the low readability of answers to patient-centric questions.

Compared to previous works [29,31,32,34,65], this study covered a broader spectrum of medical domains: cancer, cardiovascular conditions, public health topics, etc.

Similar improvements in terms of readability were found by Ovelman et al [66]: They used Claude 2 LLM to create plain language summaries of 10 evidence reviews. The covered topics range from vaccines, prehospital airway management, and malnutrition in hospitalized adults to breast irradiation for breast cancer. For half of their texts, the recommended sixth to eighth grade reading level was met by the generated summaries.

Lyu et al [65] did not measure the quality of the rephrased reports with readability scores but had them evaluated by experts. In addition, they did not consider the effect of prompt engineering to be high: “All of the five further-modified prompts were found to produce results similar to those of the original prompt and far worse than those of the optimized prompt”.

This study differs from the previously presented evaluations. Here, only German health information texts were rephrased by LLMs and their readability evaluated.

Innovation

Citizens and patients have been using the Internet for health information seeking for almost two decades. Today, they increasingly consult LLMs in everyday situations: for answers to specific medical questions or for explanations of complex medical texts. This study investigates whether and how LLMs improve the readability of German online medical texts. To the authors’ knowledge, this is the first evaluation of readability metrics for German LLM-rephrased text and original medical text.

Shifting from the perspective of citizens and patients to health professionals or institutions: The use of an LLM could be a time-saving and cost-effective tool to fine-tune their information leaflets, online texts, etc to meet different information needs. The study showed that LLMs are already able to moderately improve readability.

Conclusions

The use of LLMs can improve the readability of PEMs in the German language but requires careful expert review to ensure accuracy and completeness of medical information. The improvement is rather moderate, averaging 2‐3 school grades (for the WSTF). Still, this improvement can support patients reading PEMs online.

The selection of the LLM seemed critical to achieving good results, whereas the prompt seemed to be less of an influencing factor.

Some rephrased texts conveyed incorrect messages or took statements out of context. This is a serious risk, especially for medical texts. Therefore, a manual check is still needed and advised when using LLMs in similar scenarios.

Data Availability

The data of this study are available upon reasonable request.

Authors' Contributions

Conceptualization: MP

Data curation: AM

Formal analysis: AM

Investigation: AM

Methodology: AM, RZ, MW

Supervision: MP

Validation: RZ

Visualization: MP

Writing – original draft: MP

Writing – review & editing: AM, RZ, MW

Conflicts of Interest

None declared.

Multimedia Appendix 1

List of content providers and websites.

PDF File, 61 KB

  1. Jacobs W, Amuta AO, Jeon KC. Health information seeking in the digital age: an analysis of health information seeking behavior among US adults. Cogent Soc Sci. Jan 1, 2017;3(1):1302785. [CrossRef]
  2. Alpay L, Verhoef J, Xie B, Te’eni D, Zwetsloot-Schonk JHM. Current challenge in consumer health informatics: bridging the gap between access to information and information understanding. Biomed Inform Insights. Jan 1, 2009;2(1):1-10. [CrossRef] [Medline]
  3. Zowalla R, Pobiruchin M, Wiesner M. Analyzing the readability of health information booklets on cardiovascular diseases. Stud Health Technol Inform. 2018;253:16-20. [Medline]
  4. Basch CH, Fera J, Garcia P. Readability of influenza information online: implications for consumer health. Am J Infect Control. Nov 2019;47(11):1298-1301. [CrossRef] [Medline]
  5. Yeung AWK, Wochele-Thoma T, Eibensteiner F, et al. Official websites providing information on COVID-19 vaccination: readability and content analysis. JMIR Public Health Surveill. Mar 15, 2022;8(3):e34003. [CrossRef] [Medline]
  6. Silva MJ, Santos P. The impact of health literacy on knowledge and attitudes towards preventive strategies against COVID-19: a cross-sectional study. Int J Environ Res Public Health. May 19, 2021;18(10):5421. [CrossRef] [Medline]
  7. McCaffery KJ, Dodd RH, Cvejic E, et al. Health literacy and disparities in COVID-19-related knowledge, attitudes, beliefs and behaviours in Australia. Public Health Res Pract. Dec 9, 2020;30(4):30342012. [CrossRef] [Medline]
  8. Zowalla R, Pfeifer D, Wetter T. Readability and topics of the German health web: exploratory study and text analysis. PLOS ONE. 2023;18(2):e0281582. [CrossRef] [Medline]
  9. Rooney MK, Santiago G, Perni S, et al. Readability of patient education materials from high-impact medical journals: a 20-year analysis. J Patient Exp. 2021;8:2374373521998847. [CrossRef] [Medline]
  10. Gordejeva J, Zowalla R, Pobiruchin M, Wiesner M. Readability of English, German, and Russian disease-related Wikipedia pages: automated computational analysis. J Med Internet Res. May 16, 2022;24(5):e36835. [CrossRef] [Medline]
  11. Neves MP, De Almeida AB. Before and beyond artificial intelligence: opportunities and challenges. In: Sousa Antunes H, Freitas PM, Oliveira AL, Martins Pereira C, Vaz de Sequeira E, editors. Multidisciplinary Perspectives on Artificial Intelligence and the Law. Springer International Publishing; 2024:107-125. [CrossRef] ISBN: 978-3-031-41263-9
  12. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature New Biol. Aug 3, 2023;620:172-180. [CrossRef] [Medline]
  13. Yang R, Tan TF, Lu W, Thirunavukarasu AJ, Ting DSW, Liu N. Large language models in health care: development, applications, and challenges. Health Care Sci. Aug 2023;2(4):255-263. [CrossRef] [Medline]
  14. Lois A, Yates R, Ivy M, et al. Accuracy of natural language processors for patients seeking inguinal hernia information. Surg Endosc. Dec 2024;38(12):7409-7415. [CrossRef] [Medline]
  15. Denecke K, May R, Rivera Romero O, LLMHealthGroup. Potential of large language models in health care: Delphi study. J Med Internet Res. May 13, 2024;26:e52399. [CrossRef] [Medline]
  16. Spotnitz M, Idnay B, Gordon ER, et al. A survey of clinicians’ views of the utility of large language models. Appl Clin Inform. Mar 2024;15(2):306-312. [CrossRef] [Medline]
  17. Tepe M, Emekli E. Decoding medical jargon: the use of AI language models (ChatGPT-4, BARD, Microsoft Copilot) in radiology reports. Patient Educ Couns. Sep 2024;126:108307. [CrossRef] [Medline]
  18. Heilmeyer F, Böhringer D, Reinhard T, Arens S, Lyssenko L, Haverkamp C. Viability of open large language models for clinical documentation in German health care: real-world model evaluation study. JMIR Med Inform. Aug 28, 2024;12:e59617. [CrossRef] [Medline]
  19. Chow JCL, Sanders L, Li K. Impact of ChatGPT on medical chatbots as a disruptive technology. Front Artif Intell. 2023;6:1166014. [CrossRef] [Medline]
  20. Swisher AR, Wu AW, Liu GC, Lee MK, Carle TR, Tang DM. Enhancing health literacy: evaluating the readability of patient handouts revised by ChatGPT’s large language model. Otolaryngol Head Neck Surg. Dec 2024;171(6):1751-1757. [CrossRef] [Medline]
  21. Behers BJ, Vargas IA, Behers BM, et al. Assessing the readability of patient education materials on cardiac catheterization from artificial intelligence chatbots: an observational cross-sectional study. Cureus. Jul 2024;16(7):e63865. [CrossRef] [Medline]
  22. Pompili D, Richa Y, Collins P, Richards H, Hennessey DB. Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models. World J Urol. Jul 29, 2024;42(1):455. [CrossRef] [Medline]
  23. Burns C, Bakaj A, Berishaj A, Hristidis V, Deak P, Equils O. Use of generative AI for improving health literacy in reproductive health: case study. JMIR Form Res. Aug 6, 2024;8:e59434. [CrossRef] [Medline]
  24. Roster K, Kann RB, Farabi B, Gronbeck C, Brownstone N, Lipner SR. Readability and health literacy scores for ChatGPT-generated dermatology public education materials: cross-sectional analysis of sunscreen and Melanoma questions. JMIR Dermatol. Mar 6, 2024;7:e50163. [CrossRef] [Medline]
  25. Rouhi AD, Ghanem YK, Yolchieva L, et al. Can artificial intelligence improve the readability of patient education materials on aortic stenosis? A pilot study. Cardiol Ther. Mar 2024;13(1):137-147. [CrossRef] [Medline]
  26. Shah YB, Ghosh A, Hochberg AR, et al. Comparison of ChatGPT and traditional patient education materials for men’s health. Urol Pract. Jan 2024;11(1):87-94. [CrossRef] [Medline]
  27. Halawani A, Almehmadi SG, Alhubaishy BA, Alnefaie ZA, Hasan MN. Empowering patients: how accurate and readable are large language models in renal cancer education. Front Oncol. 2024;14:1457516. [CrossRef] [Medline]
  28. Guerra GA, Grove S, Le J, et al. Artificial intelligence as a modality to enhance the readability of neurosurgical literature for patients. J Neurosurg. Apr 1, 2025;142(4):1189-1195. [CrossRef] [Medline]
  29. Eng E, Mowers C, Sachdev D, et al. Chat generative pre-trained transformer (ChatGPT) – 3.5 responses require advanced readability for the general population and may not effectively supplement patient-related information provided by the treating surgeon regarding common questions about rotator cuff repair. Arthroscopy. Jan 2025;41(1):42-52. [CrossRef] [Medline]
  30. Charnock D, Shepperd S, Needham G, Gann R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health. Feb 1999;53(2):105-111. [CrossRef] [Medline]
  31. Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am. Oct 4, 2023;105(19):1519-1526. [CrossRef] [Medline]
  32. Li H, Moon JT, Iyer D, et al. Decoding radiology reports: potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports. Clin Imaging. Sep 2023;101:137-141. [CrossRef] [Medline]
  33. Moons P, Van Bulck L. Using ChatGPT and Google Bard to improve the readability of written patient information: a proof of concept. Eur J Cardiovasc Nurs. Mar 12, 2024;23(2):122-126. [CrossRef] [Medline]
  34. Srinivasan N, Samaan JS, Rajeev ND, Kanu MU, Yeo YH, Samakar K. Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources. Surg Endosc. May 2024;38(5):2522-2532. [CrossRef] [Medline]
  35. Pal S, Bhattacharya M, Lee SS, Chakraborty C. A domain-specific next-generation large language model (LLM) or ChatGPT is required for biomedical engineering and research. Ann Biomed Eng. Mar 2024;52(3):451-454. [CrossRef] [Medline]
  36. Alonso I, Oronoz M, Agerri R. MedExpQA: multilingual benchmarking of large language models for medical question answering. Artif Intell Med. Sep 2024;155:102938. [CrossRef] [Medline]
  37. Weiss BD. Health Literacy - A Manual for Clinicians. American Medical Association Foundation and American Medical Association; 2003. URL: http://lib.ncfh.org/pdfs/6617.pdf [Accessed 2024-11-08] ISBN: 1-57947-502-7
  38. How to write easy-to-read health materials. US National Library of Medicine. 2017. URL: https://webcitation.org/6zBeCFhPU [Accessed 2024-11-08]
  39. Flesch R. A new readability yardstick. J Appl Psychol. Jun 1948;32(3):221-233. [CrossRef] [Medline]
  40. Amstad T. Wie Verständlich Sind Unsere Zeitungen. Studenten-Schreib-Service; 1978. URL: https:/​/books.​google.co.in/​books/​about/​Wie_verst%C3%A4ndlich_sind_unsere_Zeitungen.​html?id=kiI7vwEACAAJ&redir_esc=y [Accessed 2025-12-20]
  41. Bamberger R, Vanecek E. Lesen - Verstehen - Lernen - Schreiben: Die Schwierigkeitsstufen von Texten in Deutscher Sprache [Book in German]. Jugend und Volk; 1984. URL: https:/​/search.​worldcat.org/​fr/​title/​lesen-verstehen-lernen-schreiben-die-schwierigkeitsstufen-von-texten-in-deutscher-sprache/​oclc/​12137245 [Accessed 2025-12-20] ISBN: 9783224152502
  42. Wiesner M, Zowalla R, Pobiruchin M. The difficulty of German information booklets on psoriasis and psoriatic arthritis: automated readability and vocabulary analysis. JMIR Dermatol. 3(1):e16095. [CrossRef]
  43. About GitHub. GitHub. 2024. URL: https://github.com/about [Accessed 2024-11-07]
  44. Hugging Face. 2024. URL: https://huggingface.co [Accessed 2025-12-20]
  45. GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. OpenAI. 2024. URL: https://openai.com/index/gpt-4/ [Accessed 2024-10-21]
  46. Gemini. 2024. URL: https://gemini.google.com [Accessed 2024-10-21]
  47. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019:4171-4186. URL: http://aclweb.org/anthology/N19-1 [CrossRef]
  48. Llama 2: open source, free for research and commercial use. Llama. 2024. URL: https://www.llama.com/llama2 [Accessed 2024-10-21]
  49. Claude 2. Anthropic. Jul 11, 2023. URL: https://www.anthropic.com/news/claude-2 [Accessed 2024-10-21]
  50. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. Jan 2020;21(1):5485-5551. URL: https://dl.acm.org/doi/10.5555/3455716.3455856 [Accessed 2025-12-20]
  51. BigScience large open-science open-access multilingual language model. Hugging Face. 2024. URL: https://huggingface.co/bigscience/bloom [Accessed 2024-10-21]
  52. Copilot. Microsoft. 2024. URL: https://www.microsoft.com/en-US/microsoft-copilot/personal-ai-assistant [Accessed 2024-10-21]
  53. Falcon LLM. 2024. URL: https://falconllm.tii.ae/ [Accessed 2024-10-21]
  54. Mistral AI. 2024. URL: https://mistral.ai/ [Accessed 2024-10-21]
  55. Uhlig M. DRXD1000/Phoenix-7B. Hugging Face. 2024. URL: https://huggingface.co/DRXD1000/Phoenix [Accessed 2024-10-21]
  56. Plüster B, Schuhmann C. LeoLM/leo-hessianai-13b. Hugging Face. 2024. URL: https://huggingface.co/LeoLM/leo-hessianai-13b [Accessed 2024-10-21]
  57. Han T, Adams LC, Papaioannou JM, et al. MedAlpaca–an open-source collection of medical conversational AI models and training data. ArXiv. Preprint posted online on Apr 14, 2023. [CrossRef]
  58. stanford-crfm/BioMedLM. Hugging Face. 2024. URL: https://huggingface.co/stanford-crfm/BioMedLM [Accessed 2024-10-21]
  59. Petroni F, Rocktäschel T, Riedel S, et al. Language models as knowledge bases? In: Wan X, Jiang J, Ng V, Wan X, editors. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:2463-2473. URL: https://www.aclweb.org/anthology/D19-1 [CrossRef]
  60. Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Presented at: NIPS’22: Proceedings of the 36th International Conference on Neural Information Processing Systems; Nov 28 to Dec 9, 2022:24824-24837; New Orleans, LA. URL: https://dl.acm.org/doi/10.5555/3600270.3602070 [Accessed 2025-12-20]
  61. Sun X, Li X, Li J, et al. Text classification via large language models. In: Bouamor H, Pino J, Bali K, editors. Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics; 2023:8990-9005. URL: https://aclanthology.org/2023.findings-emnlp [CrossRef]
  62. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65-70. [CrossRef]
  63. Faul F, Erdfelder E, Buchner A, Lang AG. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behav Res Methods. Nov 2009;41(4):1149-1160. [CrossRef] [Medline]
  64. Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. Dec 31, 2023;55(12):1-38. [CrossRef]
  65. Lyu Q, Tan J, Zapadka ME, et al. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Vis Comput Ind Biomed Art. May 18, 2023;6(1):9. [CrossRef] [Medline]
  66. Ovelman C, Kugley S, Gartlehner G, Viswanathan M. The use of a large language model to create plain language summaries of evidence reviews in healthcare: a feasibility study. Cochrane Evid Synth Methods. Feb 2024;2(2):e12041. [CrossRef] [Medline]


AI: artificial intelligence
API: application programming interface
FRE: Flesch reading ease
LLM: large language model
PEM: patient education material
WSTF: Wiener Sachtextformel (4th Vienna Formula)


Edited by Hongfang Liu; submitted 08.May.2025; peer-reviewed by Barrie Robison, Namra Bhadreshkumar Shah; final revised version received 30.Oct.2025; accepted 26.Nov.2025; published 23.Jan.2026.

Copyright

© Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin. Originally published in JMIR AI (https://ai.jmir.org), 23.Jan.2026.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.