JMIR AI

2817-1705

JMIR Publications

Toronto, Canada

v5i1e73481

10.2196/73481

Review

Natural Language Processing of Clinical Notes for Cancer Research and Patient Care Prior to Widespread Adoption of Generative AI: Scoping Review

Alfred B Kayira^1*, MSc, MPH, MRES

Hadeel R A Elyazori^2*, MSc

Kevin Lybarger^2, PhD

Fiona M Walter^1, MA, MD

Claude Chelala^3, PhD

Garth Funston^1, MB BChir, PhD

^1Centre for Cancer Screening, Prevention, and Early Diagnosis, Wolfson Institute of Population Health, Queen Mary University of London, Charterhouse Square, London, United Kingdom

^2Department of Information Sciences and Technology, College of Engineering & Computing, George Mason University, Fairfax, United States

^3Barts Cancer Institute, Queen Mary University of London, London, United Kingdom

Andrew Coristine

Álvaro García-Barragán

Dillon Chrimes

Kola Adegoke

Correspondence to Alfred B Kayira, MSc, MPH, MRES, Centre for Cancer Screening, Prevention, and Early Diagnosis, Wolfson Institute of Population Health, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom, 44 7415302686; a.b.kayira@qmul.ac.uk

*These authors contributed equally.

2026

14-5-2026

e73481

05-03-2025; 15-02-2026; 16-02-2026

2026

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.

Background

Clinical notes are the most abundant data type within electronic health records; however, their highly unstructured format presents significant challenges for supervised natural language processing (NLP) methods. The NLP community is increasingly adapting large language models to analyze clinical notes, achieving strong performance and generalizability with minimal task-specific fine-tuning. We conducted a scoping review of NLP methods applied to clinical notes prior to widespread adoption of generative artificial intelligence (AI) to establish a pre–large language model methodological baseline, showcase potential clinical utility, and highlight key challenges and limitations of extractive, supervised techniques that generative AI approaches may need to overcome.

Objective

This review aimed (1) to characterize the clinical notes used, (2) to identify NLP techniques used to analyze these notes, (3) to determine the clinical applications of NLP in cancer research and patient care, and (4) to highlight challenges and limitations of traditional pregenerative AI methods.

Methods

We systematically searched MEDLINE, Embase, Scopus, and Web of Science for English-language studies published from January 1, 2014, to March 8, 2024. Retrieved references were imported into Covidence, a web-based platform that streamlines management of reviews. Two authors (ABK and HRAE) independently screened studies for eligibility and extracted data using a predefined data extraction template.

Results

A total of 226 studies were included in the review. Research using NLP to derive insights from clinical notes grew significantly, from 4 studies in 2014 to 43 in 2023. NLP methods have evolved from predominantly rule-based and ontology-driven approaches (2014-2017) to hybrid approaches that combine these with deep neural models such as Bidirectional Encoder Representations from Transformers (2018-2024). Most studies (161/226, 71.2%) developed their systems using small, single-institution datasets. Supervised learning approaches with manually annotated corpora were predominant (181/226, 80.1%). Most studies (174/226, 77%) focused on information extraction, with a few applying the extracted data to downstream tasks such as diagnostic and prognostic classification. Clinical domain pretrained models outperformed general domain pretrained models in the majority (11/16, 68.8%) of studies that evaluated multiple model types. In total, 25 studies compared their NLP-based systems with current practice in their respective clinical settings and reported potential benefits, including improved data coverage and completeness, faster information extraction, and improved classification or prediction accuracy. No studies evaluated the utility or impact of their systems in real-world clinical practice. The most common challenges reported by authors were restricted access to clinical notes (n=39) and limited data (n=18).

Conclusions

The application of NLP to clinical notes in oncology has expanded, but most studies focus on information extraction rather than downstream clinical tasks. Oncology NLP has the potential to advance cancer research and patient care, but barriers remain to robust evaluation and clinical deployment of promising tools. Emerging generative AI approaches will need to overcome these challenges to deliver real-world impact.

natural language processing; clinical notes; electronic health records; clinical NLP challenges; cancer; scoping review

Introduction

Background

Cancer is a major cause of morbidity and mortality globally, with 19.3 million new cases and 10 million deaths reported in 2020 [1]. Incidence is projected to rise by 55% by 2040 due to population growth and aging [2]. Research leveraging real-world data is important to support prevention, early detection, and optimized treatment, and ultimately improve patient outcomes, including survival. Electronic health records (EHRs), digital profiles of patient histories created and managed by health care institutions, provide a valuable real-world data resource for cancer research and for improving patient care.

While EHR systems have become increasingly available [3], only a small portion consists of structured data (eg, clinical codes, vital signs, clinical and laboratory measurements, and demographics) that can be easily extracted and analyzed using conventional statistical and machine learning methods. Most data (80%) exist in unstructured forms, including clinical notes, diagnostic reports (eg, pathology and radiology), and images [4], limiting usability [5]. Natural language processing (NLP)—a subfield of artificial intelligence (AI) that enables computers to understand, interpret, and generate human language—offers a promising approach to unlock insights from unstructured clinical narratives such as clinical notes and diagnostic reports, enabling their use in research and patient care.

While both diagnostic reports and clinical notes contain valuable information, they differ in complexity for NLP. Diagnostic reports are typically formal and standardized, making them relatively straightforward to process. In contrast, clinical notes are highly diverse due to variations in recording practices across clinicians and health care institutions [6]. They often feature incomplete sentences, poor punctuation, nonstandard abbreviations, shorthand, ambiguous terms, and spelling errors. These characteristics pose significant challenges for NLP, even with advanced methodological approaches such as pretrained language models (PLMs), for example, Bidirectional Encoder Representations from Transformers (BERT) [7-9], which have dominated the general NLP domain since the introduction of the BERT model in 2018 [10].

However, recent advances in generative AI are reshaping the field of clinical NLP. Large language models (LLMs)—a subset of PLMs designed for generative tasks (eg, OpenAI’s GPT [11] and Meta’s LLaMA [12])—are transforming clinical NLP by enabling broader generalization with minimal task-specific fine-tuning. LLMs (GPT-4, Gemma3-27B, and DeepSeek-14B), applied using prompt engineering or task-specific fine-tuning, have demonstrated strong performance in extracting treatment histories [13], social and behavioral determinants of health (employment, housing, marital status, alcohol use, tobacco use, and drug use) [14], and neurofibromatosis type 1–relevant phenotypes [15] from clinical notes. Recent review studies highlight increasing interest in the use of LLMs with prompt-based strategies, including zero-shot and few-shot prompting, for information extraction (IE) [16,17], as well as for tasks such as information summarization, translation, and clinical communication [18].

Given their strong early performance, which has generated considerable interest within the NLP community, LLMs may emerge as a dominant approach, potentially replacing traditional supervised deep learning methods (eg, recurrent neural networks [RNNs], convolutional neural networks [CNNs], and BERT-based models). To better understand the value that LLMs add beyond established NLP approaches, we conducted a scoping review of NLP methods applied to cancer clinical notes prior to the widespread use of generative AI, providing a comprehensive overview of pre-LLM methods, their potential clinical utility, and the limitations and challenges likely to extend to generative AI.

Several reviews have examined the application of NLP to clinical notes before the adoption of LLMs; however, none have specifically focused on clinical notes as the primary text. Prior reviews have included clinical notes only as a subset of broader document categories. Only 35% (43/123), 22% (5/23), and 12% (2/17) of studies included in Wang et al [19], Li et al [20], and Gholipour et al [21], respectively, used clinical notes, often alongside other medical documents (eg, radiology and pathology reports). Sangariyavanich et al [22] included 17 studies but did not specify the proportion or extent of clinical note use. Furthermore, each of these reviews focused on a single NLP task, for example, IE [19,21], diagnostic classification [20], or prognostic classification [22]. Broader reviews by Wang et al [23], Sim et al [24], and Sheikhalishahi et al [25] covered studies that included substantial volumes of clinical notes but were not cancer-specific, limiting their utility to the cancer domain. Additionally, these reviews only included studies published up to 2020, predating the widespread adoption of BERT-based PLMs. Notably, in Sheikhalishahi et al [25], only 3 of the 106 studies used deep learning approaches.

Objectives

This review provides a comprehensive synthesis of NLP applications to clinical notes in cancer research prior to widespread experimentation with LLMs. Unlike prior reviews that included studies based solely on structured diagnostic reports, we restricted inclusion to studies involving clinical notes (exclusively or in combination with diagnostic reports or other documents), so our findings more closely reflect the distinctive challenges (including acquisition) and methodological choices associated with this particularly complex text. We also diverge from earlier reviews by imposing no restrictions on the NLP task, allowing a broader characterization of cancer-related use cases beyond conventional diagnostic or prognostic classification.

By systematically analyzing pregenerative AI methodologies, this review provides important benchmarks for assessing the real “value add” of LLMs, highlights the limitations of extractive, supervised approaches, and anticipates challenges that may need to be overcome. Specifically, our objectives are (1) to characterize the clinical notes used in NLP studies, including their sources and properties; (2) to identify NLP techniques (including annotation methods) used to analyze these notes and examine how these methodologies have evolved over time; (3) to determine the clinical applications of NLP in cancer research and patient care, including reported clinical impact; and (4) to highlight the challenges encountered by researchers in the field.

Methods

This review follows the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) [26].

Working Definitions

We broadly defined NLP as the application of computational techniques to process and analyze unstructured clinical text. This encompasses a diverse range of methods, including domain-specific dictionaries, medical ontologies (eg, Unified Medical Language System [UMLS]), ontology-based tools (eg, MetaMap and Clinical Text Analysis and Knowledge Extraction System), handcrafted rules or search strings, rule-based tools (eg, ConText and NegEx), classical machine learning models (eg, support vector machine), neural networks (eg, RNN), PLMs (eg, BERT), and LLMs (a subset of PLMs distinguished by their larger parameter scale and enhanced capacity for broad generalization with minimal task-specific fine-tuning [eg, GPT]).
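To make the "handcrafted rules" end of this definition concrete, the following is a minimal sketch of a rule-based negation check written with regular expressions. It is loosely inspired by tools such as NegEx but is not the NegEx algorithm: the trigger list, the scope-breaking words, the 5-token window, and the function name `is_negated` are all hypothetical choices made for illustration.

```python
import re

# Hypothetical trigger phrases; curated tools ship far larger lists.
NEGATION_TRIGGERS = ["no evidence of", "negative for", "denies", "without", "no"]
# Hypothetical words that end the scope of a negation in this toy version.
SCOPE_BREAKERS = {"but", "however", "although"}


def is_negated(sentence: str, concept: str, window: int = 5) -> bool:
    """Return True if `concept` occurs within `window` tokens after a
    negation trigger, with no scope-breaking word in between."""
    text = sentence.lower()
    concept = concept.lower()
    # Longest triggers first so "no evidence of" wins over "no".
    ordered = sorted(NEGATION_TRIGGERS, key=len, reverse=True)
    trigger_re = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")
    for trigger in trigger_re.finditer(text):
        following = re.findall(r"\w+", text[trigger.end():])[:window]
        scope = []
        for token in following:
            if token in SCOPE_BREAKERS:
                break  # negation does not carry past "but", "however", ...
            scope.append(token)
        if re.search(r"\b" + re.escape(concept) + r"\b", " ".join(scope)):
            return True
    return False


print(is_negated("No evidence of metastasis in the liver.", "metastasis"))  # True
print(is_negated("Patient denies pain but reports fatigue.", "fatigue"))    # False
```

A sketch like this shows why such rules are cheap to write yet brittle: every new phrasing, abbreviation, or misspelling in a clinical note needs another trigger or exception.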

Clinical notes were defined as free-text narratives written by health care providers during patient encounters, documenting patient symptoms and signs, investigations, diagnoses, treatment, or treatment plans. They detail a patient’s social and medical history, disease progression, and outcomes. They are distinguished from diagnostic reports in that the latter provide results of diagnostic investigations or imaging studies and are often objective and structured. Clinical notes may, however, contain descriptions and interpretations of diagnostic results from these reports.

Search Strategy and Information Sources

We developed a three-concept search criterion covering (1) NLP, (2) EHR or electronic medical record, and (3) cancer or oncology. Predetermined key terms relating to these concepts were used to search MEDLINE through PubMed. These were further expanded by scanning the titles and abstracts of retrieved records. To avoid missing studies in which clinical notes were only one of several document types and therefore not mentioned in the title or abstract, we intentionally kept the EHR or electronic medical record concept broad. The final search criteria for all 4 databases are provided in Multimedia Appendix 1.

We searched MEDLINE (via PubMed), Embase, Web of Science, and Scopus for primary studies that applied NLP to process and analyze clinical notes to generate actionable information for cancer research or patient care. For PubMed, Embase, and Web of Science, we searched across all available fields. In Scopus, the search was limited to the title, abstract, and keywords fields. We used a mix of MeSH term mappings and exact phrase or term searching to balance the sensitivity and precision of the search. All searches were restricted to English-language publications from January 1, 2014, to March 8, 2024.

Inclusion Criteria

We included peer-reviewed journal papers and conference papers that (1) applied NLP to clinical notes—either exclusively or in combination with other medical documents (eg, pathology, radiology, colonoscopy, or other imaging reports); (2) focused on any part of the cancer care continuum, including screening, diagnosis, staging, treatment, surveillance, outcomes assessment, and risk factor identification or risk stratification; and (3) were conducted in any clinical setting (eg, primary care, outpatient clinics, emergency departments, and hospitals).

Exclusion Criteria

We excluded studies that used non-EHR documents (eg, patient-authored text in online health communities), studies using translated text (eg, from one language to English before applying NLP methods), reviews, editorials, commentaries, abstracts, letters, retracted papers, and veterinary studies.

Study Selection

Study screening (title or abstract and full text) was completed in Covidence (Veritas Health Innovation Ltd), a web-based collaboration software platform that streamlines the production of systematic and other literature reviews. References identified through database searches were imported into Covidence, and duplicates were automatically removed.

Two authors (ABK and HRAE) independently assessed the papers for eligibility based on the title and abstract. Proportionate agreement (the proportion of times that reviewers agree on their assessments) was 96%. Class-specific agreement was 56.2% for the positive (include) class and 97.9% for the negative (exclude) class. Cohen κ, which measures the agreement between 2 reviewers (ABK and HRAE) adjusting for the possibility of agreement occurring by chance, was 0.54. Full-text papers were retrieved for studies that passed the title-abstract screening, and the same authors assessed the full texts for eligibility. Proportionate agreement was 81.5%. Class-specific agreement was 86.3% for the positive (include) class and 71.8% for the negative (exclude) class. Cohen κ was 0.58. At both stages, discrepancies were discussed and resolved through consensus, with reference to the predefined inclusion or exclusion criteria and the operational definitions of key concepts (NLP, clinical notes, and cancer or oncology). When consensus could not be reached, another author (GF or KL) adjudicated.
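The agreement statistics above can be computed from a 2×2 table of reviewer decisions. The sketch below is illustrative only: the counts are hypothetical (the review does not report the underlying table), and the class-specific formula shown, 2a/(2a+b+c), is an assumption about how positive and negative agreement were defined.

```python
def agreement_stats(both_include: int, a_only: int, b_only: int, both_exclude: int) -> dict:
    """Agreement statistics for two reviewers making include/exclude decisions.

    a_only / b_only: records included by only reviewer A / only reviewer B.
    """
    n = both_include + a_only + b_only + both_exclude
    observed = (both_include + both_exclude) / n  # proportionate agreement
    disagreements = a_only + b_only
    # Class-specific agreement (assumed definition: 2a / (2a + b + c)).
    positive = 2 * both_include / (2 * both_include + disagreements)
    negative = 2 * both_exclude / (2 * both_exclude + disagreements)
    # Agreement expected by chance, from each reviewer's marginal include rate.
    a_rate = (both_include + a_only) / n
    b_rate = (both_include + b_only) / n
    expected = a_rate * b_rate + (1 - a_rate) * (1 - b_rate)
    kappa = (observed - expected) / (1 - expected)
    return {"observed": observed, "positive": positive,
            "negative": negative, "kappa": kappa}


# Hypothetical counts: most records are excluded by both reviewers.
stats = agreement_stats(both_include=20, a_only=10, b_only=10, both_exclude=160)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts, proportionate agreement is 0.90 while κ is about 0.61, which illustrates how a heavily imbalanced screening task can pair very high raw agreement with a moderate κ.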

Data Extraction and Analysis

A data extraction template was created in Covidence and refined through several iterations until all authors agreed on the final version. Using this template, we extracted data across 37 predetermined variables, which can be classified into 5 categories: study metadata, clinical note characteristics, methods, applications, and challenges. Two authors (ABK and HRAE) extracted data from 10% of the papers. The extracted data were compared, and inconsistencies were discussed. Concordance was high, so the remaining papers were extracted by 1 reviewer (ABK). Extracted data were analyzed descriptively, providing counts and percentages.

Study Quality Assessment

Given the scoping review methodology and our count-based analyses, a risk of bias or quality assessment was not performed [27].

Results

Search Results

Figure 1 shows the study selection process used to arrive at the included studies. A total of 10,724 records were identified from the databases. After removing duplicates, 7964 records were screened. Of these, 7607 were excluded at the title and abstract screening stage. In the full-text screening stage, 357 papers were assessed for eligibility, and 131 were excluded. Ultimately, 226 studies met the inclusion criteria.
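As a quick consistency check, the selection counts reported above can be chained together. The variable names are illustrative, and the number of duplicates removed is derived here from the reported totals rather than stated in the text.

```python
# Counts as reported in the search results paragraph.
records_identified = 10_724
records_screened = 7_964
excluded_title_abstract = 7_607
excluded_full_text = 131

# Derived values (the duplicate count is inferred, not reported).
duplicates_removed = records_identified - records_screened
full_text_assessed = records_screened - excluded_title_abstract
studies_included = full_text_assessed - excluded_full_text

print(duplicates_removed)   # 2760
print(full_text_assessed)   # 357
print(studies_included)     # 226
```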

Figure 1.

PRISMA diagram illustrating the study selection process and reasons for exclusion. EHR: electronic health record; NLP: natural language processing; PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Distribution of Studies by Country

Figure 2A illustrates the distribution of included studies based on the country of institution of affiliation of the major (first or corresponding) authors. The majority were from the United States (133/226, 58.8%), followed by China (20/226, 8.8%) and Spain (18/226, 8%).

Figure 2.

Characterization of included studies. (A) Distribution of studies by country of institution of affiliation of major authors. (B) Document type. (C) Clinical note type. (D) Language of clinical notes and other medical documents. (E) Heterogeneity of the sources of clinical notes and other medical documents. In total, 3 (1.3%) studies had insufficient information to determine the source of clinical notes. (F) Accessibility of data used by the studies (publicly available means that authors used a publicly available corpus [majority] or made their corpus publicly available). (G) Patient characteristics reported in studies. (H) Cancer types targeted by studies. CNS: central nervous system; nos: not otherwise specified.

Characterization of Clinical Notes

Figure 2B shows the document types used across included studies. Out of 226 studies, 114 (50.4%) used clinical notes exclusively, while the remainder used clinical notes and other medical documents, primarily pathology and radiology reports. Progress notes (53/226, 23.5%), consultation notes (46/226, 20.4%), and discharge summaries (45/226, 19.9%) were the most common types of clinical notes used in included studies (Figure 2C). However, in 150 of the 226 (66.4%) studies, authors either used nonspecific terms to describe clinical notes (eg, oncology, urology, and cancer clinic notes) or did not specify the clinical note type. Most of the clinical notes were written in English (156/226, 69%), Spanish (22/226, 9.7%), and Chinese (18/226, 8%; Figure 2D).

Figure 2E illustrates heterogeneity in the sources of clinical notes and other medical documents used in the studies. Most studies (161/226, 71.2%) used documents from a single institution, while 27.4% (62/226) included multi-institution data from the same country. No study used documents from more than 1 country. Regarding data availability, 128 of 226 (56.6%) studies did not provide any statement on the accessibility of the corpora used. A few studies (37/226, 16.4%) indicated that their corpus could be made available upon reasonable request, and 12.4% (28/226) either used publicly available corpora (majority) or made their corpus publicly accessible (Figure 2F).

Nearly half of the studies (110/226, 48.7%) did not provide any information about the characteristics of the patients associated with the clinical notes they used. When reported, common characteristics included age (96/226, 42.5%), sex or gender (71/226, 31.4%), race (56/226, 24.8%), cancer therapy or management (57/226, 25.2%), and cancer stage or metastasis (49/226, 21.7%; Figure 2G). The most commonly studied cancers were breast (65/226, 28.8%), lung (60/226, 26.5%), colorectal (32/226, 14.2%), and prostate (29/226, 12.8%; Figure 2H).

NLP Publications and Methods Used by Calendar Year

Figure 3 illustrates the number of studies published annually from January 2014 to March 2024, along with the NLP methods applied to clinical notes. The number of publications per year increased from 4 in 2014 to 43 in 2023.

Figure 3.

Model architectures used to analyze clinical notes over the years. Percentages are relative to the number of studies published in that year. The line graph depicts the number of published studies per year. *2024 is a partial year; it includes papers published from January 1, 2024, to March 8, 2024. It is common for researchers to use multiple methods from the same class or different classes (either as discrete models or in hybrid architectures), leading to double-counting. “Pretrained language models” refers to general-domain pretrained models (eg, BERT and GPT), while “pretrained clinical models” refers to models with domain-specific pretraining on clinical or biomedical text (eg, BioBERT, ClinicalBERT, and PubMedBERT). These categories are mutually exclusive. BERT: Bidirectional Encoder Representations from Transformers.

NLP methods have evolved over time. Between 2014 and 2017, only ontologies, rule-based approaches, and discrete models were used. Studies using neural networks were first published in 2018, followed by PLMs in 2019 (Figure 3). While neural networks, including PLMs, have gained popularity since their introduction, ontologies, rule-based approaches, and discrete models remained the most prevalent approaches throughout the review period. However, rule-based approaches and ontologies were often used in hybrid workflows, serving specific preprocessing and postprocessing roles, rather than as standalone solutions. Out of 226 studies, only 7 (3.1%) and 27 (11.9%) exclusively used ontologies and rule-based methods, respectively.

Fine-Grained Classification of NLP Methods

Ontologies were used in 87 of 226 (38.5%) studies, with domain-specific or customized dictionaries being the most common approach (42/87, 48.3%), followed by the UMLS at 41.4% (36/87; Table 1). These knowledge resources often supported machine learning and neural models by providing seed terms or domain expertise. Off-the-shelf tools such as MetaMap and Clinical Text Analysis and Knowledge Extraction System, which rely on UMLS mappings to analyze biomedical text, were also used.

Table 1.

Breakdown of methods used in included studies (N=226)^a.

Model architecture	Values, n (%)
Ontologies (n=87)
Domain-specific dictionary	42 (48.3)
Unified Medical Language System	36 (41.4)
MetaMap	16 (18.4)
cTAKES^b	10 (11.5)
NCBO^c BioPortal	7 (8)
MedTagger	3 (3.4)
Other	6 (6.9)
Rule-based (n=112)
Rules or RegEx^d	112 (100)
Discrete models (n=87)
Support vector machine	29 (33.3)
Trees	28 (32.2)
Logistic regression	18 (20.7)
Conditional random fields	16 (18.4)
Clustering	15 (17.2)
Other	11 (12.6)
Naive Bayes classifier	5 (5.7)
K-nearest neighbors classifier	3 (3.4)
Linear regression	2 (2.3)
Neural networks (n=53)
Recurrent neural network	34 (64.2)
Convolutional neural network	21 (39.6)
Feed forward neural networks	10 (18.9)
Capsule networks	1 (1.9)
Pretrained language models (n=41)
BERT^e	39 (95.1)
ChatGPT	1 (2.4)
Google Bard	1 (2.4)
Pretrained clinical models (n=23)
Clinical BERT	23 (100)

^aNumber of model types per study: 91 (40.3%) studies used 1 model type, 92 (40.7%) studies used 2 model types, 26 (11.5%) studies used 3 model types, and 10 (4.4%) studies used 4 model types. Number of model subtypes per study: 76 (33.6%) studies used 1 model subtype, 72 (31.9%) studies used 2 model subtypes, 39 (17.3%) studies used 3 model subtypes, 25 (11.1%) studies used 4 model subtypes, and 4 (1.8%) studies used 5 model subtypes. Pretrained language models are general-domain pretrained models (eg, BERT and GPT), while pretrained clinical models are models pretrained on clinical or biomedical text (eg, BioBERT, ClinicalBERT, and PubMedBERT).

^bcTAKES: Clinical Text Analysis and Knowledge Extraction System.

^cNCBO: National Center for Biomedical Ontology.

^dRegEx: regular expression.

^eBERT: Bidirectional Encoder Representations from Transformers.

Rule-based methods, including handcrafted rules and off-the-shelf tools such as clinical RegEx and ConText, were used in 112 of 226 (49.6%) studies (Table 1), making them the most prevalent approach, although they were rarely used in isolation. Rule-based approaches were used in 53 of 114 (46.5%) studies that analyzed clinical notes exclusively and in 64 of 112 (57.1%) studies that analyzed clinical notes in combination with other medical documents. Although the latter proportion was 10.6 percentage points higher, this difference was not statistically significant (2-proportion z test: z=−1.60; P=.11).
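The 2-proportion z test reported above can be reproduced from the stated counts. The sketch below assumes a pooled standard error and a two-sided P value; the review does not state which variant was used, but this choice matches the reported figures. The function name is illustrative.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z test for the difference between two proportions,
    using the pooled proportion to estimate the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 53/114 notes-only studies vs 64/112 studies with additional documents.
z, p = two_proportion_z_test(53, 114, 64, 112)
print(f"z = {z:.2f}, P = {p:.2f}")  # z = -1.60, P = 0.11
```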

Discrete models, encompassing classical machine learning and statistical methods, were used in 87 of 226 (38.5%) studies (Table 1). The most common approaches under this category included support vector machines (29/87, 33.3%), tree-based models including random forest (28/87, 32.2%), logistic regression (18/87, 20.7%), conditional random fields (16/87, 18.4%), and clustering algorithms (15/87, 17.2%). Conditional random fields were often applied as a classification layer in neural models such as long short-term memory and CNN.

Neural networks featured in 53 of 226 (23.5%) studies, with RNN (34/53, 64.2%) and CNN (21/53, 39.6%) being the most popular in this category (Table 1). RNNs were dominated by long short-term memory architectures.

PLMs were used in 41 of 226 (18.1%) studies. These were primarily BERT-based models, with only 2 of the 41 (4.9%) studies [28,29] using LLMs (ChatGPT and Google Bard; Table 1). Pretrained clinical models (BERT-based models pretrained on clinical or biomedical corpora, eg, Bio_ClinicalBERT) were used in 23 of 226 (10.2%) studies (Table 1). Among the 23 studies that implemented pretrained clinical models, 16 compared clinical domain pretrained models with general domain pretrained models. Clinical domain models outperformed general domain models in 11 of 16 (68.8%) studies, while general domain models performed better in the remaining 5 studies (Multimedia Appendix 2).

Methods for Non-English Corpora

Out of 226 studies, 70 (31%) developed models for non-English clinical notes. Of these, 59 (84.3%) implemented language-specific pipelines built from rules and classical machine learning with engineered features, including some hybrid combinations. Pretrained approaches were present but less common and not mutually exclusive across studies: language-specific pretrained models in 11 of 70 (15.7%) studies, multilingual pretrained models in 7 of 70 (10%) studies, language-specific biomedical or clinical pretrained models in 6 of 70 (8.6%) studies, and language-adapted models in 3 of 70 (4.3%) studies. Language-adapted models typically consisted of models pretrained in English and then further trained on the target language (Multimedia Appendix 3).

In total, 12 studies compared multiple model families. Language-specific biomedical or clinical pretrained models most often yielded the best performance (n=4) [30-33], followed by language-specific pretrained models (n=3) [34-36] and language-adapted pretrained models (n=2) [37,38]. In the remaining 3 studies, the best-performing models were a biomedical or clinical pretrained model [39], a language-specific model [40], and a multilingual pretrained model [41].

Text Representation Methods

Figure 4 illustrates the text representation and vectorization methods used in the studies. Out of 226 studies, 120 (53.1%) used at least 1 representation method. From 2015 to 2017, statistical methods including bag of words, n-grams, and term frequency-inverse document frequency were prevalent. In 2018, context-free embeddings (one fixed vector for each word or token regardless of the context in which it is used, eg, Word2Vec, GloVe, and FastText) and contextual embeddings (a new vector assigned to each word or token depending on the surrounding context, eg, BERT and GPT) were introduced and became the predominant approaches. It was common for studies to test multiple embedding methods to identify the best-performing approaches.
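To illustrate the statistical representations mentioned above, the following is a minimal sketch of bag of words and term frequency-inverse document frequency (TF-IDF) in plain Python. The three toy "notes" are invented, and the unsmoothed weighting used here (relative term frequency times log(N/df)) is only one of several common variants; libraries often apply smoothing or normalization.

```python
import math
from collections import Counter


def bag_of_words(tokens: list[str]) -> Counter:
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by its in-document frequency and its rarity
    across the corpus (unsmoothed variant: tf * log(N / df))."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = bag_of_words(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Invented toy notes, tokenized by whitespace.
notes = [
    "patient reports chest pain".split(),
    "patient denies chest pain".split(),
    "patient ct shows lung nodule".split(),
]
weights = tf_idf(notes)
print(bag_of_words(notes[0]))
print(weights[2])
```

Here "patient" appears in every note and receives a weight of 0, whereas "nodule" appears in only one note and is weighted most heavily there. Unlike the embeddings described above, these vectors carry no information about word meaning or context.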

Figure 4.

Text representation and embedding methods (n=120). Context-free embeddings include Word2Vec, FastText, and GloVe. N-grams include continuous bag of words, skip-gram, bigrams, and trigrams. Groups are not mutually exclusive—studies may appear in more than 1 category. TF-IDF: term frequency-inverse document frequency.

Size of Labeled Data Used to Train and Evaluate NLP Systems

Figure 5 shows the data size (clinical notes, with or without additional medical documents, and patients) used to train and evaluate NLP systems. The median number of documents per partition was fewer than 1000, and the median number of patients associated with these notes was also under 1000. For example, the median number of training documents, test documents, training patients, and test patients was 838 (IQR 439-3905), 300 (IQR 120-1504), 606 (IQR 202-1337), and 231.5 (IQR 86-599), respectively.

Training and test sets were generally created through random splits, except in 4 studies where the test cohort came from a slightly different patient population (prospective palliative radiation cohort vs metastatic cancer retrospective registry–based cohort) [42], a different but overlapping time period with the training cohort [43], a different nonoverlapping time period with the training cohort [44], or where the test cohort had a shorter follow-up time than the training cohort (4 vs 5 years) [45].
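The difference between a random split and a split by nonoverlapping time period can be sketched as follows. The record structure, field names, dates, and cutoff are hypothetical and are not taken from any included study.

```python
import random
from datetime import date


def random_split(records: list[dict], test_fraction: float = 0.2, seed: int = 0):
    """Shuffle records and hold out a fixed fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(records: list[dict], cutoff: date):
    """Train on notes written before `cutoff`, test on notes from `cutoff` onward."""
    train = [r for r in records if r["note_date"] < cutoff]
    test = [r for r in records if r["note_date"] >= cutoff]
    return train, test


# Hypothetical corpus: one note per year from 2014 to 2023.
records = [{"note_id": i, "note_date": date(2014 + i, 1, 1)} for i in range(10)]

rand_train, rand_test = random_split(records)
time_train, time_test = temporal_split(records, cutoff=date(2020, 1, 1))
print(len(rand_train), len(rand_test))  # 8 2
print(len(time_train), len(time_test))  # 6 4
```

A random split mixes notes from all periods into both partitions, whereas a temporal split tests the model on notes written after everything it was trained on, which is closer to how a deployed system would encounter new data.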

Figure 5.

Size of the data used in model development and evaluation. Documents refer to entire clinical notes or reports or sentences (a small number of studies reported corpus size in sentences). To cater for instances where train or test split was not specified, we report total data sums (ie, all documents and all patients) as provided by the authors. The number below each boxplot indicates the count of studies reporting data size in that category.

Annotation Methods for Reference Corpus

The majority of studies (181/226, 80.1%) trained and evaluated their systems on corpora that were manually annotated by humans. Few studies (7/226, 3.1%) trained models using weakly supervised labels but evaluated them on human-curated labels. A considerable proportion of studies (38/226, 16.8%) either relied on existing labels within the EHR (eg, International Classification of Diseases or ICD codes) or developed unsupervised systems, for which manual annotation was not applicable. A summary of annotation methods is provided in Multimedia Appendix 4.

Implementation Type and Evaluation

Table 2 summarizes model implementation type, evaluation metrics, and whether models were externally evaluated. Most studies (179/226, 79.2%) developed new models or retrained or fine-tuned an existing one, while 19.5% (44/226) used existing models without retraining. The latter group included studies that used off-the-shelf tools such as MetaMap or repurposed existing models for new extraction tasks.

Evaluation metrics varied by task, with the most commonly reported being recall (155/226, 68.6%), precision (153/226, 67.7%), F₁-score (136/226, 60.2%), accuracy (44/226, 19.5%), area under the receiver operating characteristic curve (40/226, 17.7%), and specificity (30/226, 13.3%). While metrics such as recall, precision, and F₁-score were widely used and therefore suitable for summarization, variability in clinical corpora and tasks precluded comparison of NLP methods across studies. Only 21 of 226 (9.3%) studies evaluated their systems on external corpora.
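As a minimal illustration of how the 3 most commonly reported metrics relate to one another, the following Python sketch derives precision, recall, and F₁-score from counts of true positives, false positives, and false negatives. The function name and the example counts are hypothetical and are not drawn from any included study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts (entity level or document level)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts for an extraction task (illustrative only).
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

Because the F₁-score is a harmonic mean, it is pulled toward the lower of precision and recall, which is one reason studies that report only F₁-score are difficult to compare with studies that report only accuracy or specificity.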

Table 2.

Model implementation and evaluation (N=226)^a.

Implementation or evaluation	Values, n (%)
Implementation type
New model	179 (79.2)
Existing model	44 (19.5)
Reported evaluation metrics
Recall	155 (68.6)
Precision	153 (67.7)
F₁-score	136 (60.2)
Accuracy	44 (19.5)
AUC-ROC^b	40 (17.7)
Specificity	30 (13.3)
Cohen κ	5 (2.2)
Cosine similarity	3 (1.3)
Mean average precision	2 (0.9)
Other	44 (19.5)
External evaluation
Yes	21 (9.3)
No	158 (69.9)

^aSome studies lacked sufficient information to assess external evaluation; for example, those that used existing tools had their detailed data documented elsewhere.

^bAUC-ROC: area under the receiver operating characteristic curve.

Clinical Applications of NLP

Figure 6A summarizes the clinical applications of NLP to clinical notes. IE was the most common task, performed in 174 of 226 (77%) studies; in 115 of 226 (50.9%) studies, NLP was used exclusively for IE. Diagnostic classification was performed in 62 of 226 (27.4%) studies, while trial or cohort matching was the goal in 16 of 226 (7.1%) studies. Other notable applications included prognostic classification (n=14), concept normalization (n=14), and topic modeling (n=11). Studies commonly undertook multiple tasks, often with the output of one task feeding into subsequent tasks.

Figure 6.

NLP clinical applications with clinical notes. (A) Number of studies per clinical application. (B) Number of clinical applications per year (percentages are relative to the number of papers published in that year). Diagnostic classification refers to document-level or patient-level classification tasks, for example, distinguishing between notes with metastasis and those without metastasis. Prognostic classification refers to predicting that some clinical event of interest will occur within a specified time period in the future, for example, lung cancer recurrence 2 years following lobectomy. NLP: natural language processing.

A subset of studies (n=15) that focused on IE also extracted temporal information. Some studies formulated this task as a document-time relation (DocTimeRel) classification, where events were assigned a temporal relation to the document creation time (before, after, overlap, or before or overlap) [46-48]. Others used an event-date relation classification formulation, classifying event-time pairs as before, after, overlap, or before or overlap [49] or directly linking events to their corresponding dates through contextual pairing [50,51]. One study constructed patient-level temporal timelines by assigning events to coarse temporal bins (way before admission, before admission, admission, after admission, and discharge) and then temporally ordered them within and across documents [52]. Less complex approaches included proximity- or context-based methods (linking events to nearby date mentions using dependency parsing and rule-based contextual heuristics) [53-58] or simply classifying identified events into broad temporal categories such as current, history, future, or unknown [59,60].
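The simplest of these formulations, proximity-based linking, can be sketched in a few lines of Python. The sketch below is a deliberately simplified stand-in that pairs each event term with the nearest date mention by character offset; it does not use the dependency parsing or contextual heuristics described in the cited studies. The function name, the restriction to ISO-formatted dates, and the example note are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: dates appear in ISO format (YYYY-MM-DD); real notes vary widely.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def link_events_to_dates(text: str, event_terms: list[str]) -> dict[str, Optional[str]]:
    """Pair each event term with the closest date mention by character offset."""
    dates = [(match.start(), match.group()) for match in DATE_PATTERN.finditer(text)]
    lowered = text.lower()
    links: dict[str, Optional[str]] = {}
    for term in event_terms:
        position = lowered.find(term.lower())
        if position == -1:
            continue  # the event is not mentioned in this note
        if not dates:
            links[term] = None  # event found, but no date to link it to
            continue
        # Choose the date whose start offset is nearest to the event mention.
        links[term] = min(dates, key=lambda d: abs(d[0] - position))[1]
    return links


# Hypothetical note text (not taken from any study corpus).
note = (
    "Lobectomy performed on 2021-03-04 without complication. "
    "At follow-up, recurrence noted on 2023-01-15."
)
links = link_events_to_dates(note, ["lobectomy", "recurrence"])
```

A purely distance-based rule of this kind fails when an unrelated date happens to sit closer to the event than the correct one, which may be why the cited studies supplemented proximity with syntactic and contextual cues.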

Figure 6B shows the evolution of clinical NLP applications over time. IE remained the predominant task throughout the years, followed by diagnostic classification. Newer applications introduced after 2018 include concept normalization, prognostic classification, and topic modeling. Task chaining, where the output of one task is used as the input for downstream tasks, was common in studies that went beyond IE. For example, in 2014, there were 4 publications of NLP applied to clinical notes. All 4 (100%) studies used NLP to extract information of some kind, 2 (50%) studies used the extracted information to match patients to clinical trials, 1 (25%) study used the extracted information for diagnostic classification, and 1 (25%) study had IE as the end point. Discrete models were almost exclusively used for these downstream tasks.

System Deployment Stage and Clinical Impact

Of the 226 reviewed studies, 224 (99.1%) developed proof-of-concept systems that were evaluated only in research settings rather than deployed in routine clinical practice. One study piloted its system in clinical practice [61], while another described the use of an NLP-based system that had recently been integrated into clinical use [62].

Since most studies were evaluated only as research implementations (ie, no real-world deployments), clinical impact was not evaluated. However, 25 of the 226 (11.1%) studies compared their systems with current practice in their research implementation. These studies reported benefits such as improved data coverage (identified more patients with the relevant attribute than from structured data alone) and completeness (curated further variables not available as structured data) [63-75], taking less time to extract relevant information [29,61,76-82], fewer clinician man-hours for certain tasks (eg, fewer clinicians needed to complete clinical audits) [29], and higher classification or prediction accuracy compared with human experts or existing methods [76,83,84]. One study that described an IE system in routine use [62] focused on characterizing use patterns, including which clinical specialties used the system and for what purposes.

Challenges and Limitations Reported by the Authors

Table 3 details challenges and limitations faced by researchers applying different NLP techniques to clinical notes. Common challenges were single-institution corpora (39/226, 17.3%), limited data (18/226, 8%), incomplete EHR data (14/226, 6.2%), label imbalance (12/226, 5.3%), rules or dictionary not comprehensive or generalizable (9/226, 4%), and word sense and abbreviation disambiguation (6/226, 2.7%). Overall, authors reported a range of challenges, some unique to the task, corpora, or methodological approach.

Table 3.

Challenges and limitations reported in studies^a.

Challenge or limitation	Values, n (%)
Single institution corpus [37,42,44,61,70,72,73,75,77,79,82,84-111]	39 (17.3)
Limited data [32,50,57,61,62,65,79,90,92,104,111-117]	18 (8)
Incomplete recording in the EHR^b [42,57,74,78,81,94,98,103,109,118-122]	14 (6.2)
Label imbalance [31,38,44,73,82,98,102,104,123-126]	12 (5.3)
Negation detection and resolution [41,74,97,119,126-131]	10 (4.4)
Dictionary or rules not comprehensive or generalizable [65,66,92,119,120,132-135]	9 (4)
Word sense or abbreviation disambiguation [48,130,131,136-138]	6 (2.7)
Variability in terminology used to describe the same concept [78,120,136,139]	4 (1.8)
Spelling errors or typos [90,130,137,140]	4 (1.8)
Imbalanced data [57,102,105,141]	4 (1.8)
Use of speculative language [117,128,136]	3 (1.3)
Use of nonstandard terminology [90,128,142]	3 (1.3)
Rarity of concepts of interest [41,45,143]	3 (1.3)
Institutional differences in documentation style or note structure [42,81,117]	3 (1.3)
Quality of human annotations [51,80]	2 (0.9)
Multilingualism in text [79,128]	2 (0.9)
Temporal reasoning (current vs historical events) [129,138]	2 (0.9)
Short notes or sentences (insufficient context for context-dependent models) [72,144]	2 (0.9)
Model computationally expensive [38]	1 (0.4)
Distant (intersentence) relations [124]	1 (0.4)
Frequency of co-occurrence of unrelated concepts [143]	1 (0.4)
Long execute-response time [145]	1 (0.4)
Very long documents (>512-token limit for BERT^c-based models) [125]	1 (0.4)
Significant n-gram method insensitive to evolution of patient’s notes over time and between patients [146]	1 (0.4)
Resolution of patient and nonpatient references [97]	1 (0.4)
Nonstandard date formats [57]	1 (0.4)

^aNegation detection and resolution includes detecting the negation itself, distant negations, and resolving the scope of the negation. Limited data encompass the following: small corpus, only a small number of patients associated with those notes, and small annotated or labeled notes for model development and evaluation. Imbalanced data refer to instances where notes are overrepresented by text from one patient group (eg, private insurance vs noninsured). Label imbalance is when one label of interest (eg, a certain biomarker) is more prevalent in the notes, hence, easily learned by the model at the expense of other labels (biomarkers). Quality of human annotations is where human annotated corpora for model training and evaluation are erroneous.

^bEHR: electronic health record.

^cBERT: Bidirectional Encoder Representations from Transformers.

Discussion

Summary of Main Findings

Research applying NLP to clinical notes in the cancer domain grew substantially during the review period, rising from 4 publications in 2014 to 43 in 2023, likely driven by the increasing availability of digital records and advances in scalable NLP methods. However, most studies relied on English language (156/226, 69%) and single-institution (161/226, 71.2%) datasets. The majority of studies originated from the United States (133/226, 58.8%), which aligns with trends in clinical NLP publishing in which the United States dominates [147]. Almost half of the studies (110/226, 48.7%) provided no information on the characteristics of patients whose clinical notes were used, while 56.6% (128/226) did not provide a statement on data sharing, limiting interpretability and reproducibility. The most commonly studied cancers (breast, lung, colorectal, and prostate) likely reflect their prevalence in the United States and hence dedicated EHR systems, which in turn increases the availability of clinical notes.

NLP methods for processing clinical notes evolved from exclusively ontology-based, rule-based, and discrete models (2014‐2017) to hybrid approaches incorporating neural networks and PLMs such as BERT (2018‐2024). Only a few studies applied LLMs, with publications starting from October 2023. Contextual embeddings have become increasingly prevalent, reflecting the wider adoption of pretrained models. Most studies used small single-institution datasets (<1000 documents or <1000 patients), likely due to challenges in accessing clinical notes. Annotation methods were mostly manual. A subanalysis of non-English corpora studies showed that the majority (59/70, 84.3%) implemented language-specific, nonpretrained models. Domain-specific pretrained clinical models were superior to other model types in the majority (11/16, 68.8%) of studies across both English and non-English corpora. Only 9.3% (21/226) of studies evaluated their systems on external datasets.

Most studies (174/226, 77%) focused on IE. A subset of these used the extracted information in downstream tasks, but the majority (115/226, 50.9%) focused solely on IE. In total, 15 studies extracted temporal information from clinical notes using various approaches, including DocTimeRel classification, event-time relation classification, and proximity- or context-based methods. No studies evaluated clinical impact following implementation, but several studies compared their systems to current practice in their respective settings (eg, manual review of notes in clinical audits) and demonstrated potential clinical utility. The most common challenge in clinical NLP was restricted access to sufficient clinical notes, reported by 17.3% (39/226) of studies.

Evolution of NLP Methods for Clinical Notes

NLP methods for clinical notes have become more diverse over time. While new deep learning–based techniques have gained popularity, they have largely complemented rather than replaced traditional methods such as rules and ontologies, resulting in widespread adoption of hybrid architectures. Prior reviews that included substantial volumes of clinical notes reported similar findings, namely, the predominance of rule-based methods alongside increasing use of hybrid architectures that combine rules with machine learning or neural networks [24,25]. However, a review of NLP applied to diagnostic (radiology) reports reported slightly different findings, with rule-based and classical machine learning methods being prevalent but often used as baselines against which deep learning approaches were compared [148].

The continued use of rule-based approaches for clinical notes likely reflects the unique challenges posed by these documents, which often require substantial preprocessing before neural models can be applied, as well as postprocessing to structure model outputs into clinically meaningful formats. The overall prevalence of rule-based methods may also partly reflect the inclusion of semistructured diagnostic reports, which—owing to their templated design and restricted, domain-specific vocabulary—are generally more amenable to rule-based processing [149]. Combining knowledge resources with deep neural models, on the other hand, may reflect authors’ efforts to enhance the explainability of predictions made by these complex networks, given the importance of explainability in health care AI [150,151] and evidence from prior work that integrating knowledge into deep learning may improve explainability [152].

Text representation methods have evolved alongside machine learning models. Earlier NLP approaches commonly relied on discrete word representations, such as term frequency-inverse document frequency and n-grams [153]. Our review shows that context-free word embeddings (eg, Word2Vec, GloVe, and FastText) were the most widely used, typically with classical machine learning models. The results also suggest that these approaches are increasingly being complemented or replaced by contextual embeddings derived from transformer-based models, which represent words as vectors that capture richer semantic and syntactic relationships.
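To make the notion of a discrete representation concrete, the following Python sketch computes term frequency-inverse document frequency weights for a toy corpus. It assumes one common variant of the weighting (term frequency as a within-document proportion and inverse document frequency as the natural logarithm of the number of documents divided by the document frequency); other variants exist, and the function name and example tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Hypothetical tokenized note fragments (illustrative only).
docs = [
    ["mass", "in", "left", "breast"],
    ["no", "mass", "seen"],
    ["left", "lung", "nodule"],
]
vectors = tf_idf(docs)
```

In this representation, each term receives the same weight regardless of its surrounding words, so "mass" in "no mass seen" is indistinguishable from "mass" in an affirmative finding. Contextual embeddings are intended to address this limitation.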

Trends in NLP Clinical Applications

NLP applications to clinical notes focused predominantly on IE, accounting for over three-quarters of included studies, with comparatively limited use in downstream clinical decision-making tasks. This emphasis reflects both the pragmatic advantages and the perceived safety of IE. By structuring free-text data into clinically meaningful variables, IE enables expert oversight, produces interpretable intermediate outputs, and supports a broad range of secondary applications, including diagnostic or prognostic modeling, cohort identification, and decision support [154]. In contrast, approaches that predict outcomes directly from unstructured text without an explicit IE step are often less transparent, constrain the incorporation of domain knowledge, and are typically optimized for a single task [155].

Potential Clinical Impact of NLP

Although none of the included studies evaluated the direct clinical impact of NLP systems on patient care following research implementation, several studies compared their systems with current clinical practice as part of their evaluation. These comparisons demonstrated the potential of NLP to support tasks such as IE, clinical auditing, and diagnostic or prognostic classification. However, most studies (161/226, 71.2%) relied on small, single-institution datasets, raising concerns about generalizability, as such models often perform less well when applied to more representative or external datasets due to differences in both population characteristics and data structure. Without extensive evaluation across diverse datasets, there remains limited evidence of real-world effectiveness, thereby impeding adoption into routine clinical use.

Beyond technical performance, the application of NLP systems to high-risk tasks, such as cancer diagnosis or risk prediction, is subject to stringent regulatory oversight as medical devices [156,157]. These regulatory requirements, together with challenges in integrating NLP systems into existing clinical workflows [158], further hinder translation into routine clinical care and help explain the limited real-world impact observed across studies.

Challenges and Opportunities in Advancing Clinical NLP

Our findings indicate that restricted access to clinical data remains the dominant barrier in oncology NLP. Access to clinical corpora is complicated by multiple barriers, including national data protection regulations governing privacy and confidentiality (eg, the General Data Protection Regulation [159] in the European Union and the Health Insurance Portability and Accountability Act [160] in the United States), additional institutional governance restrictions imposed to mitigate disclosure risk and legal liability [161], and technical obstacles such as EHR interoperability [161]. This is compounded by limited data sharing practices, with many studies providing no clear data availability statement or listing data as “available on reasonable request,” a practice that often creates substantial practical barriers, including low response rates and protracted negotiations that effectively limit access. As a result, researchers have to rely on small, single-institution datasets, resulting in proof-of-concept systems with limited generalizability.

Limited data accessibility undermines reproducibility, hinders meaningful comparison across studies, prevents the establishment of standardized benchmarks for performance evaluation, and reinforces reliance on small, single-institution datasets. Collectively, these challenges derail real-world deployment of clinical NLP systems.

Several methodological approaches have attempted to mitigate data scarcity, each with notable limitations. Transfer learning through clinical PLMs (eg, ClinicalBERT) is constrained by training on relatively small and institutionally narrow corpora, reflecting the same access limitations they aim to overcome, which can result in suboptimal performance on downstream tasks [162,163]. Publicly available deidentified datasets curated for clinical NLP shared tasks (eg, Cancer Text Mining Shared Task [164]) face similar limitations, being small and single-center.

More recently, LLMs have shown promise in mitigating data scarcity by enabling zero-shot or few-shot learning, thereby reducing dependence on large, manually annotated corpora [165,166]. However, LLMs introduce additional challenges, including the propagation of embedded biases [167], privacy breaches [168], model obsolescence and drift [168], hallucination and confidently stated falsehoods [169,170], and substantial computational and environmental costs. These shortcomings can be detrimental to clinical practice, for example, by systematically underrecommending investigations, procedures, or treatments for underrepresented patient groups. Therefore, research on LLMs should also focus on addressing these ethical concerns in addition to technical performance and generalizability.

Model-centric privacy-preserving approaches, such as federated learning, where models are trained locally and aggregated without sharing raw data [171], offer a potential pathway toward multi-institutional collaboration without direct data transfer. However, practical deployment remains challenging, requiring compatible infrastructure, sustained institutional partnerships, and strategies to manage data heterogeneity and site imbalance, which can bias global models toward dominant contributors and degrade performance for underrepresented populations [172]. Related techniques, such as differential privacy, may further reduce reidentification risk but introduce trade-offs between privacy protection and model utility that must be carefully managed [173].
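The aggregation step can be illustrated with a short Python sketch. It assumes the sample-size-weighted averaging rule used in federated averaging, which is only one of several possible aggregation rules; the function name, parameter vectors, and site sizes are hypothetical.

```python
def federated_average(site_weights: list[list[float]], site_sizes: list[int]) -> list[float]:
    """Combine locally trained parameter vectors into a global vector,
    weighting each site by its number of training examples."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical parameters from 3 hospitals of very different sizes.
site_weights = [[0.2, 1.0], [0.4, 0.0], [1.0, -1.0]]
site_sizes = [800, 150, 50]
global_weights = federated_average(site_weights, site_sizes)
```

Only the parameter vectors leave each site, not the notes themselves. In this example, the global parameters (0.27 and 0.75) sit close to those of the largest site, which illustrates how site imbalance can bias the global model toward dominant contributors.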

Beyond algorithmic solutions, structural and policy-level interventions are likely to be critical. National initiatives, such as those implemented in Denmark, where clinical notes are rigorously deidentified and made accessible within secure research environments [69], demonstrate the feasibility of balancing privacy protection with research utility. Broader adoption of such frameworks, alongside clearer institutional agreements that permit sharing of rigorously deidentified clinical text and accompanying code, could substantially improve reproducibility and accelerate progress in oncology NLP. Furthermore, to support open science in oncology, future NLP studies should report data access conditions more transparently and, where feasible, publicly release their code, alongside clear governance mechanisms to balance reproducibility with patient privacy.

Limitations of the Review

This review has several limitations. First, approximately half of the included studies analyzed clinical notes alongside more structured medical documents such as pathology or radiology reports. These document types differ substantially in linguistic complexity, with diagnostic reports often being more templated and semistructured compared to free-text clinical notes such as progress notes or discharge summaries. As a result, the NLP methods and challenges reported in such studies may not be fully representative of those encountered when analyzing highly unstructured clinical narratives.

Second, we were unable to determine the proportion of clinical notes versus other document types in each study, as this was rarely reported. While we distinguished document types where possible, inconsistent reporting limited further quantification of these documents. Consequently, our findings reflect the broader landscape of clinical text processing in oncology rather than exclusively characterizing NLP applied to highly unstructured clinical notes. Nonetheless, we provide a more faithful representation of pregenerative AI methodological choices and challenges associated with clinical notes, as all included studies incorporated clinical notes. In addition, we could not systematically compare model performance across studies due to substantial heterogeneity in corpora and NLP tasks.

Third, the predominance of studies authored by researchers from the United States (133/226, 58.8%), primarily using local datasets, may have introduced some geographical and system-level bias. Our findings are therefore more reflective of the US health care context, including workflows, documentation styles, clinical note structures, and data access provisions.

Finally, Cohen κ for title or abstract screening (0.54) and full-text screening (0.58) indicated moderate interrater agreement. This primarily reflects challenges in operationalizing eligibility criteria. In particular, disagreement frequently arose from ambiguity in how studies described their textual data sources, as some authors used the term “clinical notes” broadly to refer to any textual medical document, including diagnostic reports. This was exacerbated by limited methodological detail in the abstract, making it difficult to determine whether clinical notes were included. Despite this, class-specific agreement for exclusions at title or abstract screening was high (97.9%), while agreement for included studies improved substantially at full-text screening (86.3%) once detailed information was available. The moderate κ values could therefore be partly attributed to class imbalance inherent to evidence synthesis, as most records are excluded at the title or abstract, and κ adjusts for agreement expected by chance.
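The effect of class imbalance on κ can be shown with a small worked example in Python. The counts below are hypothetical and are not the screening counts from this review; they were chosen only to show that 2 reviewers can agree on 95% of records and still obtain a moderate κ when most records are excluded.

```python
def cohen_kappa(both_include: int, a_only: int, b_only: int, both_exclude: int) -> tuple[float, float]:
    """Return (observed agreement, Cohen kappa) for 2 reviewers making
    include or exclude decisions on the same set of records."""
    n = both_include + a_only + b_only + both_exclude
    observed = (both_include + both_exclude) / n
    a_include = (both_include + a_only) / n
    b_include = (both_include + b_only) / n
    # Agreement expected by chance, given each reviewer's marginal rates.
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    # Note: kappa is undefined when expected agreement equals 1.
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical screening of 1000 records, most of which both reviewers exclude.
observed, kappa = cohen_kappa(both_include=30, a_only=25, b_only=25, both_exclude=920)
```

Here, observed agreement is 0.95, but chance agreement is already about 0.90 because both reviewers exclude most records, leaving a κ of approximately 0.52.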

Conclusions

This review establishes a comprehensive pregenerative AI baseline for NLP applied to clinical notes in oncology. Over the past decade, research volume increased substantially, and methods evolved from rule-based approaches to hybrid architectures incorporating rules and neural networks, including PLMs. However, most studies focused on IE rather than diagnosis or prognostication, relied on small single-institution datasets, and lacked external validation. While several systems demonstrated superior performance compared to current practice in research settings, significant barriers to clinical deployment remain, including limited generalizability, poor reproducibility, and restricted data access. Emerging generative AI approaches will need to address these barriers, as well as broader ethical challenges, to enable the translation of NLP systems into clinical settings for real-world impact.

Acknowledgments

The authors thank Paula Funnell (Academic Skills and Liaison Librarian, Faculty of Medicine and Dentistry, Queen Mary University of London, Whitechapel Campus) for her assistance in developing the search strategy. The authors used ChatGPT (a generative artificial intelligence tool developed by OpenAI) to refine the Python code used to plot figures and to edit selected sections of the manuscript to improve grammar, sentence structure, and brevity. All outputs (code and text) were checked and, where necessary, revised by the authors.

Funding

This study was conducted without any funding. However, the first author (ABK) completed this study as part of his PhD funded by the Wellcome Trust through the Health Data in Practice Doctoral Training Programme at Queen Mary University of London (grant 218584/Z/19/Z).

Data Availability

All data generated and analyzed during this study are included in this published paper as Multimedia Appendix 2.

Authors' Contributions

ABK conceptualized and designed the study under the supervision of GF. KL, HRAE, FMW, and CC reviewed the study methodology. ABK performed the database searches and reference retrieval. ABK and HRAE completed the title or abstract screening, full-text screening, and data extraction, and analyzed and interpreted the data. ABK drafted the manuscript, and GF, HRAE, KL, FMW, and CC reviewed the draft. ABK revised the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest

None declared.

Abbreviations

AI: artificial intelligence
BERT: Bidirectional Encoder Representations from Transformers
CNN: convolutional neural network
EHR: electronic health record
ICD: International Classification of Diseases
IE: information extraction
LLM: large language model
NLP: natural language processing
PLM: pretrained language model
PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews
RNN: recurrent neural network
UMLS: Unified Medical Language System

References

1. GLOBOCAN 2020: new global cancer data. UICC. URL: https://www.uicc.org/news/globocan-2020-new-global-cancer-data [accessed 2023-12-19]

2. Worldwide cancer incidence statistics. Cancer Research UK. URL: https://www.cancerresearchuk.org/health-professional/cancer-statistics/worldwide-cancer/incidence#heading-One [accessed 2023-12-19]

3. Kim, Rubinstein, Nead, Wojcieszynski, Gabriel, Warner. The evolving use of electronic health records (EHR) for research. Semin Radiat Oncol. 2019 Oct;29(4):354-361. doi: 10.1016/j.semradonc.2019.05.010. PMID: 31472738.

4. Structured vs unstructured data in healthcare. HealthTech. URL: https://healthtechmagazine.net/article/2023/05/structured-vs-unstructured-data-in-healthcare-perfcon#:~:text=Bring%20order%20to%20unstructured%20data,two%20dozen%20ICD-10%20codes [accessed 2025-01-07]

5. Tayefi, Ngo, Chomutare. Challenges and opportunities beyond structured data in analysis of electronic health records. WIREs Comput Stats. 2021 Nov;13(6). doi: 10.1002/wics.1549.

6. Meystre, Savova, Kipper-Schuler, Hurdle. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008;17:128-144. doi: 10.1055/s-0038-1638592. PMID: 18660887.

7. Perera, Sheth, Thirunarayan. Challenges in understanding clinical notes: why NLP engines fall short and where background knowledge can help. Presented at: International Conference on Information and Knowledge Management, Proceedings; Nov 3-7, 2013. doi: 10.1145/2512410.2512427.

8. Madan, Lentzen, Brandt, Rueckert, Hofmann-Apitius, Fröhlich. Transformer models in biomedicine. BMC Med Inform Decis Mak. 2024 Jul 29;24(1):214. doi: 10.1186/s12911-024-02600-5. PMID: 39075407.

9. Klotzman, Kunze, Torre, Riccoboni. The difficulties of clinical NLP. In: Engineering Mathematics and Artificial Intelligence: Foundations, Methods, and Applications. CRC Press; 2023:413-423. doi: 10.1201/9781003283980-17. ISBN: 9781032255675.

10. Devlin, Chang, Lee, Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. Presented at: NAACL HLT 2019: 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference; Jun 2-7, 2019. doi: 10.18653/v1/N19-1423.

11. Brown, Mann, Ryder. Language models are few-shot learners. Presented at: 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Dec 8-10, 2020. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf [accessed 2026-04-24]

12. Touvron, Lavril, Izacard. LLaMA: open and efficient foundation language models. Preprint posted online on Feb 27, 2023. doi: 10.48550/arXiv.2302.13971.

13. Tariq, Sikha, Kurian. Open-source hybrid large language model integrated system for extraction of breast cancer treatment pathway from free-text clinical notes. JCO Clin Cancer Inform. 2025 Jun;9(9):e2500002. doi: 10.1200/CCI-25-00002. PMID: 40577660.

14. Naeem. SBDH-Reader: a large language model-powered method for extracting social and behavioral determinants of health from clinical notes. J Am Med Inform Assoc. 2025 Oct 1;32(10):1570-1580. doi: 10.1093/jamia/ocaf124.

15. Kaster, Hillis. Comparison of rule- and large language model-based phenotype extraction from clinical notes for neurofibromatosis type 1. J Am Med Inform Assoc. 2025 Nov 1;32(11):1663-1673. doi: 10.1093/jamia/ocaf155.

16. Chen, Alnassar, Avison, Huang, Raman. Large language model applications for health information extraction in oncology: scoping review. JMIR Cancer. 2025 Mar 28;11:e65984. doi: 10.2196/65984. PMID: 40153782.

17. Zhong, Chen. Large language models in lung cancer: systematic review. J Med Internet Res. 2025 Sep 30;27:e74177. doi: 10.2196/74177. PMID: 41026980.

18. Hao, Qiu, Holmes. Large language model integrations in cancer decision-making: a systematic review and meta-analysis. NPJ Digit Med. 2025 Jul 17;8(1):450. doi: 10.1038/s41746-025-01824-7. PMID: 40676129.

19. Wang, Wen. Assessment of electronic health record for cancer research and patient care through a scoping review of cancer natural language processing. JCO Clin Cancer Inform. 2022 Jul;6:e2200006. doi: 10.1200/CCI.22.00006. PMID: 35917480.

20. Zhang, Weng, Wang. Natural language processing applications for computer-aided diagnosis in oncology. Diagnostics (Basel). 2023 Jan 12;13(2):286. doi: 10.3390/diagnostics13020286. PMID: 36673096.

21. Gholipour, Khajouei, Amiri, Hajesmaeel Gohari, Ahmadian. Extracting cancer concepts from clinical notes using natural language processing: a systematic review. BMC Bioinformatics. 2023 Oct 29;24(1):405. doi: 10.1186/s12859-023-05480-0. PMID: 37898795.

22. Sangariyavanich, Ponthongmak, Tansawet. Systematic review of natural language processing for recurrent cancer detection from electronic medical records. Inform Med Unlocked. 2023;41:101326. doi: 10.1016/j.imu.2023.101326.

23. Wang, Rastegar-Mojarad. Clinical information extraction applications: a literature review. J Biomed Inform. 2018 Jan;77:34-49. doi: 10.1016/j.jbi.2017.11.011. PMID: 29162496.

24. Sim, Huang, Horan. Natural language processing with machine learning methods to analyze unstructured patient-reported outcomes derived from electronic health records: a systematic review. Artif Intell Med. 2023 Dec;146:102701. doi: 10.1016/j.artmed.2023.102701. PMID: 38042599.

25. Sheikhalishahi, Miotto, Dudley, Lavelli, Rinaldi, Osmani. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inform. 2019 Apr 27;7(2):e12239. doi: 10.2196/12239. PMID: 31066697.

26. Tricco, Lillie, Zarin. PRISMA Extension for Scoping Reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018 Oct 2;169(7):467-473. doi: 10.7326/M18-0850. PMID: 30178033.

27. Munn, Peters MDJ, Stern, Tufanaru, McArthur, Aromataris. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med Res Methodol. 2018 Nov 19;18(1):143. doi: 10.1186/s12874-018-0611-x. PMID: 30453902.

28. Sultan, Al-Abdallat, Alnajjar. Using ChatGPT to predict cancer predisposition genes: a promising tool for pediatric oncologists. Cureus. 2023 Oct;15(10):e47594. doi: 10.7759/cureus.47594. PMID: 38021917.

29. McGowan, Correia Martins, Keen. Can natural language processing be effectively applied for audit data analysis in gynaecological oncology at a UK cancer centre? Int J Med Inform. 2024 Feb;182:105306. doi: 10.1016/j.ijmedinf.2023.105306. PMID: 38065003.

30. Solarte-Pabon, Blazquez-Herranz, Torrente, Rodriguez-Gonzalez, Provencio, Menasalvas. Extracting cancer treatments from clinical text written in Spanish: a deep learning approach. Presented at: 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA); Oct 6-9, 2021. doi: 10.1109/DSAA53316.2021.9564137.

31. Paolo, Bria, Greco. Named entity recognition in Italian lung cancer clinical reports using transformers. Presented at: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Dec 5-8, 2023; Istanbul, Turkiye. doi: 10.1109/BIBM58861.2023.10385778.

32. Zhang. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. PMID: 31627032.

33. Rivera-Zavala, Martinez. Deep neural model with contextualized-word embeddings for named entity recognition in Spanish clinical text. Presented at: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) CEUR Workshop Proceedings (CEUR-WS.org); Sep 22-24, 2024.

Zelina

Halámková

Nováček

Extraction, labeling, clustering, and semantic mapping of segments from clinical notes

IEEE Transon Nanobioscience2023224781788

10.1109/TNB.2023.3275195

Araki

Matsumoto

Togo

Developing artificial intelligence models for extracting oncologic outcomes from Japanese electronic health records

Adv Ther202303403934950

10.1007/s12325-022-02397-7

36547809

García-Pablos

Perez

Vicomtech at CANTEMIST 2020

Proceedings of the Iberian Languages Evaluation Forum (IberLEF2020) CEUR Workshop Proceedings (CEUR-WS.org)

Sep 22-24, 2020

Karlsson

Ellonen

Irjala

Impact of deep learning-determined smoking status on mortality of cancer patients: never too late to quit

ESMO Open20210663100175

10.1016/j.esmoop.2021.100175

34091262

Chapman

Neumann

Automatic ICD code classification with label description attention mechanism

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) CEUR Workshop Proceedings (CEUR-WS.org)

Sep 22-24, 2020

Solarte-Pabón

Montenegro

García-Barragán

Transformers for extracting breast cancer information from Spanish clinical narratives

Artif Intell Med202309143102625

10.1016/j.artmed.2023.102625

37673566

Osborne

O’leary

Del

Identification of cancer entities in clinical text combining transformers with dictionary features

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) CEUR Workshop Proceedings (CEUR-WS.org)

Sep 22-24, 2020

Solarte Pabón

Montenegro

Torrente

Rodríguez González

Provencio

Menasalvas

Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach

PeerJ Comput Sci20228e913

10.7717/peerj-cs.913

Banerjee

Gensheimer

Wood

Probabilistic prognostic estimates of survival in metastatic cancer patients (PPES-Met) utilizing free-text clinical narratives

Sci Rep20180738110037

10.1038/s41598-018-27946-5

29968730

Gray

Ottesen

Currey

Leveraging an informatics approach to identify an unmet clinical need for BRCA1/2 testing among patients with ovarian cancer

JCO Clin Cancer Inform2022096e2200034

10.1200/CCI.22.00034

36049148

Kaka

Michalopoulos

Subendran

Pretrained neural networks accurately identify cancer recurrence in medical record

Stud Health Technol Inform202205252949397

10.3233/SHTI220403

35612023

Banerjee

Bozkurt

Caswell-Jin

Kurian

Rubin

Natural language processing approaches to detect the timeline of metastatic recurrence of breast cancer

JCO Clin Cancer Inform2019103112

10.1200/CCI.19.00034

31584836

Velupillai

Mowery

Abdelrahman

Christensen

Chapman

BluLab: temporal information extraction for the 2015 clinical tempeval challenge

Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Jun 4-5, 2015

10.18653/v1/S15-2137

Miller

Laparra

Bethard

Ben-David

Cohen

McDonald

Domain adaptation in practice: lessons from a real-world information extraction pipeline

Proceedings of the Second Workshop on Domain Adaptation for NLP

Aug 1, 2019

Hong

Davoudi

Mowery

Annotation and extraction of age and temporally-related events from clinical histories

BMC Med Inform Decis Mak2020123020Suppl 11338

10.1186/s12911-020-01333-5

33380319

Long

Wang

A system for automatically extracting clinical events with temporal information

BMC Med Inform Decis Mak202012201113

10.1186/s12911-020-01208-9

Bitterman

Goldner

Finan

An end-to-end natural language processing system for automatically extracting radiation therapy events from clinical texts

Int J Radiat Oncol Biol Phys20230911171262273

10.1016/j.ijrobp.2023.03.055

36990288

Adamson

Waskom

Blarre

Approach to machine learning for extraction of real-world data variables from electronic health records

Front Pharmacol2023141180962

10.3389/fphar.2023.1180962

37781703

Raghavan

Chen

Fosler-Lussier

Lai

How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?

AMIA Jt Summits Transl Sci Proc20142014218218223

25717416

Solarte Pabón

Torrente

Provencio

Rodríguez-Gonzalez

Menasalvas

Integrating speculation detection and deep learning to extract lung cancer diagnosis from clinical notes

Appl Sci (Basel)2021112865

10.3390/app11020865

Guin

Jun

Patel

Extraction of treatment information from electronic health records and evaluation of testosterone recovery in patients with prostate cancer

JCO Clin Cancer Inform2022066e2200010

10.1200/CCI.22.00010

35696627

Najafabadipour

Zanin

Rodríguez-González

Reconstructing the patient’s natural history from electronic health records

Artif Intell Med202005105101860

10.1016/j.artmed.2020.101860

32505419

Wang

Wampfler

Dispenzieri

Yang

Liu

Achievability to extract specific date information for cancer research

AMIA Annu Symp Proc20192019893893902

32308886

Sholle

Krichevsky

Scandura

Campion

Extracting and classifying diagnosis dates from clinical notes: a case study

J Biomed Inform202010110103569

10.1016/j.jbi.2020.103569

32949781

Solarte-Pabon

Torrente

Rodriguez-Gonzalez

Provencio

Menasalvas

Tunas

Lung cancer diagnosis extraction from clinical notes written in Spanish

2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS)

Jul 28-30, 2020

10.1109/CBMS49503.2020.00099

Rumeng

Abhyuday N

Hong

A hybrid neural network model for joint prediction of presence and period assertions of medical events in clinical notes

AMIA Annu Symp Proc20172017114911491158

29854183

Palmer

Hassanpour

Higgins

Doherty

Onega

Building a tobacco user registry by extracting multiple smoking behaviors from clinical notes

BMC Med Inform Decis Mak20190725191141

10.1186/s12911-019-0863-3

31340796

Feld

A natural language processing-assisted extraction system for Gleason scores: development and usability study

JMIR Cancer202107273e27970

10.2196/27970

34255641

Biron

Metzger

Pezet

Sebban

Barthuet

Durand

An information retrieval system for computerized patient records in the context of a daily hospital practice: the example of the Léon Bérard Cancer Center (France)

Appl Clin Inform201451191205

10.4338/ACI-2013-08-CR-0065

24734133

Zhu

Teh

Armenian

Knowledge extraction of long-term complications from clinical narratives of blood cancer patients with HCT treatments

20180815

BCB ’18: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

Aug 29 to Sep 1, 2018

10.1145/3233547.3233635

Osborne

Wyatt

Westfall

Willig

Bethard

Gordon

Efficient identification of nationally mandated reportable cancer cases using natural language processing and machine learning

J Am Med Inform Assoc20161123610771084

10.1093/jamia/ocw006

27026618

Cohen

Rosic

Harrison

A natural language processing algorithm to improve completeness of ECOG performance status in real-world data

Appl Sci (Basel)202313106209

10.3390/app13106209

Tamang

Patel

Blayney

Detecting unplanned care from clinician notes in electronic health records

J Oncol Pract201505113e3139

10.1200/JOP.2014.002741

25980019

Karimi

Blayney

Kurian

Development and use of natural language processing for identification of distant cancer recurrence and sites of distant recurrence using unstructured electronic health record data

JCO Clin Cancer Inform20210455469478

10.1200/CCI.20.00165

33929889

Hernandez-Boussard

Tamang

Blayney

Brooks

Shah

New paradigms for patient-centered outcomes research in electronic medical records: an example of detecting urinary incontinence following prostatectomy

EGEMS (Wash DC)2016431231

10.13063/2327-9214.1231

27347492

Hjaltelin

Novitski

Jørgensen

Pancreatic cancer symptom trajectories from Danish registry data and free text in electronic health records

Elife2023112112e84919

10.7554/eLife.84919

37988407

Wang

Ruan

Yang

Liu

Comparison of three information sources for smoking information in electronic health records

Cancer Inform201615237242

10.4137/CIN.S40604

27980387

Prado

Kessler

Symptoms and signs of lung cancer prior to diagnosis: case-control study using electronic health records from ambulatory care within a large US-based tertiary care centre

BMJ Open20230420134e068832

10.1136/bmjopen-2022-068832

37080616

Shi

Morgan

Bradshaw

Identifying patients who meet criteria for genetic testing of hereditary cancers based on structured and unstructured family health history data in the electronic health record: natural language processing approach

JMIR Med Inform20220811108e37842

10.2196/37842

35969459

Bozkurt

Magnani

Seneviratne

Brooks

Hernandez-Boussard

Expanding the secondary use of prostate cancer real world data: automated classifiers for clinical and pathological stage

Front Digit Health20224793316

10.3389/fdgth.2022.793316

35721793

Breitenstein

Liu

Maxwell

Pathak

Zhang

Electronic health record phenotypes for precision medicine: perspectives and caveats from treatment of breast cancer at a single institution

Clin Transl Sci2018011118592

10.1111/cts.12514

29084368

Liu

McCoy

Aldrich

Leveraging natural language processing to identify eligible lung cancer screening patients with the electronic health record

Int J Med Inform202309177105136

10.1016/j.ijmedinf.2023.105136

37392712

Schiappa

Contu

Culie

Validation of RUBY for breast cancer knowledge extraction from a large French electronic medical record system

JCO Clin Cancer Inform2023057e2200130

10.1200/CCI.22.00130

37235837

Beck

Rammage

Jackson

Artificial intelligence tool for optimizing eligibility screening for clinical trials in a large community cancer center

JCO Clin Cancer Inform20200145059

10.1200/CCI.19.00079

31977254

Lindvall

Lilley

Zupanc

Natural language processing to assess end-of-life quality indicators in cancer patients receiving palliative surgery

J Palliat Med201902222183187

10.1089/jpm.2018.0326

30328764

Gauthier

Law

Automating access to real-world evidence

JTO Clin Res Rep20220636100340

10.1016/j.jtocrr.2022.100340

35719866

Lin

Zwolinski

JTY

Machine learning-based natural language processing to extract PD-L1 expression levels from clinical notes

Health Informatics J202329314604582231198021

10.1177/14604582231198021

37635280

Lindvall

Deng

Moseley

Natural language processing to identify advance care planning documentation in a multisite pragmatic clinical trial

J Pain Symptom Manage202201631e29e36

10.1016/j.jpainsymman.2021.06.025

34271146

Wang

Cui

Zhu

Evaluation of an artificial intelligence-based clinical trial matching system in Chinese patients with hepatocellular carcinoma: a retrospective study

BMC Cancer202424117

10.1186/s12885-024-11959-7

Lin

FPY

Salih

OSM

Scott

Jameson

Epstein

Development and validation of a machine learning approach leveraging real-world clinical narratives as a predictor of survival in advanced cancer

JCO Clin Cancer Inform202210PMID636265112

10.1200/CCI.22.00064

Gensheimer

Aggarwal

Benson

KRK

Automated model versus treating physician for predicting survival time of patients with metastatic cancer

J Am Med Inform Assoc2021061228611081116

10.1093/jamia/ocaa290

Moseley

Welt

A corpus for detecting high-context medical conditions in intensive care patient notes focusing on frequently readmitted patients

Preprint posted online on Mar 6, 2020

10.48550/arXiv.2003.03044

Development and validation of method for defining conditions using Chinese electronic medical record

BMC Med Inform Decis Mak2016082016110

10.1186/s12911-016-0348-6

27542973

Poort

Zupanc

Leiter

Wright

Lindvall

Documentation of palliative and end-of-life care process measures among young adults who died of cancer: a natural language processing approach

J Adolesc Young Adult Oncol20200291100104

10.1089/jayao.2019.0040

31411524

Ernecoff

Wessell

Hanson

Electronic health record phenotypes for identifying patients with late-stage disease: a method for research and clinical application

J Gen Intern Med201912341228182823

10.1007/s11606-019-05219-9

31396813

Warner

Levy

Neuss

Warner

Levy

Neuss

ReCAP: feasibility and accuracy of extracting cancer stage information from narrative electronic health record data

J Oncol Pract201602122157158

10.1200/JOP.2015.004622

26306621

Chen

Song

Shao

Ding

Using natural language processing to extract clinically useful information from Chinese electronic medical records

Int J Med Inform201904124612

10.1016/j.ijmedinf.2019.01.004

30784428

Kondratieff

Brown

Barron

Warner

Yin

Mining medication use patterns from clinical notes for breast cancer patients through a two-stage topic modeling approach

AMIA Jt Summits Transl Sci Proc20222022303303312

35854740

Hong

Fairchild

Tanksley

Palta

Tenenbaum

Natural language processing for abstraction of cancer treatment toxicities: accuracy versus human experts

JAMIA Open2021021534513517

10.1093/jamiaopen/ooaa064

Gregg

Lang

Wang

Automating the determination of prostate cancer risk strata from electronic medical records

JCO Clin Cancer Inform20171118

10.1200/CCI.16.00045

29541700

Banerjee

Magnani

Blayney

Brooks

Hernandez-Boussard

Clinical documentation to predict factors associated with urinary incontinence following prostatectomy for prostate cancer

Res Rep Urol202012714

10.2147/RRU.S234178

32158720

Bozkurt

Kan

Ferrari

Is it possible to automatically assess pretreatment digital rectal examination documentation using natural language processing? A single-centre retrospective study

BMJ Open2019071897e027182

10.1136/bmjopen-2018-027182

31324681

Laios

Kalampokis

Mamalis

RoBERTa-assisted outcome prediction in ovarian cancer cytoreductive surgery using operative notes

Cancer Control20233010732748231209892

10.1177/10732748231209892

37915208

Joffe

Pettigrew

Herskovic

Bearden

Bernstam

Expert guided natural language processing using one-class classification

J Am Med Inform Assoc201509225962966

10.1093/jamia/ocv010

26063744

Coquet

Bozkurt

Kan

Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients

J Biomed Inform20190694103184

10.1016/j.jbi.2019.103184

31014980

Bozkurt

Park

Kan

An automated feature engineering for digital rectal examination documentation using natural language processing

AMIA Annu Symp Proc20182018288288294

30815067

100

Sanyal

Tariq

Kurian

Rubin

Banerjee

Weakly supervised temporal model for prediction of breast cancer distant recurrence

Sci Rep20210541119461

10.1038/s41598-021-89033-6

33947927

101

Kehl

Gusev

Artificial intelligence-aided clinical annotation of a large multi-cancer genomic dataset

Nat Commun202112151217304

10.1038/s41467-021-27358-6

34911934

102

Chen

Guevara

Ramirez

Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy

JCO Clin Cancer Inform2023077e2300048

10.1200/CCI.23.00048

37506330

103

Lindvall

Deng

Agaronnik

Deep learning for cancer symptoms monitoring on the basis of electronic health record unstructured clinical notes

JCO Clin Cancer Inform2022066e2100136

10.1200/CCI.21.00136

35714301

104

Yim

Kwan

Johnson

Yetisgen

Classification of hepatocellular carcinoma stages from free-text clinical and radiology reports

AMIA Annu Symp Proc20172017185818581867

29854257

105

Derton

Guevara

Chen

Natural language processing methods to empirically explore social contexts and needs in cancer patient notes

JCO Clin Cancer Inform2023057e2200196

10.1200/CCI.22.00196

37235847

106

Khor

Nguyen

O’Dwyer

Extracting tumour prognostic factors from a diverse electronic record dataset in genito-urinary oncology

Int J Med Inform2019011215357

10.1016/j.ijmedinf.2018.10.008

30545489

107

Delorme

Charvet

Wartelle

Natural language processing for patient selection in Phase I or II oncology clinical trials

JCO Clin Cancer Inform2021065709718

10.1200/CCI.21.00003

34197179

108

Kehl

Lepisto

Natural language processing to ascertain cancer outcomes from medical oncologist notes

JCO Clin Cancer Inform2020084680690

10.1200/CCI.20.00020

32755459

109

DiMartino

Miano

Wessell

Bohac

Hanson

Identification of uncontrolled symptoms in cancer patients using natural language processing

J Pain Symptom Manage202204634610617

10.1016/j.jpainsymman.2021.10.014

34743011

110

Zeng

Banerjee

Henry

Natural language processing to identify cancer treatments with electronic medical records

JCO Clin Cancer Inform2021045379393

10.1200/CCI.20.00173

33822653

111

Bozkurt

Paul

Coquet

Phenotyping severity of patient-centered outcomes using clinical notes: a prostate cancer use case

Learn Health Syst20201044e10237

10.1002/lrh2.10237

33083539

112

Meystre

Heider

Cates

Piloting an automated clinical trial eligibility surveillance and provider alert system based on artificial intelligence and standard data models

BMC Med Res Methodol2023041123188

10.1186/s12874-023-01916-6

37041475

113

Araki

Matsumoto

Togo

Real-world treatment response in Japanese patients with cancer using unstructured data from electronic health records

Health Technol202303132253262

10.1007/s12553-023-00739-1

114

Guan

Cho

Petro

Zhang

Pasche

Topaloglu

Natural language processing and recurrent network models for identifying genomic mutation-associated cancer treatment change from patient progress notes

JAMIA Open20190421139149

10.1093/jamiaopen/ooy061

30944913

115

An investigation of single-domain and multidomain medication and adverse drug event relation extraction from electronic health record notes using advanced deep learning models

J Am Med Inform Assoc2019071267646654

10.1093/jamia/ocz018

116

Dai

Wang

Chen

Jonnagaddala

Cohort selection for clinical trials using multiple instance learning

J Biomed Inform202007107103438

10.1016/j.jbi.2020.103438

32360937

117

Forsyth

Barzilay

Hughes

Machine learning methods to extract documentation of breast cancer symptoms from electronic health records

J Pain Symptom Manage20180655614921499

10.1016/j.jpainsymman.2018.02.016

29496537

118

Yuan

Cai

Hong

Performance of a machine learning algorithm using electronic health record data to identify and estimate survival in a longitudinal cohort of patients with lung cancer

JAMA Netw Open202107147e2114723

10.1001/jamanetworkopen.2021.14723

34232304

119

Banerjee

Seneviratne

Weakly supervised natural language processing for assessing patient-centered outcome following prostate cancer treatment

JAMIA Open20190421150159

10.1093/jamiaopen/ooy057

31032481

120

Agaronnik

Lindvall

El-Jawahri

Iezzoni

Challenges of developing a natural language processing method with electronic health records to identify persons with chronic mobility disability

Arch Phys Med Rehabil2020101011017391746

10.1016/j.apmr.2020.04.024

32446905

121

Leis

Casadevall

Albanell

Exploring the association of cancer and depression in electronic health records: combining encoded diagnosis and mining free-text clinical notes

JMIR Cancer2022071183e39003

10.2196/39003

35816382

122

Lin

FPY

Pokorny

Teng

Epstein

TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records

Sci Rep20170731716918

10.1038/s41598-017-07111-0

28761061

123

Redd

Shao

Zeng-Treitler

Identification of colorectal cancer using structured and free text clinical data

Health Informatics J202228414604582221134406

10.1177/14604582221134406

36300566

124

Liu

Pradhan

Druhl

Learning to detect and understand drug discontinuation events from clinical narratives

J Am Med Inform Assoc20191012610943951

10.1093/jamia/ocz048

31034028

125

Liu

Kulkarni

Witteveen-Lane

Chen

Chesla

MetBERT: a generalizable and pre-trained deep learning model for the prediction of metastatic cancer from clinical notes

AMIA Jt Summits Transl Sci Proc20222022331331338

35854741

126

Koleck

Topaz

Tatonetti

Characterizing shared and distinct symptom clusters in common chronic conditions through natural language processing of nursing notes

Res Nurs Health202112446906919

10.1002/nur.22190

34637147

127

Ehrentraut

Sundström

Dalianis

Exploration of known and unknown early symptoms of cervical cancer and development of a symptom spectrum—outline of a data and text mining based approach

Proceeding from CAiSE 2015 Industriy Track CEUR Workshop Proc

Jun 8-12, 2015

Stockholm, Sweden

3444

128

Lazic

Jakovljevic

Boban

Nosek

Loncar-Turukalo

Information extraction from clinical records: an example for breast cancer

2022 IEEE 21st Mediterranean Electrotechnical Conference (MELECON)

Jun 14-16, 2022

10.1109/MELECON53508.2022.9842995

129

Stevens

Kennedy

Churches

Applying and improving a publicly available medication NER pipeline in a clinical cancer EMR

Stud Health Technol Inform20240125310679684

10.3233/SHTI231051

38269895

130

Luo

Gandhi

Storey

Zhang

Han

Huang

A computational framework to analyze the associations between symptoms and cancer patient attributes post chemotherapy using EHR data

IEEE J Biomed Health Inform202111251140984109

10.1109/JBHI.2021.3117238

34613922

131

Luo

Storey

Gandhi

Zhang

Metzger

Huang

Analyzing the symptoms in colorectal and breast cancer patients with or without type 2 diabetes using EHR data

Health Informatics J202127114604582211000785

10.1177/14604582211000785

33726552

132

Schiappa

Contu

Culie

RUBY: natural language processing of French electronic medical records for breast cancer research

JCO Clin Cancer Inform20220766e2100199

10.1200/CCI.21.00199

35960900

133

Tan

Clarke

Chamie

Development and validation of an automated method to identify patients undergoing radical cystectomy for bladder cancer using natural language processing

Urol Pract20170945365372

10.1016/j.urpr.2016.09.011

37592698

134

Agaronnik

Lindvall

El-Jawahri

Iezzoni

Use of natural language processing to assess frequency of functional status documentation for patients newly diagnosed with colorectal cancer

JAMA Oncol202010161016281630

10.1001/jamaoncol.2020.2708

32880603

135

Afzal

Hussain

Ali Khan

Comprehensible knowledge model creation for cancer treatment decision making

Comput Biol Med20170382119129

10.1016/j.compbiomed.2017.01.010

136

Loda

Krebs

Danhof

Exploration of artificial intelligence use with ARIES in multiple myeloma research

J Clin Med201907987999

10.3390/jcm8070999

31324026

137

Tahabi

Storey

Luo

SymptomGraph: identifying symptom clusters from narrative clinical notes using graph clustering

SAC ’23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing

Mar 27-31, 2023

10.1145/3555776.3577685

138

Percha

Pisapati

Gao

Schmidt

Natural language inference for curation of structured clinical registries from unstructured text

J Am Med Inform Assoc2021122829197108

10.1093/jamia/ocab243

34791282

139

Alba

Gao

Lee

Ascertainment of veterans with metastatic prostate cancer in electronic health records: demonstrating the case for natural language processing

JCO Clin Cancer Inform202109510051014

10.1200/CCI.21.00030

34570630

140

Kersloot

Lau

Abu-Hanna

Arts

Cornet

Automated SNOMED CT concept and attribute relationship detection through a web-based implementation of cTAKES

J Biomed Semantics2019091810114

10.1186/s13326-019-0207-3

31533810

141

Ahmad

Liu

Khan

Jiang

Burhan

BIR: biomedical information retrieval system for cancer treatment in electronic health record using transformers

Sensors (Basel)2023112323239355

10.3390/s23239355

38067736

142

Jamaluddin

Wibawa

Patient diagnosis classification based on electronic medical record using text mining and support vector machine

2021 International Seminar on Application for Technology of Information and Communication (iSemantic)

Sep 18-19, 2021

10.1109/iSemantic52711.2021.9573178

143

Shah

Luo

Kanakasabai

Tuason

Klopper

Neural networks for mining the associations between diseases and symptoms in clinical notes

Health Inf Sci Syst201912711

10.1007/s13755-018-0062-0

30588291

144

Rohanian

Jauncey

Nouriborji

Using bottleneck adapters to identify cancer in clinical notes under low-resource constraints

The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Jun 13, 2023

10.18653/v1/2023.bionlp-1.5

145

Bouvry

Tvardik

Kergourlay

The SYNODOS Project: system for the normalization and organization of textual medical data for observation in healthcare

IRBM201604372109115

10.1016/j.irbm.2016.03.002

146

Rahimian

Warner

Jain

Davis

Zerillo

Joyce

Significant and distinctive n-grams in oncology notes: a text-mining method to analyze the effect of OpenNotes on clinical documentation

JCO Clin Cancer Inform201906319

10.1200/CCI.19.00012

31184919

147

Chen

Xie

Wang

Liu

Hao

A bibliometric analysis of natural language processing in medical research

BMC Med Inform Decis Mak2018032218Suppl 114

10.1186/s12911-018-0594-x

29589569

148

Casey

Davidson

Poon

A systematic review of natural language processing applied to radiology reports

BMC Med Inform Decis Mak2021063211179

10.1186/s12911-021-01533-7

34082729

149

Goff

Loehfelm

Automated radiology report summarization using an open-source natural language processing pipeline

J Digit Imaging201804312185192

10.1007/s10278-017-0030-2

29086081

150

Dong

Suárez-Paniagua

Whiteley

Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation

J Biomed Inform202104116103728

10.1016/j.jbi.2021.103728

33711543

151

Payrovnaziri

Chen

Rengifo-Moreno

Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review

J Am Med Inform Assoc202007127711731185

10.1093/jamia/ocaa053

32417928

152

Roberts

Datta

Deep learning in clinical natural language processing: a methodical review

J Am Med Inform Assoc2020031273457470

10.1093/jamia/ocz200

153

Manning

Raghavan

Schütze

Introduction to Information Retrieval2008

Cambridge University Press

9780511809071

154

Shivade

Raghavan

Fosler-Lussier

A review of approaches to identifying patient phenotype cohorts using electronic health records

J Am Med Inform Assoc201403212221230

10.1136/amiajnl-2013-001935

155

Zhu

Lin

Jiang

Large language model trained on clinical oncology data predicts cancer progression

NPJ Digit Med202581115

10.1038/s41746-025-01780-2

156

Gottlieb

New FDA policies could limit the full value of AI in medicine

JAMA Health Forum202502762e250289

10.1001/jamahealthforum.2025.0289

39913129

157

Van Laere

Muylle

Cornu

Clinical decision support and new regulatory frameworks for medical devices: are we ready for it?—A viewpoint paper

Int J Health Policy Manag20221219111231593163

10.34172/ijhpm.2021.144

34814678

158

Artsi

Sorin

Glicksberg

Challenges of implementing LLMs in clinical practice: perspectives

J Clin Med202509114176169

10.3390/jcm14176169

40943929

159

General Data Protection Regulation (GDPR)

Intersoft Consulting2025-01-05

https://gdpr-info.eu/

160

Health Insurance Portability and Accountability Act (HIPAA)

US Department of Health and Human Services2025-01-05

https://www.hhs.gov/hipaa/index.html

161

Moorthie

Hayat

Zhang

Rapid systematic review to identify key barriers to access, linkage, and use of local authority administrative data for population health research, practice, and policy in the United Kingdom

BMC Public Health202206282211263

10.1186/s12889-022-13187-9

35764951

162

Alsentzer

Publicly available clinical BERT embeddings

Proceedings of the 2nd Clinical Natural Language Processing Workshop ACL Anthology 2019

Jun 7, 2019

10.18653/v1/W19-1909

163

Peng

Yan

Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets

2019

BioNLP 2019—SIGBioMed Workshop on Biomedical Natural Language Processing, Proceedings of the 18th BioNLP Workshop and Shared Task

Aug 1, 2019

Florence, Italy

5865

10.18653/v1/W19-5006

164

Miranda-Escalada

Farré

Krallinger

Named entity recognition, concept normalization and clinical coding: overview of the cantemist track for cancer text mining in spanish, corpus, guidelines, methods and results published online first

Iberian Languages Evaluation Forum 2020

Sep 22-24, 2024

10.5281/zenodo.3773228

165

Agrawal

Hegselmann

Lang

Kim

Sontag

Large language models are few-shot clinical information extractors

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Jun 7-11, 2022

10.18653/v1/2022.emnlp-main.130

166

Keloth

Selek

Chen

Social determinants of health extraction from clinical notes across institutions using large language models

NPJ Digit Med2025051781287

10.1038/s41746-025-01645-8

40379919

167

Ferrara

Should ChatGPT be biased? Challenges and risks of bias in large language models

FM2023

10.5210/fm.v28i11.13346

168

Thirunavukarasu

Ting

DSJ

Elangovan

Gutierrez

Tan

Ting

DSW

Large language models in medicine

Nat Med20230829819301940

10.1038/s41591-023-02448-8

37460753

169

Thirunavukarasu

Hassan

Mahmood

Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care

JMIR Med Educ202304219e46599

10.2196/46599

37083633

170

Kung

Cheatham

Medenilla

Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models

PLOS Digit Health20230222e0000198

10.1371/journal.pdig.0000198

36812645

171

Lin

FedNLP: benchmarking federated learning methods for natural language processing tasks

Findings of the Association for Computational Linguistics

Jul 10-15, 2021

10.18653/v1/2022.findings-naacl.13

172

Mohan

A study on performance limitations in federated learning

Preprint posted online on Jan 7, 2025

10.48550/arXiv.2501.03477

173

Xiang

Gao

Asynchronous federated learning on heterogeneous devices: a survey

Comput Sci Rev20231150100595

10.1016/j.cosrev.2023.100595

Multimedia Appendix 1

Search criteria.

Multimedia Appendix 2

Studies included in the review and variables extracted.

Multimedia Appendix 3

Models for non-English corpora.

Multimedia Appendix 4

Annotation methods for reference corpus. Annotation granularity ranged from the entity or concept level to the patient level, including sentence, document section, and document levels. No information: no description of annotation methods (studies that used existing tools, detailed methods described elsewhere).

Checklist 1

PRISMA-ScR checklist.