Background: Artificial intelligence (AI)–based computer-aided detection (CAD) systems for mammography are starting to be used for breast cancer screening in radiology departments. It is important to understand how AI CAD systems react to benign lesions, especially those that have been subjected to biopsy.
Objective: Our goal was to corroborate the hypothesis that women with previous benign biopsy and cytology assessments would subsequently present increased AI CAD abnormality scores even though they remained healthy.
Methods: This retrospective study applied a commercial AI CAD system (Insight MMG, version 184.108.40.206; Lunit Inc) to a cancer-enriched mammography screening data set of 10,889 women (median age 56, range 40-74 years). The AI CAD system generated a continuous prediction score for tumor suspicion between 0.00 and 1.00, where 1.00 represented the highest level of suspicion. A binary read (flagged or not flagged) was defined on the basis of a predetermined cutoff threshold (0.40). The median AI score and the proportion of flagged examinations were calculated for women who were healthy, those who had a benign biopsy finding, and those who were diagnosed with breast cancer. For women with a benign biopsy finding, the interval between mammography and biopsy was used to stratify AI scores. The effect of increasing age was examined using subgroup analysis and regression modeling.
Results: Of the 10,889 women, 234 had a benign biopsy finding before or after screening. The proportions of flagged examinations were 3.5%, 11%, and 84% for healthy women without a benign biopsy finding, healthy women with a benign biopsy finding, and women with breast cancer, respectively (P<.001). For the 8307 women with complete information, radiologist 1, radiologist 2, and the AI CAD system flagged 8.5%, 6.8%, and 8.5%, respectively, of examinations of women who had a prior benign biopsy finding. The AI score correlated with increasing age only in the cancer group (P=.01).
Conclusions: Compared to healthy women without a biopsy, the examined AI CAD system flagged a much larger proportion of women who had, or would later have, a benign biopsy finding based on a radiologist’s decision. However, the flagging rate was not higher than that of the radiologists. Further research should focus on training AI CAD systems to take prior biopsy information into account.
Breast cancer is the most common cancer among women worldwide. It ranks fifth as a cause of cancer death because of its relatively favorable prognosis, but over the last 20 years, the average annual increase in breast cancer incidence has been 1.4%. Screening programs have clearly been proven to reduce breast cancer mortality. Retrospective studies have shown that outcomes might improve when radiologists combine mammography readings with an artificial intelligence (AI) system for computer-aided detection (CAD). Furthermore, reducing reading time with the assistance of an AI CAD system is possible. An AI CAD system can be highly accurate in reading mammograms, and some systems now perform at a level comparable to the average breast radiologist in detecting breast cancer on screening mammography.
In addition to the well-known risk factors of age, family history, and hormonal history, studies have shown that benign breast disease increases the risk of breast cancer. One study that analyzed risk factors for breast cancer found that having undergone any prior breast procedure was associated with an increased risk of breast cancer. Another study showed that women with false-positive mammography findings were more likely to develop interval cancer or cancer at the next screening than women who were not recalled.
Radiologists performing screen reading normally have access to information about prior biopsies, while AI CAD systems do not take this information into account. In this retrospective study, we analyzed primarily to what extent the malignancy assessments of an AI CAD system are affected by the presence or absence of biopsy-proven benign findings. In a secondary analysis, we determined whether this effect differs between an AI CAD system and radiologists.
This retrospective study was based on a case-control subset of the Cohort of Screen-Aged Women (CSAW). The CSAW is a complete population-based cohort of women aged 40 to 74 years invited to screening in the Stockholm region, Sweden, between 2008 and 2015. The exclusion criteria for the CSAW were a prior history of breast cancer, a diagnosis outside the screening age range, and incomplete mammographic examinations. From the CSAW, a case-control subset was separately defined to contain all women from Karolinska University Hospital, Stockholm, who were diagnosed with breast cancer (n=1303), either at screening or at clinical evaluation during the interval before the next planned screening, and 10,000 randomly selected healthy controls. The purpose of the case-control subset is to make evaluation more efficient by avoiding the processing of an unnecessarily large number of healthy controls while preserving the representativeness of the CSAW screening cohort in which it is nested. Additional exclusion criteria for the current study were having breast implants and receiving a cancer diagnosis later than 12 months after mammography. The study population was divided into 3 groups: cancer, benign biopsy, and normal.
The cancer group was defined as having biopsy-verified breast cancer at screening or within 12 months of screening. The most recent mammographic screening prior to diagnosis was selected for analysis. The benign biopsy group was defined as having had a benign biopsy finding without ever having had breast cancer. The group was further stratified by the interval between biopsy and mammography. The normal group had neither breast cancer nor a prior benign biopsy finding. Women in the screening program who were previously recalled and deemed as having benign disease were also included in this group.
The screening system consisted of double-reading followed by consensus discussion for any flagged examination. The following screening decision data were collected: flagging of abnormal screening by one or both radiologists and the final recall decision after consensus discussion. Screening decisions and clinical outcome data were collected by linking to regional cancer center registers.
AI CAD system
The AI CAD system was Insight MMG (version 220.127.116.11; Lunit Inc). Insight MMG was chosen for this study because it demonstrated superior results in a retrospective analysis published in 2020, which compared 3 AI CAD systems and reported sensitivity and specificity comparable to Breast Cancer Surveillance Consortium benchmarks. Briefly, the AI CAD system was originally trained on 170,230 mammograms from 36,468 women diagnosed with breast cancer and 133,762 healthy controls. The AI CAD system had been validated in previous studies using a deep learning model to triage screening mammograms. The mammograms in the original training set were sourced from 5 institutions: 3 in South Korea, 1 in the United States, and 1 in the United Kingdom. The mammograms were acquired on mammography equipment from GE Healthcare, Hologic, and Siemens and included both screening and diagnostic examinations. The generated prediction score for tumor presence was a decimal number between 0.00 and 1.00, where 1.00 represented the highest level of suspicion. The program assessed 2 images of each breast, and the highest score among the 4 images was selected to represent the examination. To obtain a binary assessment of whether the examination should be flagged for further workup by the AI CAD system, a cutoff point is required: above the cutoff, an examination is considered flagged; below it, not flagged. The cutoff point (0.40; AI abnormality threshold), which defined whether an examination was considered flagged by the AI CAD system, was predefined in a prior study. The cutoff was selected so that the specificity of the AI CAD system matched that of the average radiologist in that study. The examinations in the prior study originated from the same institution and partly overlap with the current data set, which should make the cutoff value transferable to the current setting.
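The examination-level scoring and flagging logic described above can be sketched as follows. This is an illustrative sketch only, with function names of our own choosing; it does not reflect Lunit's actual software interface.

```python
# Illustrative sketch of the examination-level scoring described above.
# Function names and structure are our own; this is not Lunit's interface.

AI_ABNORMALITY_THRESHOLD = 0.40  # predefined cutoff from the prior study


def examination_score(image_scores):
    """Examination-level score: the highest per-image prediction score
    among the (typically 4) views, each between 0.00 and 1.00."""
    if not image_scores:
        raise ValueError("at least one per-image score is required")
    return max(image_scores)


def is_flagged(image_scores, threshold=AI_ABNORMALITY_THRESHOLD):
    """An examination is flagged when its score is above the cutoff."""
    return examination_score(image_scores) > threshold


# Example: two views of each breast; the highest-scoring view drives the result.
scores = [0.02, 0.05, 0.47, 0.11]
print(examination_score(scores))  # 0.47
print(is_flagged(scores))         # True
```

Taking the maximum over views means a single suspicious image is enough to flag the whole examination, mirroring how one suspicious view prompts recall in screen reading.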
The Stockholm-Gotland Regional Cancer Center provided personal identification numbers for all women who fulfilled the inclusion criteria for the CSAW. The identification numbers were linked to the local breast cancer quality register, “Regional Cancercentrum Stockholm-Gotlands Kvalitetsregister för Bröstcancer,” to collect data about breast cancer diagnosis. All diagnoses of breast cancer were biopsy verified. Benign diagnoses were collected from hospital electronic health records. All images were 2D full-field digital mammograms acquired on Hologic mammography equipment. The personal identification numbers were also linked to the radiological image repository to extract all digital mammograms from the Picture Archiving and Communication System.
Statistical analysis was performed per patient and not per lesion. Stata (version 14 or later; StataCorp) was used for statistical analyses. The Wilcoxon rank-sum test and quantile regression analysis were used to examine differences between groups. Differences in medians were tested because of the skewed distribution of AI scores. The required level for statistical significance was not adjusted for multiple comparisons. A value of P<.05 was considered statistically significant.
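As an illustration of this type of group comparison, the following Python/SciPy sketch applies the Wilcoxon rank-sum (Mann-Whitney U) test to synthetic skewed scores. The study itself used Stata; the data, distributions, and parameters below are invented for demonstration only.

```python
# Hedged illustration of the Wilcoxon rank-sum comparison (the study used
# Stata; this SciPy version uses synthetic data, not the study's data).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic beta-distributed scores mimicking skewed 0-1 AI scores:
normal_group = rng.beta(a=1, b=20, size=500)  # mostly very low scores
biopsy_group = rng.beta(a=2, b=10, size=100)  # distribution shifted upward

# Two-sided Wilcoxon rank-sum (Mann-Whitney U) test on per-patient scores
stat, p_value = mannwhitneyu(normal_group, biopsy_group)
print(f"U={stat:.1f}, P={p_value:.3g}, significant at .05: {p_value < 0.05}")
```

A rank-based test is a natural choice here: it compares the two score distributions without assuming normality, which a score bounded on [0, 1] with most mass near 0 would violate.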
The collection and use of the data set were approved by the Swedish Ethical Review Board (2017-02-08), and the need for informed consent was waived (diary number 2016/2600-31).
We evaluated 11,303 women for inclusion in this retrospective case-control study. Of them, 414 women were excluded. The exclusion criteria were no mammographic examination in conjunction with a cancer diagnosis, having breast implants, and receiving a cancer diagnosis more than 12 months after mammography. The cancer group consisted of 917 women, the benign biopsy group comprised 234 women, and the group with no cancer or biopsy (control group) comprised 9738 women.
Of the remaining 10,889 women, 8269 had complete information on radiologist assessments (at the time of data collection, we had received radiologist assessments only until December 31, 2015), which included examinations flagged as potentially pathological by 1 or both radiologists and the final recall decision after consensus discussion. Of those 8269 women, 724 were in the cancer group, 212 in the benign biopsy group, and 7371 in the normal mammography group.
There was a significant difference (P<.001) in AI scores among the cancer, benign biopsy, and normal mammography groups.
| Characteristics | Participants, n | Age (years), median (IQR) | Assessments above the cutoff point, % (n/n) | AIa score, median (IQR) | P value (vs cancer group) | P value (vs normal group) |
|---|---|---|---|---|---|---|
| Normal with benign biopsy | 234 | 51.2 (45.6-53.0) | 11 (26/234) | 0.051 (0.016-0.174) | <.001 | <.001 |
| Cancer | 917 | 60.6 (50.7-66.6) | 83 (768/917) | 0.933 (0.666-0.983) | N/Ab | <.001 |
| Normal without biopsy | 9738 | 55.5 (48.8-65.2) | 3.5 (345/9738) | 0.018 (0.005-0.065) | <.001 | N/A |
aAI: artificial intelligence.
bN/A: not applicable.
The proportion of AI assessments above the cutoff point was 3.5% in the group with normal mammography findings and 83% in the cancer group. In the benign biopsy group, 11% of the AI assessments were above the cutoff point. The distribution of AI scores is shown in Multimedia Appendix 1 for women diagnosed with breast cancer, in Multimedia Appendix 2 for healthy women with a benign biopsy finding, and in Multimedia Appendix 3 for healthy women without a benign biopsy finding who remained healthy.
The age-stratified analysis shows how the AI score was associated with the age of the women. There was a significant increase in the AI score in relation to age category in the cancer group (P<.05). There was no significant increase in the AI score in relation to age in the group with normal mammographic findings or in the group with benign biopsy findings. The median age of the study population was 56 years, and the median AI score was 0.023. The median age of the cancer group was 61 years, and the median AI score was 0.933 (P=.01). The median age of the group with previous benign biopsies was 49 years, and the median AI score was 0.051 (P=.71). For healthy women, the median AI score was 0.018 (P=.40), and the median age was 59 years.
The benign biopsy group was stratified by the interval between biopsy and mammography into 3 categories: 0-6 months, 6-24 months, and more than 24 months. There was no significant difference among the time-stratified categories. Of the 234 participants, 104 had a benign biopsy finding within 6 months after mammography. The proportion of women with AI scores above the threshold was 16% for those with a benign biopsy finding within 6 months after mammography and 33% for those with a benign biopsy finding within 6 months before mammography.
In the group with a benign biopsy finding after mammography, the proportion of abnormal AI assessments (above the cutoff point) was 15%, while the radiologists' consensus recall rate reached 57% for this group. In the combined group of women with normal mammograms and those with benign biopsy findings, the consensus recall rate was 2% and the rate of abnormal AI assessments was 3.8%. For the group with only normal mammograms, the corresponding rates were 1% and 3.6%, respectively. Radiologists and the AI program had similar recall rates for the total study population.
The 2 example screening mammograms were assessed by both the radiologists and the AI cancer detection program; they illustrate concordant and discordant assessments.
P values for the association between AI score and age category: normal mammography, P=.40; benign biopsy (all), P=.71; benign biopsy before mammography, P=.71; benign biopsy after mammography, P=.81; cancer (all), P=.01; cancer with prior benign biopsy findings, P=.54; cancer without benign biopsy findings, P=.01.

| Age group (years) | Normal mammography: n | AI score, median (IQR) | Benign biopsy, all: n | AI score, median (IQR) | Benign biopsy before mammography: n | AI score, median (IQR) | Benign biopsy after mammography: n | AI score, median (IQR) | Cancer, all: n | AI score, median (IQR) | Cancer with prior benign biopsy: n | AI score, median (IQR) | Cancer without benign biopsy: n | AI score, median (IQR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All | 9738 | 0.018 (0.005-0.065) | 234 | 0.051 (0.016-0.174) | 64 | 0.069 (0.018-0.179) | 170 | 0.047 (0.016-0.205) | 917 | 0.933 (0.667-0.983) | 20 | 0.924 (0.747-0.986) | 897 | 0.934 (0.666-0.983) |
| 40-49 | 3152 | 0.021 (0.006-0.068) | 144 | 0.048 (0.016-0.155) | 43 | 0.074 (0.019-0.142) | 101 | 0.042 (0.012-0.176) | 213 | 0.861 (0.272-0.974) | 9 | 0.861 (0.286-0.974) | 204 | 0.864 (0.264-0.975) |
| 50-59 | 2834 | 0.015 (0.005-0.058) | 68 | 0.051 (0.017-0.260) | 16 | 0.039 (0.014-0.113) | 52 | 0.051 (0.017-0.305) | 228 | 0.948 (0.715-0.985) | 3 | 0.911 (0.891-0.950) | 225 | 0.949 (0.709-0.985) |
| 60-69 | 2577 | 0.018 (0.005-0.064) | 22 | 0.078 (0.012-0.174) | 5 | 0.173 (0.090-0.449) | 17 | 0.047 (0.011-0.144) | 379 | 0.935 (0.723-0.984) | 8 | 0.945 (0.887-0.991) | 371 | 0.935 (0.720-0.984) |
| ≥70 | 1175 | 0.020 (0.006-0.077) | 0 | N/Aa | 0 | N/A | 0 | N/A | 97 | 0.958 | | | | |
aN/A: not applicable.
| Timing of biopsy | Participants, n | Age (years), median (IQR) | Assessments above the cutoff point, % (n/n) | AI score, median (IQR) | P value |
|---|---|---|---|---|---|
| Benign biopsy finding before mammography (months) | 64 | | | | |
| 0-6 | 9 | 49.1 (44.5-54.9) | 33 (3/9) | 0.150 (0.099-0.449) | Reference |
| 6-24 | 39 | 48.1 (46.7-52.7) | 5.1 (2/39) | 0.062 (0.016-0.150) | .12 |
| >24 | 16 | 41.0 (40.3-48.4) | 0 (0) | 0.025 (0.015-0.099) | .06 |
| Benign biopsy finding after mammography (months) | 170 | | | | |
| 0-6 | 104 | 48.3 (44.4-52.5) | 16 (17/104) | 0.065 (0.017-0.274) | Reference |
| 6-24 | 44 | 49.0 (45.6-55.3) | 4.5 (2/44) | 0.037 (0.008-0.109) | .36 |
| >24 | 22 | 51.2 (45.6-53.0) | 9 (2/22) | 0.031 (0.017-0.165) | .38 |
| Assessments | Radiologist 1, recall rate, % (n/n) | Radiologist 2, recall rate, % (n/n) | Consensus, recall rate, % (n/n) | Abnormal AI assessments above the cutoff point, % (n/n) |
|---|---|---|---|---|
| Total | 10 (831/8307) | 10 (827/8307) | 9.2 (767/8307) | 11 (880/8307) |
| Normal and benign biopsy findings | 3.6 (274/7583) | 3.1 (234/7583) | 2 (154/7583) | 3.8 (290/7583) |
| Normal | 2.8 (203/7371) | 2.2 (162/7371) | 1 (73/7371) | 3.6 (265/7371) |
| Benign biopsy findings | 33 (71/212) | 34 (72/212) | 38 (81/212) | 12 (25/212) |
| Biopsy before mammography | 8.5 (5/59) | 6.8 (4/59) | 6.8 (4/59) | 8.5 (5/59) |
| Biopsy after mammography | 49 (66/135) | 50 (68/135) | 57 (77/135) | 15 (20/135) |
| Cancer | 77 (557/724) | 82 (593/724) | 85 (613/724) | 81 (590/724) |
| With benign biopsy findings | 75 (12/16) | 88 (14/16) | 88 (14/16) | 81 (13/16) |
| Without benign biopsy findings | 77 (545/708) | 82 (579/708) | 85 (599/708) | 82 (577/708) |
The AI CAD system in this study flagged screening examinations more often for women with benign biopsy findings than for healthy women without biopsies. However, for women with a prior biopsy finding, the flagging rate was similar between the AI and the radiologists, and for women with a biopsy finding after screening, the AI flagging rate was considerably lower.
For women with a previous benign biopsy finding, the AI CAD system's flagging rate (false positives) increased from 3.6% to 8.5% compared to healthy women without a biopsy. In other words, there was a significant difference in AI scores between the normal group and the benign biopsy group, even though both groups consisted of women without breast cancer. This finding raises the question of whether the AI was reacting to biopsy-related tissue alterations visible on the mammograms. This did not seem to be the case, since we found a similar increase in recall rate for the radiologists, from 2.8% to 8.5%. This was unexpected, since the radiologists had access to the outcomes of prior biopsies while the AI did not.
Among women who had a benign biopsy finding after screening, 57% had been recalled by the screening radiologists, whereas applying the AI CAD system in screening would have resulted in a much lower false-positive flagging rate of only 15%. Based on this observation, further research on the role of AI in reducing the number of unnecessary biopsies seems warranted.
A strength of this study is the large number of women with cancer and the fact that all women were sampled from a screening cohort. Another strength is the use of a specific AI algorithm that has already been validated in large cohorts with very positive results. Our data on total recall rates, and specifically those for the cancer group, strengthen the indications from previous studies that AI-based cancer detectors can be reliable enough to be incorporated into a screening setting.
The main limitation is the relatively small number of benign biopsies, which makes it difficult to consider the effect of different types of benign lesions. Another limitation is the study’s retrospective setting. Since the AI program did not have the opportunity to make recalls and choose women for further diagnostic biopsy, it could not influence who received a biopsy after screening, and all decisions about benign biopsies were based on radiologists’ assessments. In contrast to radiologists, the AI program calculates a score for the likelihood of breast cancer based on the image alone and does not consider any information about symptoms given by the woman at screening.
Furthermore, in this study, we did not consider the exact location of the presumed abnormality for which the AI program produced a high AI score. Further analysis of the data could be valuable to evaluate whether the lesions highlighted by the AI corresponded to the actual findings for which the women were recalled.
In conclusion, the tested AI CAD system had an increased flagging rate of 8.5% for women with a prior benign biopsy finding; this rate was not higher than that of radiologists, who often have access to information about prior benign biopsy findings. Further research and development might focus on improving AI CAD systems by taking information about prior benign biopsy findings into account.
Conflicts of Interest
FS received speaker fees from Lunit and Pfizer and is a shareholder in ClearScanAI.
Artificial intelligence (AI) score distribution for the cancer group (n=917). DOCX File, 16 KB
Artificial intelligence (AI) score distribution for the benign biopsy group (n=234). DOCX File, 16 KB
Artificial intelligence (AI) score distribution for the normal group (n=9738). DOCX File, 16 KB
- Wigzell K, Rosen M. Socialstyrelsen - The National Board of Health and Welfare/Centre for Epidemiology - Foreword. Scand J Public Health 2001:1
- Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer 2015 Mar 01;136(5):E359-E386 [https://onlinelibrary.wiley.com/doi/10.1002/ijc.29210] [CrossRef] [Medline]
- Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2021 May 04;71(3):209-249 [https://onlinelibrary.wiley.com/doi/10.3322/caac.21660] [CrossRef] [Medline]
- Nyström L, Rutqvist LE, Wall S, Lindgren A, Lindqvist M, Rydén S, et al. Breast cancer screening with mammography: overview of Swedish randomised trials. Lancet 1993 Apr 17;341(8851):973-978 [CrossRef] [Medline]
- Marmot MG, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. Br J Cancer 2013 Jun 11;108(11):2205-2240 [https://europepmc.org/abstract/MED/23744281] [CrossRef] [Medline]
- Long H, Brooks JM, Harvie M, Maxwell A, French DP. Correction: How do women experience a false-positive test result from breast screening? A systematic review and thematic synthesis of qualitative studies. Br J Cancer 2021 Sep 30;125(7):1031-1031 [https://europepmc.org/abstract/MED/34331024] [CrossRef] [Medline]
- Rodríguez-Ruiz A, Krupinski E, Mordang J, Schilling K, Heywang-Köbrunner SH, Sechopoulos I, et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology 2019 Feb;290(2):305-314 [CrossRef] [Medline]
- Schaffter T, Buist DSM, Lee CI, Nikulin Y, Ribli D, Guan Y, the DM DREAM Consortium, et al. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open 2020 Mar 02;3(3):e200265 [https://europepmc.org/abstract/MED/32119094] [CrossRef] [Medline]
- Salim M, Wåhlin E, Dembrower K, Azavedo E, Foukakis T, Liu Y, et al. External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol 2020 Oct 01;6(10):1581-1588 [https://europepmc.org/abstract/MED/32852536] [CrossRef] [Medline]
- Rodriguez-Ruiz A, Lång K, Gubern-Merida A, Teuwen J, Broeders M, Gennaro G, et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol 2019 Sep 16;29(9):4825-4832 [https://europepmc.org/abstract/MED/30993432] [CrossRef] [Medline]
- Dembrower K, Wåhlin E, Liu Y, Salim M, Smith K, Lindholm P, et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. Lancet Digit Health 2020 Sep;2(9):e468-e474 [CrossRef]
- Rodriguez-Ruiz A, Lång K, Gubern-Merida A, Broeders M, Gennaro G, Clauser P, et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J Natl Cancer Inst 2019 Sep 01;111(9):916-922 [https://europepmc.org/abstract/MED/30834436] [CrossRef] [Medline]
- Dumitrescu RG, Cotarla I. Understanding breast cancer risk -- where do we stand in 2005? J Cell Mol Med 2005 Jan;9(1):208-221 [https://europepmc.org/abstract/MED/15784178] [CrossRef] [Medline]
- Hartmann LC, Sellers TA, Frost MH, Lingle WL, Degnim AC, Ghosh K, et al. Benign breast disease and the risk of breast cancer. N Engl J Med 2005 Jul 21;353(3):229-237 [CrossRef]
- Barlow W, White E, Ballard-Barbash R, Vacek P, Titus-Ernstoff L, Carney P, et al. Prospective breast cancer risk prediction model for women undergoing screening mammography. J Natl Cancer Inst 2006 Sep 06;98(17):1204-1214 [CrossRef] [Medline]
- McCann J, Stockton D, Godward S. Impact of false-positive mammography on subsequent screening attendance and risk of cancer. Breast Cancer Res 2002 Oct 1;4(5):R11 [https://breast-cancer-research.biomedcentral.com/articles/10.1186/bcr455] [CrossRef] [Medline]
- Dembrower K, Lindholm P, Strand F. A multi-million mammography image dataset and population-based screening cohort for the training and evaluation of deep neural networks-the Cohort of Screen-Aged Women (CSAW). J Digit Imaging 2020 Apr 13;33(2):408-413 [https://europepmc.org/abstract/MED/31520277] [CrossRef] [Medline]
- Lehman CD, Arao RF, Sprague BL, Lee JM, Buist DSM, Kerlikowske K, et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology 2017 Apr;283(1):49-58 [https://europepmc.org/abstract/MED/27918707] [CrossRef] [Medline]
- Yala A, Schuster T, Miles R, Barzilay R, Lehman C. A deep learning model to triage screening mammograms: a simulation study. Radiology 2019 Oct;293(1):38-46 [CrossRef] [Medline]
|AI: artificial intelligence|
|CAD: computer-aided detection|
|CSAW: Cohort of Screen-Aged Women|
|RCC: Regional Cancer Center|
Edited by K El Emam, G Eysenbach; submitted 12.04.23; peer-reviewed by A Karakatsanis, N Mungoli; comments to author 07.05.23; revised version received 17.06.23; accepted 03.08.23; published 31.08.23.
Copyright
©Athanasios Zouzos, Aleksandra Milovanovic, Karin Dembrower, Fredrik Strand. Originally published in JMIR AI (https://ai.jmir.org), 31.08.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.