Preventing Delayed and Missed Care by Applying Artificial Intelligence to Trigger Radiology Imaging Follow-up
Abstract
Medical diagnostic imaging studies frequently detect findings that require further evaluation. An initiative at Northwestern Medicine was designed to prevent delays and improve outcomes by engineering reliable follow-up of radiographic findings. An artificial intelligence natural language processing (NLP) system was developed to identify radiology reports containing lung- and adrenal-related findings requiring follow-up. Over 13 months, more than 570,000 imaging studies were screened, of which more than 29,000 were flagged as containing lung-related follow-up recommendations, representing a 5.1% occurrence rate of lung-related findings on relevant imaging studies and an average of 70 findings flagged per day. Northwestern’s prospective clinical validation of the system, the first of its kind, demonstrated a sensitivity of 77.1%, specificity of 99.5%, and positive predictive value of 90.3% for lung findings requiring follow-up. To date, the workflow has generated nearly 5,000 interactions with ordering physicians and has tracked more than 2,400 follow-ups to completion. The authors conclude that NLP demonstrates significant potential to improve reliable follow-up of imaging findings and, thus, to reduce preventable morbidity in lung pathology and other high-risk and problem-prone areas of medicine.
An early article in the patient safety literature was titled with a passionate plea: “I wish I had seen this test result earlier!”1 Additional work has highlighted the dangers of delayed recognition and intervention2 to protect patient health, as when an imaging or laboratory study calls to attention a finding of potential concern.3 Sometimes the results are returned after the patient has left the hospital or ED4; often, they are buried as “incidental” findings, and the busy clinician overlooks them in a brisk check of the results for what was specifically sought in the study.5,6 These errors are devastating to patient and clinician alike. The clinician’s plaintive cry for timely and visible information, vital to directing the patient to the right care, continues to echo nearly 20 years later.7,8
Medical imaging is an important and widely used diagnostic tool,9 with computed tomography (CT) scans and MRI scans performed at respective annual utilization rates of 245 and 118 per 1,000 population in the United States.10 Radiologists frequently recommend additional follow-up studies on the basis of their interpretations, with such recommendations appearing in 6% to 20% of radiology reports.11-14 While some of these findings reflect exactly what the ordering physician sought to learn, others are incidentalomas, or findings not relevant to the original study’s purpose. In a busy clinician’s workflow, follow-up recommendations can be missed.
While many findings are benign, those representing potential malignancies require follow-up imaging or clinical workup to enable timely diagnosis and treatment.15,16 However, adherence to follow-up recommendations is lacking, with published follow-up rates ranging from one-half to two-thirds,12,13,17-20 potentially contributing to suboptimal patient outcomes. While patient factors and social determinants may prevent appropriate follow-up,21 clear documentation and tracking of follow-up recommendations in radiology reports represent a controllable area of improvement. Studies have shown an association between clear communication of follow-up recommendations by radiologists and increased follow-up adherence,21-23 and quality improvement initiatives to manually oversee the documentation of incidental findings have successfully increased adherence to follow-up recommendations.24
Considering these clinical needs, artificial intelligence (AI) is well suited to the detection and reporting of follow-up recommendations because of the large volume of imaging studies requiring screening and the relatively standardized language employed by radiologists in preparing reports. Natural language processing (NLP) methods, including text pattern-matching25-28 and traditional machine-learning techniques,12,29-31 have been developed for this task. In this article, we use the term traditional machine learning to refer to all machine-learning methods that are not deep learning, and these terms will be defined in detail in the sections that follow. More recently, novel deep-learning methods for NLP have shown great promise for the detection of follow-up recommendations.32-35 However, methods reported to date are limited by the size of the data sets used for model training, as well as by a lack of prospective evaluation and implementation in clinical settings.
The Health System Initiative
Northwestern Medicine (NM) is an integrated academic health system based in Chicago, comprising more than 4,800 physicians practicing across 11 hospitals and serving more than 1.3 million patients annually (Appendix, Exhibit A). As part of ongoing quality improvement efforts at NM, a risk assessment was performed in 2018 to identify potential priorities for patient safety. One area of improvement identified by this effort was communication of diagnostic imaging results to physicians and patients. In particular, there existed no consistent health system–wide method of identifying findings requiring follow-up in radiology reports and tracking management of these findings to completion. In light of this, a Result Management initiative was chartered in August 2018 to create a streamlined, closed-loop system to ensure prompt and reliable identification and tracking of diagnostic imaging results requiring follow-up.
The team included staff with expertise in Radiology, Quality, Patient Safety, Process Improvement, Primary Care, Nursing, and Informatics, as well as other stakeholders. We first surveyed prior work on this problem within our health system. Individual radiology departments had initiated efforts to standardize finding documentation among radiologists by using templated language, but these were adopted inconsistently. Moreover, most of the NM hospitals had not built a formal system to track findings; however, one group had created a well-developed solution relying on manual reporting by radiologists and monitoring of follow-up by a dedicated team.
To inform our design process for the new Result Management system, we considered limitations in the existing systemwide workflow. First, adoption among radiologists was limited. A common concern was the burden to the clinical workflow, because most radiologists work primarily in the picture-archiving and communication system rather than the electronic health record (EHR) itself. Safety solutions cannot achieve high reliability when they rely on extra steps and vigilance to remember to open an adjunct system and click another item. The potential for scalability was also limited, because it would require deployment of the two-step workflow across the health system and training and reinforcement to ensure consistent adoption among radiologists. Finally, because it was burdensome, this workflow was used only to track follow-up of incidentally encountered findings. Yet, expected findings on imaging studies also often demand follow-up, and we sought to reduce variation and improve reliable attention to these findings as well.
We decided to develop an EHR-integrated NLP system to automatically identify radiographic findings requiring follow-up (Figure 1).
Figure 1
Delegating the identification of findings requiring follow-up to AI provides a scalable solution that does not require any change to the radiologist workflow. In this new system, launched in December 2020 across the shared NM EHR, relevant signed radiology reports are screened by the NLP system in real time, and a Best Practice Advisory (BPA) is generated to alert the ordering physician and present a workflow in which follow-up studies can be ordered. Through multiple improvement cycles, additional safety layers were established to provide manual oversight of system performance and track follow-up completion.
Two types of findings, lung and adrenal, were included on the basis of input from physician stakeholders and their consensus review of existing guidelines for the management of incidentally discovered radiographic findings. Lung-related findings, whether the focus of the study or incidentally discovered, are among the most commonly encountered radiographic findings that require additional follow-up.36 Given the anticipated high volume of lung findings and the structured clinical approach guiding their follow-up, we considered detection of lung findings requiring follow-up to be a realistic and impactful domain for clinical implementation of this system. In contrast, adrenal findings have a much lower incidence.27 Because the performance of deep-learning models depends directly on the volume of data available for training, we targeted adrenal findings to push the limits of the system and to highlight the challenges to overcome for future expansion to other findings.
This quality improvement effort was chartered through our Quality Management program and did not require review by the Institutional Review Board. A multidisciplinary team, including health system leadership, clinical experts, EHR analysts, and health informatics experts, was assembled to develop the technology. The design phase of the project, including initial data exploration and preliminary modeling, was initiated in August 2018 and had progressed to model development and data acquisition by January 2020 (Figure 2). The Result Management system was deployed in the EHR in December 2020, with continued monitoring and model updating since then.
Figure 2
Model Prototyping
We began by prototyping various NLP approaches to understand the scope of the problem and to inform our initial approach, starting with regular expressions (regex). Regex patterns are manually defined representations of word sequences of interest that can be used to identify radiology report text via pattern matching. Because of its relative simplicity and ease of implementation, the regex method provides a useful baseline for the evaluation of future approaches. We obtained an initial corpus of 200 radiology reports from our institution and annotated these for any findings requiring follow-up. On the basis of this corpus, 14 regex patterns were developed in an iterative design process to capture both the finding description and the follow-up recommended by the radiologist. A clinical expert validated that all reports containing actionable findings were identified with 100% sensitivity and specificity. All regex patterns used are provided in the Appendix, Exhibit B.
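To give a sense of this style of approach, the minimal sketch below applies a pair of illustrative patterns to sample impression text. The patterns and examples are hypothetical and are not the validated patterns listed in the Appendix, Exhibit B.

```python
import re

# Illustrative patterns only (hypothetical); the validated production patterns
# are listed in the Appendix, Exhibit B and are not reproduced here.
FOLLOWUP_PATTERNS = [
    re.compile(r"recommend(?:ed)?\s+(?:a\s+)?(?:follow[- ]?up|repeat)\s+(?:chest\s+)?ct", re.IGNORECASE),
    re.compile(r"(?:pulmonary|lung)\s+nodule.{0,80}?follow[- ]?up", re.IGNORECASE | re.DOTALL),
]

def flag_report(impression: str) -> bool:
    """Return True if any pattern matches, i.e., a follow-up recommendation is suspected."""
    return any(p.search(impression) for p in FOLLOWUP_PATTERNS)

print(flag_report("6 mm right upper lobe pulmonary nodule. Recommend follow-up chest CT in 6-12 months."))
print(flag_report("No acute cardiopulmonary abnormality."))
```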
However, because of the endless diversity of ways in which radiologists may document a finding and then recommend a follow-up, the regex method was deemed too brittle a solution for the problem. When we evaluated the regex approach on a larger data set comprising 10,916 labeled radiology reports containing 1,857 findings, the sensitivity and specificity fell to 74% and 82%, respectively, with an overall accuracy of 77% and positive predictive value (PPV) of 45% (Appendix, Exhibit C). Because it is a text search method, minor discrepancies such as misspellings or varied word placement may render relevant findings invisible to regex patterns, resulting in false negatives. False positives may also occur if additional language is used to qualify the strength of the follow-up recommendation. Additionally, documentation practices may systematically change with time, and uncommon types of findings may not be sufficiently accounted for during regex pattern development. Considering that regex patterns may easily miss findings not anticipated during regex development, as well as the inherent difficulty of scaling up the regex approach, we opted to explore more sophisticated methods for this task.
Next, we evaluated various machine-learning methods, which automate the process of learning problem-specific features and are thus better suited to handling the inherent variability in radiology reports. To determine the model best suited to detection of follow-up recommendations, we performed initial modeling on the annotated corpus of 200 radiology reports. We started with traditional NLP machine-learning methods such as logistic regression, using the Bag-of-Words method to convert our data from text to tabular data. Bag of Words is a technique that counts the number of times each word from a selected vocabulary appears in the text sample37,38 and therefore represents text as a numeric vector that can be fed to a model (Figure 3).
Figure 3
However, a major disadvantage of this technique — and with traditional machine-learning NLP models in general — is that vectorizing model input data in this way disregards the order of words in a given text. This results in less informative input data for a model to make predictions on and significantly hampers performance.39
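As an illustration of this vectorization step, the sketch below builds a bag-of-words representation with scikit-learn and fits a logistic regression on a few toy impressions. The examples, labels, and hyperparameters are invented for demonstration and are not drawn from the study data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled impressions (1 = follow-up recommended); invented for illustration.
texts = [
    "8 mm pulmonary nodule, recommend follow-up chest ct in 6 months",
    "no acute cardiopulmonary process",
    "indeterminate lung nodule, short interval follow-up advised",
    "lungs are clear bilaterally",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()        # bag of words: word counts, word order discarded
X = vectorizer.fit_transform(texts)   # sparse document-term matrix
clf = LogisticRegression().fit(X, labels)

new_report = ["spiculated nodule in the right lower lobe, recommend follow-up ct"]
print(clf.predict(vectorizer.transform(new_report)))
```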
We saw slightly improved performance using LightGBM40 and XGBoost,41 widely used traditional machine-learning models that use gradient boosting. Gradient boosting is a machine-learning technique that builds an ensemble model out of many small, specialized models. Typically, an ensemble of decision trees is built one by one, with each new tree trained to correct the errors made by the ensemble built so far. This process is repeated to yield an ensemble of highly specialized decision trees that make a strong predictive model when used together.
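The sketch below continues the toy bag-of-words example above (reusing X, labels, vectorizer, and new_report) with an XGBoost classifier, whose ensemble of shallow trees is added one boosting round at a time. The hyperparameters are illustrative, not those used in the study.

```python
from xgboost import XGBClassifier

booster = XGBClassifier(
    n_estimators=200,     # number of small trees added one by one
    max_depth=3,          # keep each tree shallow ("weak" learner)
    learning_rate=0.1,    # shrink each tree's contribution to the ensemble
    eval_metric="logloss",
)
booster.fit(X, labels)
print(booster.predict_proba(vectorizer.transform(new_report))[:, 1])  # estimated P(follow-up)
```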
This trend of improved performance with increased model complexity emphasizes the benefits of selecting machine-learning architectures that are well tailored to the problem domain of interest. Compared with other forms of clinical data, such as images or laboratory values, radiology reports are relatively unstructured and variable in nature, necessitating use of models that take flexible inputs and are sufficiently complex to handle the abstractions of written text. Recent advances in deep-learning techniques have substantially improved the capacity of machine-learning models to perform NLP tasks,42 presenting a significant opportunity for us. Deep learning is a type of machine learning defined by the use of representation learning, in which the model learns features and representations of input data as a part of the training process. This is particularly beneficial for NLP tasks, because an understanding of language requires knowledge of contextual information in addition to vocabulary.
On the basis of comparisons of model performance on our initial data set, we decided to proceed with model development using a type of deep-learning architecture called bidirectional long short-term memory (BiLSTM). BiLSTMs are a form of recurrent neural network, which is a deep-learning architecture that operates on sequentially organized data such as text, preserving the position information for each word.43 A BiLSTM processes input data in both the forward and backward directions, enabling it to learn word dependencies in a text corpus (Figure 4).
Figure 4
Use of the BiLSTM model, which has been shown to outperform traditional machine-learning methods on benchmark NLP tasks, including text classification, allowed us to capitalize on performance improvements of deep learning.
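A minimal Keras sketch of a BiLSTM classifier of this kind is shown below. The vocabulary size, sequence length, and layer sizes are placeholders, not the production configuration.

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20_000, 100, 300   # placeholder values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),            # token IDs -> word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),     # reads the report forward and backward
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),               # P(report contains a follow-up recommendation)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.summary()
```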
To further take advantage of advances in deep-learning strategies, we adopted another NLP technique called word embedding. To prepare text for use with machine-learning algorithms, individual words or word fragments must first be converted into a numeric machine-readable format. Word embeddings are created by using deep learning in an unsupervised manner to analyze large databases of text, ultimately creating high-dimensional vector representations of words such that similar words are close to each other in the vector space. We evaluated two word embeddings: GloVe, which was pretrained using Wikipedia and Gigaword as sources of text,44 and BioWordVec, which was trained using a text corpus derived from PubMed and Medical Subject Headings data.45 We decided to use GloVe, which yielded improved performance compared with BioWordVec, despite the focus of the latter on biomedical text. We believe that GloVe’s much more extensive training set, despite not focusing exclusively on biomedical text, afforded us better results purely on the basis of its larger size.
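The sketch below shows one common way to load pretrained GloVe vectors into an embedding matrix keyed by a tokenizer's word index; the file name and dimensionality are assumptions (the publicly distributed glove.6B.100d.txt file is used for illustration).

```python
import numpy as np

def build_embedding_matrix(word_index, glove_path="glove.6B.100d.txt", dim=100):
    """Map each word in the tokenizer's vocabulary to its pretrained GloVe vector;
    words absent from GloVe remain zero vectors."""
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype="float32")
    matrix = np.zeros((len(word_index) + 1, dim))   # row 0 reserved for padding
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# The matrix can then initialize the Embedding layer of the BiLSTM, e.g.:
# tf.keras.layers.Embedding(matrix.shape[0], dim,
#                           embeddings_initializer=tf.keras.initializers.Constant(matrix))
```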
Data Acquisition
Deep-learning models require large amounts of training data to optimize performance, particularly when the outcome of interest, such as a follow-up recommendation, accounts for a relatively small proportion of the total volume of data. To facilitate model development with a representative sample of radiology reports, we used the Enterprise Data Warehouse (EDW), a computer database independently maintained by NM drawing from a variety of clinical systems, such as our EHR. We queried all radiology reports generated from June 2018 to June 2019 that were associated with imaging studies that could potentially contain a lung- or adrenal-related follow-up recommendation. The list of these potential studies was derived by a panel of clinical experts on the basis of the anatomic location of each scan and included predominantly CT, X-ray (XR), and MRI scans (Table 1). The very few excluded studies pertained to anatomy unrelated to this project, such as XRs of the foot.
Table 1
Imaging modality | Total number of reports | Number of reports with lung follow-up recommendation (%) | Number of reports with adrenal follow-up recommendation (%) |
---|---|---|---|
CT abdomen pelvis with contrast | 8,138 | 1,187 (14.6) | 171 (2.1) |
XR chest PA LAT | 5,303 | 453 (8.5) | 0 (0) |
XR chest AP portable | 5,080 | 217 (4.3) | 0 (0) |
CT chest abdomen pelvis with contrast | 2,192 | 347 (15.8) | 22 (1.0) |
CT abdomen pelvis without contrast | 1,951 | 280 (14.4) | 29 (1.5) |
CT chest without contrast | 1,859 | 1,138 (61.2) | 11 (0.6) |
CT chest with contrast | 1,398 | 647 (46.3) | 24 (1.7) |
CT abdomen pelvis spiral for stone | 1,030 | 85 (8.3) | 14 (1.4) |
CTA chest, pulmonary embolism protocol | 978 | 354 (36.2) | 13 (1.3) |
XR abdomen AP | 658 | 3 (0.5) | 1 (0.2) |
This table shows the number of lung and adrenal follow-up recommendations in the top 10 most common imaging studies that were included in the final enriched dataset used for model development and validation. CT = computed tomography, XR = X-ray, PA LAT = posterior to anterior, lateral, AP = anterior to posterior, CTA = computed tomography angiography. Source: The authors
We elected to exclude reports associated with mammography and fetal ultrasounds. Mammography is specifically used to detect cancer, and fetal ultrasounds are performed regularly throughout the course of maternal care; thus, both of these exist within well-established clinical frameworks for regular follow-up and management of relevant findings, reducing the need for our initiative. Stratified sampling across imaging modalities was then performed on this data pull, yielding an initial data set of 33,283 radiology reports for annotation. The Findings and Impressions sections of the reports were separated out and reserved to create a database of reports for model development.
Data Annotation Framework
Having obtained the requisite quantity of data for model training, our next task was to annotate it for relevant findings and recommended follow-ups. Third-party services exist for data annotation but have not been validated for this use and are of uncertain applicability to medical data because of the understanding of terminology and clinical context required to identify follow-up recommendations in radiology reports.46 Given the need for accuracy in the clinical setting, we decided to keep the entire annotation process in house to ensure control over the annotator training process, as well as over the quality and timeliness of annotations. Creating a high-quality data set for model development is crucial because any inaccurately labeled data will propagate inaccuracies through the training and evaluation stages of model development, ultimately resulting in degraded clinical performance.
We set up an online system on the internal NM network using the open-source INCEpTION platform for annotation of semantic phenomena47 in which trained clinical nurse annotators labeled curated radiology reports for relevant information (Figure 5).
Figure 5
We enlisted the help of nurses for annotation given their level of clinical expertise, as well as to make the most of a standing cohort of staff placed on light-duty restrictions and unable to fulfill their regular clinical responsibilities. Annotations could be completed by staff working remotely, which was particularly advantageous during the Covid-19 pandemic.
All annotators participated in a standardized training process developed by project leaders. For each report, an annotator specified whether or not a finding requiring follow-up was present. If a finding was present, annotators specified the finding as lung or adrenal related and selected the corresponding recommendation text. Each report was labeled independently by two annotators, and a clinical expert reconciled reports with conflicting labels. Through our platform, we were able to track annotator productivity and accuracy to optimize the annotation process. Annotations were sampled periodically and validated by trained clinical experts to verify annotator performance and identify potential training gaps.
Partway through the annotation process, we recognized that the paucity of radiology reports containing adrenal findings and follow-ups would limit our ability to develop high-performing models. When the ratio of majority-class data (reports without findings) and minority-class data (reports with findings) is lopsided and there are few minority-class examples to learn from (a problem in machine learning termed class imbalance), the model’s ability to generalize to unseen minority-class examples suffers. While no broadly accepted rule of thumb exists that defines what constitutes balanced and imbalanced data, the telltale sign of too few examples in a class is that the model ignores this class entirely. To resolve this, one adds more data of that class until the model stops ignoring the minority class.
Two approaches were used to supplement the training data with more radiology reports that included adrenal-related follow-up recommendations. First, we directly identified radiology reports with an associated International Classification of Diseases code relevant to adrenal findings that could require follow-up. Additionally, we trained a simple XGBoost classifier on already-annotated data to identify adrenal follow-up recommendations with high sensitivity. Reports identified using these strategies were included as an adrenal-rich sample for annotation. The need for enrichment highlights the importance of early and regular evaluation of model performance throughout the annotation process, rather than deferring these activities until the data set is complete.
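The second enrichment strategy can be sketched as follows: a simple classifier scores unlabeled reports, and a deliberately permissive threshold selects an adrenal-rich sample for annotation. The features, labels, and threshold below are synthetic placeholders, not the actual screening model.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Toy stand-ins: bag-of-words counts for 1,000 annotated reports with a rare (~2%)
# adrenal follow-up label, mimicking the class imbalance described above.
X_annotated = rng.integers(0, 3, size=(1000, 50))
y_adrenal = (rng.random(1000) < 0.02).astype(int)

screen = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
screen.fit(X_annotated, y_adrenal)

# Score a pool of unlabeled reports and keep anything above a permissive threshold,
# trading precision for sensitivity so that few adrenal candidates are missed.
X_unlabeled = rng.integers(0, 3, size=(5000, 50))
probs = screen.predict_proba(X_unlabeled)[:, 1]
adrenal_rich_sample = np.where(probs > 0.05)[0]   # indices queued for human annotation
print(len(adrenal_rich_sample))
```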
Ultimately, 36,385 reports were annotated using this system, of which 5,779 (15.9%) contained a lung follow-up recommendation and 409 (1.1%) contained an adrenal follow-up recommendation. Reports containing no findings, lung-related findings, and adrenal-related findings were annotated in an average of 69.8, 93.0, and 165.5 seconds, respectively. Our annotation system has continued to operate for this project and others and, to date, has accrued nearly 400,000 annotated reports spanning the work of 95 annotators over more than 21,000 annotator-hours. Annotated reports were then used for model development, and the same system was used to perform prospective validation and performance monitoring.
Model Development Process
Before developing the NLP models themselves, some preprocessing steps were needed to prepare our data set for the model training process. First, reports were preprocessed to remove extraneous white space. This increases consistency within the data set. All text was also converted to lowercase to meet the GloVe embedding specification. Lowercasing text substantially decreases word-embedding complexity during training and usage but may reduce the information content of capitalized words such as abbreviations; however, this tradeoff did not appear to adversely affect model performance during our initial evaluation. Reports then underwent tokenization, which converts textual data into a numeric machine-readable format that can be used as input to NLP models. This is done by using a dictionary that maps common words or word fragments to unique integer tokens (Figure 3). At this point, we also excluded the 68 reports (0.2%) that contained both lung and adrenal finding annotations to simplify the model training process.
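These preprocessing steps can be illustrated with the minimal sketch below; the sample reports, vocabulary size, and sequence length are placeholders, and TextVectorization is shown as one common way to implement the token dictionary.

```python
import re
import tensorflow as tf

reports = [
    "IMPRESSION:   6 mm left lower lobe nodule.\n  Recommend follow-up chest CT in 12 months.",
    "Impression: No acute cardiopulmonary abnormality.",
]

# Collapse extra whitespace and lowercase to match the uncased GloVe vocabulary.
cleaned = [re.sub(r"\s+", " ", r).strip().lower() for r in reports]

# Map each word to an integer token and pad/truncate to a fixed length.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20_000, output_sequence_length=300)
vectorizer.adapt(cleaned)        # builds the word -> integer dictionary
tokens = vectorizer(cleaned)     # integer tensor of shape (number of reports, 300)
print(tokens.shape)
```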
We elected to develop four separate NLP models as part of a pipeline to identify and classify findings with follow-up recommendations (Figure 6).
Figure 6
In the first stage of radiology report screening, the Finding/No Finding BiLSTM model classifies each radiology report as containing or not containing a finding with associated follow-up recommendation. If no follow-up recommendation is detected, no further action is taken. In contrast, if a finding is detected, the report is passed to two models working in parallel. One is an XGBoost model that performs comment extraction to identify the portion of the radiology report containing the relevant finding and recommended follow-up. This model is trained to predict the probability that any given sentence in a radiology report contains a follow-up recommendation and provides the sentence with the maximal predicted probability as output. The other model is a BiLSTM model, which classifies the finding as being either lung related or adrenal related. If the finding is lung-related, then a final BiLSTM model classifies the recommended follow-up procedure as a chest CT or other procedure. Adrenal-related findings, however, frequently require clinical and biochemical evaluation to dictate the need for further imaging and instead trigger a recommended referral to endocrinology. Model development was performed in Python using TensorFlow with the Keras deep-learning library. Code used to train and run the models is available on GitHub.
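A compact sketch of this cascade is shown below, with stub scores standing in for the trained models; the names, interfaces, and thresholds are assumptions made so the control flow runs end to end.

```python
# Stub scores standing in for the trained models (hypothetical interfaces).
def finding_score(text):     return 0.9                  # P(report contains a follow-up recommendation)
def lung_vs_adrenal(text):   return 0.8                  # P(finding is lung related)
def chest_ct_score(text):    return 0.7                  # P(recommended follow-up is a chest CT)
def extract_comment(text):   return text.split(". ")[0]  # sentence most likely to contain the recommendation

def screen_report(impression: str) -> dict:
    """Cascade mirroring Figure 6: finding detection, then parallel comment extraction
    and lung/adrenal classification, then procedure classification for lung findings."""
    result = {"finding": False, "finding_type": None, "followup_ct_chest": None, "comment": None}
    if finding_score(impression) < 0.5:
        return result                                     # no follow-up recommendation detected
    result["finding"] = True
    result["comment"] = extract_comment(impression)
    result["finding_type"] = "lung" if lung_vs_adrenal(impression) >= 0.5 else "adrenal"
    if result["finding_type"] == "lung":
        result["followup_ct_chest"] = chest_ct_score(impression) >= 0.5
    # Adrenal findings trigger an endocrinology referral rather than a specific imaging order.
    return result

print(screen_report("Recommend follow-up chest CT in 6 months for 7 mm nodule. Otherwise unremarkable."))
```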
Models were trained using an 80/20 train/validation split and fivefold cross-validation. Internal validation was performed on a holdout test set of 10,916 annotated reports, and further validation was also performed prospectively (see Prospective Clinical Evaluation section). To ensure consistency in training and validation, data set splitting was performed in the same manner during the development of all models, using a fixed random seed. The accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity were calculated for each classification task along with 95% confidence intervals (CIs). For comparison, we evaluated the baseline regex approach on the same validation data set.
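A generic sketch of computing these metrics with bootstrap confidence intervals is shown below; the study's exact confidence interval method is not specified here, and the toy labels and scores are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate(y_true, y_prob, threshold=0.5, n_boot=1000, seed=0):
    """Point estimates plus a bootstrap 95% CI for AUC (one generic approach)."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    metrics = {
        "auc": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
    rng = np.random.default_rng(seed)
    boot_auc = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue                      # a resample must contain both classes for AUC
        boot_auc.append(roc_auc_score(y_true[idx], y_prob[idx]))
    metrics["auc_95ci"] = tuple(np.percentile(boot_auc, [2.5, 97.5]))
    return metrics

print(evaluate([0, 0, 1, 1, 0, 1, 0, 1], [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]))
```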
To establish a “human best practice” performance target, we assessed the accuracy, AUC, sensitivity, and specificity of detection performed by a high-performing clinical annotator, comparing this individual’s annotations with reports annotated by multiple experts and finalized by majority decision. This best practice threshold AUC was set at 0.94; on a range of 0.5 to 1.0, an AUC of 0.8 to 0.9 is considered excellent, with 1.0 being perfect.
In the Appendix, Exhibit C summarizes the performance of the NLP models. The Finding/No Finding model identified radiology reports containing a follow-up finding with a weighted AUC of 0.91 (95% CI 0.90–0.92), indicating comparable performance to the high-performing clinical annotator for whom an AUC of 0.94 had been calculated for detecting follow-up recommendations. Additionally, we assessed the performance of the models in conjunction by evaluating the Lung/Adrenal model on the outputs of the Finding/No Finding model, simulating the workflow implementation of these models. The comment extraction model was evaluated on the 1,857 reports containing either a lung or adrenal finding and achieved a Jaccard similarity score of 0.74 (where a score of 0 represents no similarity and 1.0 is 100% similar), indicating high similarity between predicted sentences and actual sentences containing follow-up.
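For reference, a token-level Jaccard similarity can be computed as in the sketch below; the study's exact tokenization and matching rules are not reproduced here.

```python
def jaccard(predicted: str, reference: str) -> float:
    """Jaccard similarity between token sets of the extracted sentence and the
    annotated recommendation text (one common way to compute it)."""
    a, b = set(predicted.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard(
    "recommend follow-up chest ct in 6 to 12 months",
    "follow-up chest ct recommended in 6 to 12 months",
))
```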
Model Deployment Infrastructure
The model deployment process required collaboration among several teams to integrate machine-learning services with the EHR. A system was implemented within PowerScribe (Nuance Communications, Burlington, MA) and Epic Radiant (Epic Systems, Madison, WI) to automatically preprocess radiology reports associated with a list of approved procedures relevant to the Result Management project upon report signing. Preprocessed and deidentified reports along with a unique report identifier are sent as inputs to the NLP models, which are hosted using Azure Machine Learning cloud services (Microsoft Corporation, Redmond, WA).
This system returns the report identifier in addition to the four NLP model outputs: a flag indicating the presence of a finding, a flag indicating whether the finding was lung or adrenal related, a flag indicating the recommendation of a CT chest study, and any relevant extracted comment text. These results are then made available in the EHR-integrated workflow within 3 minutes of report signing. Given the need to interface among several software systems, rigorous testing was performed at each step to ensure the robustness and security of this architecture.
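A hypothetical example of the exchanged message, with field names and values that are purely illustrative rather than the actual interface, might look like this:

```python
# Hypothetical shape of the messages exchanged with the Azure-hosted scoring service.
request_payload = {
    "report_id": "RPT-000123",   # unique identifier for the deidentified report
    "impression": "7 mm right lower lobe nodule. Recommend follow-up chest CT in 6-12 months.",
}

response_payload = {
    "report_id": "RPT-000123",
    "finding_detected": True,                    # Finding/No Finding model
    "finding_type": "lung",                      # Lung/Adrenal model
    "followup_ct_chest": True,                   # procedure classification model
    "extracted_comment": "Recommend follow-up chest CT in 6-12 months.",
}
```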
Clinical Workflow
The Result Management workflow is designed to facilitate effective provider engagement without disrupting existing workflows. For radiologists, the system is completely passive. It requires only that radiologists document as they normally would: findings requiring follow-up are noted in the Impressions and/or Findings sections of signed radiology reports and screened automatically by our NLP models.
Relevant radiology reports are automatically screened, and within 3 minutes of a positive report resulting, ordering physicians are notified of identified follow-up recommendations via an InBasket alert displaying the type of finding and the radiology report text (Figure 7).
Figure 7
Following this, a clickable Place follow-up orders link opens the Result Management BPA. At this point, physicians can open a SmartSet (a tool aggregating preconfigured groups of orders and documentation elements that are commonly ordered together) to order relevant follow-up or can select an acknowledgment reason to close the workflow loop (Figure 8).
Figure 8
The SmartSet is automatically populated with the appropriate follow-up study, such as CT chest with contrast for lung findings. Once the physician orders the follow-up using the BPA, the EHR can autotrack the completion of the follow-up. If a patient does not complete the ordered follow-up within a specified timeframe, a dedicated team of nurses is alerted and ensures the necessary follow-up is completed.
To address the challenge of following up on findings identified in studies ordered in the ED, the BPA for such encounters is sent to the primary care physician of record. If no primary care physician is identified or the primary care physician is not within our health system, the patient appears on the worklist of the dedicated team of follow-up nurses, who contact the patient to arrange the necessary follow-up.
During the development of the workflow, special care was taken to identify potential gaps that could result in missed patient follow-up. Two additional elements of workflow were developed.
First, a patient notification was built into the workflow. Seven days after the physician receives the BPA notification, the patient receives a MyChart letter informing them that a finding possibly requiring follow-up was identified and encouraging them to contact their physician if they have not already done so. If MyChart is not active, a letter is mailed via the U.S. Postal Service to the address on file. The current letter text is provided in the Appendix, Exhibit D. Because this letter is the sole patient-facing portion of the Result Management system, its language was carefully designed to balance patient concerns with our clinical objectives.
We sought input from three different hospital patient family advisory councils to iteratively improve the letter wording so that it is informative and encourages follow-up while minimizing unwarranted alarm or stress in recipients. One adjustment allows the letter to oncology patients to be canceled at the discretion of their oncologist, providing more time for discussions with regularly followed patients. Another change specified the name of the imaging study that generated the finding recommendation. With the previous wording, which used the generic term Radiology testing, we identified the potential for confusion in patients who had recently received multiple imaging studies; for instance, an oncology patient with brain cancer in remission, followed with regular brain MRI studies, received a notification about a lung nodule discovered on a chest XR and was concerned that his cancer had recurred. No substantive comments have been received on this updated letter, although the project is evolving, and we continue to actively solicit feedback.
Second, we established an automatic escalation path. If a physician does not interact with a BPA within 21 days of the notification being routed to the InBasket, the patient information will route to the dedicated nurse follow-up team to address. This team first attempts to contact the ordering physician to clarify next steps; if unable to reach the physician, they then contact the patient directly to encourage follow-up with the appropriate provider. This escalation path ensures that patient findings are not lost within busy InBaskets.
Prospective Clinical Evaluation
The entire Result Management workflow went live across the health system in December 2020. Because performance of machine-learning models may differ in development compared with real-world settings, prospective evaluation is a crucial component of clinical integration to ensure that model performance is as expected. Continuous monitoring and evaluation of the Result Management system performance is facilitated by regular reports and an online dashboard aggregating reports flowing through the system and all associated provider interactions. NLP model performance is manually reviewed, and any misclassifications are examined to assess trends in predictions. Table 2 shows the number of radiology reports predicted to contain lung findings over a 13-month validation period, stratified by type of imaging study.
Table 2
Imaging protocol | Total number of reports | Number of reports flagged for lung follow-up (%) |
---|---|---|
CT chest without contrast | 21,861 | 7,217 (33.0) |
CT chest with contrast | 19,938 | 4,427 (22.2) |
CTA chest, pulmonary embolism protocol | 23,851 | 3,420 (14.3) |
CT abdomen pelvis with contrast | 64,256 | 3,370 (5.2) |
XR chest AP portable | 201,880 | 2,231 (1.1) |
XR chest PA LAT | 95,155 | 2,041 (2.1) |
CT chest abdomen pelvis with contrast | 12,556 | 1,520 (12.1) |
CT chest, interstitial lung disease protocol, without contrast | 2,766 | 866 (31.3) |
CT abdomen pelvis without contrast | 15,021 | 698 (4.6) |
CT cardiac calcium score | 2,688 | 328 (12.2) |
This table shows the number of findings flagged for follow-up by the natural language processing (NLP) models for the top 10 imaging modalities contributing lung findings on 13-month clinical validation. CT = computed tomography, CTA = computed tomography angiography, XR = X-ray, AP = anterior to posterior, PA LAT = posterior to anterior, lateral. Source: The authors
Because of early concerns from physician stakeholders that adrenal-related findings detection was not performing at a high enough accuracy relative to lung findings, the portion of the workflow handling adrenal findings was suspended from the clinical workflow in February 2021, although prospective evaluation continued on the back end. This discrepancy in performance may be partly attributable to the data set enrichment used to increase the number of radiology reports containing adrenal findings usable for training; upon evaluation in a real-world data set with a lower incidence of adrenal-related findings, the positive and negative predictive values likely fell accordingly, resulting in more false positives than expected being routed to physician InBaskets. Moreover, the increased time taken to annotate adrenal-related findings (165.5 seconds per report, compared with 93.0 seconds for lung-related findings) suggests that these findings are more challenging even for human experts to identify. Our experience with adrenal-related findings detection highlights the difficulties inherent in translating machine-learning models into real clinical practice, as well as the important role of prospective clinical validation. It also emphasizes the importance of rigorous prospective monitoring once systems are live, no matter how well they performed on retrospective evaluation.
To prospectively evaluate overall model performance in the clinical setting, a random sample of 5,000 radiology reports screened by the system between December 2020 and December 2021 was annotated independently by two clinical experts for the presence of lung and adrenal findings with recommended follow-up, with any disagreements reconciled by discussion. This sample contained 279 reports with lung findings (5.6%) and 7 reports with adrenal findings (0.1%). Examples of correctly and incorrectly classified radiology reports are given in the Appendix, Exhibit E.
Of the 279 radiology reports with lung findings and follow-up recommendations, 215 were accurately identified by the system, and another 23 reports with no follow-up recommendation were incorrectly predicted to contain a lung follow-up, yielding a sensitivity of 77.1%, specificity of 99.5%, and PPV of 90.3% for lung follow-up identification. This indicated comparable clinical performance of the models for lung-related findings detection, as was expected from our internal validation, although with a somewhat decreased sensitivity.
Of the seven reports containing adrenal findings, two were identified accurately by the system, while nine additional reports with no follow-up recommendation were incorrectly predicted to contain an adrenal follow-up, yielding a sensitivity of 28.6%, specificity of 99.8%, and PPV of 18.2% for adrenal follow-up identification. Given the low prevalence of adrenal findings in this sample, all radiology reports that were flagged by the Result Management models as containing an adrenal-related follow-up recommendation up to April 2021 were additionally reviewed in this manner. Of 695 reports predicted to have adrenal findings, 360 were correct, yielding a PPV of 51.8% for adrenal findings. However, of the 335 false-positive adrenal results, 280 (83.6%) were actually lung findings that required follow-up, suggesting that a significant portion of the incorrect adrenal classifications was attributable to the Lung/Adrenal classifier. Thus, 55 adrenal predictions (7.9%) contained no follow-up finding at all. Misclassifications from the Lung/Adrenal classifier may also have incorrectly flagged some adrenal findings as lung related, although this is an extremely rare occurrence. Moreover, our updated models avoid this problem by performing lung and adrenal classifications independently.
Our focus on PPV in the clinical setting underscores the importance of considering performance metrics beyond the commonly reported sensitivity, specificity, and AUC of machine-learning models performing screening tasks. While high sensitivity is obviously desirable to ensure that as many findings as possible are captured, a high false-positive rate may decrease confidence in the system regardless of the underlying sensitivity of the system for screening. For patients, false-positive results can be a significant source of confusion and anxiety, given the care needed to interpret individual findings and follow-up recommendations in the appropriate clinical context. For physicians, who only interact with the Result Management workflow upon detection of a follow-up recommendation, an insufficiently high PPV may result in added time spent dismissing false-positive results.
These analyses reinforced our decision to remove the adrenal pathway from the clinical workflow while keeping it running in the background for prospective analysis, and they have informed the design of the next version of the Result Management platform, which includes many other recommended follow-ups, including thyroid, hepatic, and ovarian findings. Implementation of new iterations of our NLP models is likely to improve overall performance, and continued model updating will be performed with an eye toward improving clinically important metrics across the spectrum of finding types and imaging modalities. Although the NLP models may not capture every relevant follow-up recommendation due to imperfect sensitivity, we are optimistic that the workflow will improve adherence to follow-ups recommended in the reports that are detected.
Workflow Impact
As of January 24, 2022, in the 13 months since the workflow went live, the NLP models have screened more than 570,000 radiology reports and flagged 29,428 for follow-up, yielding a 5.1% rate for lung findings and an average of 70 reports flagged per day. Before deactivation of the adrenal arm of the workflow, 369 reports were flagged for an adrenal-related follow-up. To assess clinical impact, all physician interactions with the EHR-integrated workflow were logged and monitored. In the same 13-month timeframe, 4,978 uses of the Acknowledge Reasons button on the BPA were recorded, comprising 16.9% of predicted findings. The BPA order set was opened 2,251 times, ultimately resulting in 1,378 lung and nine adrenal-related follow-up procedures ordered, representing a 27.7% rate of ordered follow-up for acknowledged BPAs. The full breakdown of BPA interactions is given in Table 3.
Table 3
Action | Count (%) |
---|---|
Opened SmartSet, no order placed | 887 (17.8) |
Opened SmartSet, lung follow-up placed | 1,378 (27.7) |
Opened SmartSet, adrenal follow-up placed | 9 (0.2) |
Follow-up done outside NM | 65 (1.3) |
Postponed | 975 (19.6) |
Managed by oncology | 904 (18.2) |
Not applicable for patient | 469 (9.4) |
Patient declined | 33 (0.7) |
Transfer responsibility | 258 (5.2) |
Total | 4,978 (100) |
A Best Practice Advisory (BPA) is considered acknowledged if one of the following choices is selected. The most commonly selected acknowledgments that do not result in follow-up order placement are Postpone, which silences the BPA for 24 hours, and Managed by Oncology, which applies to patients who receive regular imaging under the direction of the Oncology department. NM = Northwestern Medicine. Source: The authors
BPAs that were acknowledged but did not result in ordering of follow-up present a workflow improvement opportunity. A large proportion of BPAs were deferred using Postpone, which silences the BPA and resends the InBasket message after 24 hours. Of the 975 postponed BPAs, only 10 (1.0%) resulted in follow-up order placement via SmartSet, which may be a result of the use of this option to mute a BPA deemed not applicable by the clinician. The Not Applicable for Patient acknowledgment presents another potential source of ambiguity, because cases of incorrect classifications, deferred patient decision-making, and other reasons may all fall under this category. We aim to improve the workflow presentation to include acknowledgment reasons that are more intuitive and clinically relevant to physicians. More granular information regarding the alerts that do not lead to ordered follow-up will inform future efforts to reduce unneeded BPAs.
At the conclusion of the 13-month evaluation period, more than 2,400 patients with a radiology report flagged by the Result Management system had already completed follow-up care, indicating that a significant number of relevant follow-up orders have been placed outside of our workflow. As more patients continue to complete recommended follow-ups, optimization of our workflow for clinical impact remains a top priority. All flagged studies that are not acknowledged continue to be tracked by the follow-up nurse team, who verify that appropriate follow-up is pursued.
The efficacy of the Result Management workflow was compared with that of the previously implemented strategy, in which radiologists manually click a button in the EHR to document the presence of any incidental finding and related follow-up in a radiology report. Because the Result Management workflow requires no interaction from radiologists, the former system continued to operate after Result Management go-live and continues to capture incidental findings, including nonlung, nonadrenal findings.
Of 25,944 reports with follow-up recommendations, 20,884 were flagged by the Result Management system only, 4,060 were flagged by the manual system only, and 1,000 were flagged by both. On review of the 4,060 reports not flagged by Result Management, 132 (3.3%) were true misses by the NLP models. The remaining 3,928 studies (96.7%) lie beyond the scope of Result Management: 2,283 were not lung-related findings; 1,512 were identified on imaging procedures not screened by Result Management; and 133 contained lung-related findings but no recommended follow-up. While the small percentage of Result Management misses compared with the radiologist-based system is reassuring, these misclassifications (Appendix, Exhibit F) may represent particularly difficult cases for the NLP models. Among these misclassifications are reports containing grammatical errors or abbreviations, which highlight the inherently heterogeneous nature of follow-up recommendations and may have contributed to NLP misclassifications. Inclusion of these reports in future training sets will improve model performance. Altogether, these findings suggest that Result Management is able to capture a greater proportion of follow-up recommendations than an initiative requiring manual input from radiologists.
Workflow Challenges
While interaction with the Result Management system is completely passive for radiologists, it still requires awareness and action on the part of the ordering provider, who must evaluate the radiologist’s recommendation and determine next steps for and with the patient. Clinical uptake of this workflow remains a continued focus of project improvement, given that only 16.9% of findings flagged by the system were acknowledged by ordering physicians. We elected not to enforce alert acknowledgment more stringently to minimize potential alert fatigue, which may be exacerbated by overly intrusive or unnecessary BPAs. However, we identified several cases in which additional follow-up was ordered separately from the BPA workflow and, thus, was not tracked through the Result Management system.
As with any quality improvement implementation, a new workflow takes time to become embedded in standard practice. Wording of the Place follow-up order BPA link may also have led physicians to believe that no further acknowledgment of the alert was necessary for patients not requiring follow-up. Workflow design within the EHR continues to be a focus of communication and change management, informed by continued feedback from surveys of and interviews with clinician stakeholders. The volume of imaging studies ordered in the ED for patients without an associated primary care physician also contributes to the low acknowledgment rate, because these patients are excluded from the workflow and are managed directly by the team of follow-up nurses. The low conversion rate from finding detection to BPA acknowledgment presents a substantial challenge to the efficacy of the NLP system and will likely improve with workflow refinements and greater clinician awareness.
Furthermore, only one-quarter of BPA acknowledgments resulted in the ordering of a follow-up imaging study through the system (Table 3). Because not all follow-up recommendations involve imaging, we expect the follow-up ordering rate to be less than 100%. However, among the most common acknowledgments that did not result in a follow-up order was Managed by Oncology, which applies to patients who already have established oncological follow-up relating to the finding. Refinement of the workflow to exclude these patients may mitigate these unnecessary alerts. Additionally, a substantial portion of BPA acknowledgments indicated Not Applicable for Patient or opened the BPA order SmartSet without subsequently ordering a follow-up procedure. This may be because the SmartSet does not include a relevant follow-up order or because the physician instead orders the follow-up in a conventional manner, bypassing the BPA workflow. Finally, performance is affected by the imperfect PPV of the NLP models, which necessarily results in acknowledgment of false-positive results. These aspects of the clinical workflow implementation will be addressed to optimize clinical impact, and they underscore the difficulty of process implementation at the scale of a health system.
We also found it important to carefully delineate the clinical role of the NLP system to clinical stakeholders. Because the NLP system simply identifies radiology reports in which a radiologist has already recommended follow-up, no clinical decision-making is involved. However, we received feedback from clinicians raising concerns about the clinical use of AI — in particular, the possibility that AI was making clinical decisions without physician oversight. Project leaders met with concerned physicians to clarify the role of AI within the workflow, and BPA wording was adjusted to emphasize this point on the basis of stakeholder feedback. While the disruptive nature of AI has raised important ethical and practical considerations regarding the automation of clinical decision-making, the Result Management implementation of clinical AI highlights the utility of AI to facilitate physician decision-making and impact patient care by streamlining burdensome workflows, supporting rather than displacing the physician.
Updating the Models
As part of an iterative model design process, further refinement of the AI models continued after clinical deployment, with the goal of enhancing scalability and model performance. In particular, we sought to capitalize on recent advances in deep-learning NLP techniques to further optimize our modeling strategy.
Attention and Transformers
Recurrent neural networks, such as the BiLSTM models deployed in the Result Management system, process data sequentially and have a limited capacity to preserve context in long sequences of text. Such models may not “remember” information presented early in a sequence of text, such as a radiologist’s description of a lung nodule, by the time relevant text at the end of the sequence, such as a recommended follow-up CT scan, is processed. The BiLSTM was designed to mitigate this limitation with its bidirectional, forward-backward text-processing architecture, but it remains bound by the constraints of recurrent neural networks.
Attention is a deep-learning concept that addresses the need to preserve relevant context across sequences. Much like its neuropsychological counterpart, attention provides a mechanism by which a deep-learning model can differentially weight the importance of various parts of input text, using the most relevant portions to perform tasks of interest. Rather than processing data sequentially as do recurrent neural networks, attention-based models can draw upon the entirety of the input data when processing any given word. Transformer models are a powerful deep-learning architecture that implement self-attention, a formulation of attention that enables the model to independently learn to attend to the relevant parts of input data. Since their introduction in 2017,48 transformer-based models have ushered in a paradigm shift in NLP, achieving state-of-the-art performance on language-modeling tasks.49
Another advantage of transformer models over recurrent neural network models is the ability to parallelize the training process, enabling efficient training on large data sets. As a result, strategies have emerged to facilitate the creation of NLP models initially trained using prodigious amounts of data and computational power. These pretrained models can then be distributed and fine-tuned to perform novel tasks with much smaller data and computational requirements. Thus, pretraining frontloads much of the computational expense of model development and enables more efficient development of new models able to maximize information derived during pretraining.
Creating New Models
After evaluating several recently introduced deep-learning NLP model architectures and pretraining strategies, including BERT,50 Bio+ClinicalBERT,51 and ELECTRA,52 we elected to implement RoBERTa,53 an improvement of the BERT transformer architecture, for our updated classification models. RoBERTa uses a pretraining strategy called masked language modeling, in which a fraction of tokens (words or word fragments) within a large text database are masked to a placeholder “[MASK]” token. The NLP model is then trained to predict these masked tokens given the remaining text. This training process endows the model with an understanding of text without the need for manual annotation of data, an example of self-supervised machine learning.
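A minimal sketch of this domain-adaptive masked language modeling step, using the Hugging Face transformers library and starting from the public roberta-base checkpoint, is shown below. The sample text, hyperparameters, and training recipe are placeholders and do not reproduce the study's configuration.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Toy stand-in for the institutional corpus of report Impressions.
reports = Dataset.from_dict({"text": [
    "impression: 6 mm left lower lobe nodule. recommend follow-up chest ct in 12 months.",
    "impression: no acute cardiopulmonary abnormality.",
]})
tokenized = reports.map(lambda r: tokenizer(r["text"], truncation=True, max_length=256),
                        remove_columns=["text"])

# Randomly mask ~15% of tokens; the model is trained to predict them from context.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rm-roberta-mlm",
                           per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```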
Beyond potential raw improvements in performance, this strategy presented an opportunity for us to enhance the scalability and flexibility of our model development strategy. Pretraining a RoBERTa model on radiology reports from our institution would yield a general language model applicable to all tasks required in the report screening process, in contrast to our initial approach, which used separately trained models to perform individual classification tasks. This pretrained model can then be fine-tuned on our annotated data sets to perform the individual tasks of follow-up detection, comment extraction, and procedure classification.
In consideration of these advantages and the long-term goals of expanding Result Management to include findings beyond lung and adrenal, we also took this opportunity to revise our NLP strategy to combine the finding detection and classification tasks. That is, rather than using a generic Finding/No Finding model that passes reports with findings to a Lung/Adrenal model, these tasks are performed by a single model that classifies reports as Lung/Adrenal/No Finding. This prevents an upstream model’s misclassifications from decreasing the performance of a downstream model, which was a major contributing factor to the poor performance of the original adrenal detection workflow.
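The combined classification head can be set up as in the sketch below; loading would start from the domain-pretrained checkpoint described above, but the public roberta-base model is shown for simplicity, and the label names are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["no_finding", "lung", "adrenal"]   # assumed label names for the combined classifier
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

impression = "impression: 9 mm right upper lobe nodule. recommend follow-up chest ct in 6 months."
inputs = tokenizer(impression, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])   # untrained head: output is arbitrary until fine-tuned
```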
[Exhibit. Of 25,944 reports with follow-up recommendations, 20,884 were flagged by the Result Management system only, 4,060 by the manual system only, and 1,000 by both. On review of the 4,060 reports not flagged by Result Management, 132 (3.3%) were true misses by the NLP models.]
To create the RoBERTa models, we obtained a data set of more than 10 million radiology reports pulled from the health system EDW for masked language modeling. Models trained in this way were then fine-tuned to perform lung-related findings detection, adrenal-related findings detection, and procedure classification as individual classification tasks. We also refined our modeling strategy to exclude the Findings section of radiology reports, instead using only the Impressions section, given that the vast majority of follow-up recommendations documented by radiologists are noted in the Impressions section. In theory, reducing the amount of text that each model needs to process increases the density of relevant information available to the models, facilitating the learning process by eliminating extraneous information. Indeed, a direct comparison of models given both the Findings and Impressions sections versus the Impressions section alone showed improved sensitivity and specificity with the latter approach.
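The exact structure of reports is institution specific, but as a rough illustration of restricting the input to the Impressions section, a simple pattern match (the section header spelling and report layout below are assumptions) could be applied before tokenization:

import re

def extract_impressions(report_text: str) -> str:
    # Assumes the section begins with a header such as "IMPRESSION:" and runs to the
    # end of the report or to the next all-caps section header; real reports may differ.
    match = re.search(r"IMPRESSIONS?:\s*(.*?)(?:\n[A-Z][A-Z /]+:|\Z)", report_text, re.S)
    return match.group(1).strip() if match else report_text.strip()

example = ("FINDINGS: 6 mm nodule in the right upper lobe.\n"
           "IMPRESSION: 6 mm pulmonary nodule; recommend follow-up chest CT in 6-12 months.")
print(extract_impressions(example))   # prints only the Impression text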
We also updated the comment extraction model to perform a task called Extractive Question Answering, in which the NLP model is given a question and a passage of text and is trained to predict the span of text that answers the question. We evaluated several NLP architectures and selected DistilRoBERTa,54 a smaller, faster distilled version of the RoBERTa model, for this task. The DistilRoBERTa model is queried with either “Lung Findings” or “Adrenal Findings” as the question, along with the Impressions section of the radiology report, and returns the relevant portion of the report as output.
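As an illustrative sketch (the checkpoint name is a placeholder and the use of the Hugging Face question-answering pipeline is an assumption, not the authors' implementation), querying such a model could look like:

from transformers import pipeline

# "comment-extraction-distilroberta" stands in for a locally fine-tuned
# DistilRoBERTa question-answering checkpoint.
qa = pipeline("question-answering", model="comment-extraction-distilroberta")

impression = ("6 mm nodule in the right upper lobe; recommend follow-up chest CT in 6-12 months. "
              "1.5 cm left adrenal nodule; adrenal protocol CT recommended.")

lung = qa(question="Lung Findings", context=impression)
adrenal = qa(question="Adrenal Findings", context=impression)

print(lung["answer"])      # expected: the lung-related recommendation span
print(adrenal["answer"])   # expected: the adrenal-related recommendation span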
Performance of the new RoBERTa models is described in the Appendix, Exhibit C. Compared with the BiLSTM models, lung classification was performed with a similar sensitivity and markedly improved specificity, while adrenal classification had an improved sensitivity but lower specificity. Additionally, the comment extraction model achieved a Jaccard similarity score of 0.89, indicating very high agreement with annotated reports and substantially outperforming the original XGBoost model. Clinical deployment of these models is scheduled for March 2022. Clinical performance will be evaluated continually by periodic random sampling of a portion of screened radiology reports, which will be labeled by our trained nurse annotators blinded to model predictions and assessed for accuracy.
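For context, the Jaccard similarity between a predicted and an annotated span is the size of their overlap divided by the size of their union; a minimal token-level version (the exact tokenization used for scoring is not specified in the article and is assumed here) is:

def jaccard(predicted: str, annotated: str) -> float:
    # Token-level Jaccard similarity: |intersection| / |union| of the two word sets.
    a, b = set(predicted.lower().split()), set(annotated.lower().split())
    return 1.0 if not (a or b) else len(a & b) / len(a | b)

print(jaccard("recommend follow-up chest CT in 6-12 months",
              "follow-up chest CT in 6-12 months"))   # approximately 0.86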
Looking Ahead
Tracking of follow-up recommendations in radiology reports to prevent missed or delayed care has great potential to improve patient outcomes and remains a challenging quality improvement proposition at the scale of health systems. Our implementation of the Result Management system demonstrates the value of using AI to automate an otherwise labor-intensive manual task, screening well over 1,000 reports and identifying dozens of follow-up recommendations on a daily basis. Prospective clinical evaluation of the NLP models confirms the potential of AI to provide demonstrable clinical impact.
Our models perform favorably in comparison with previously published methods for finding and follow-up detection,12,25-34 harnessing advances in deep learning and a robust data collection and annotation framework to support model development at scale. Crucially, to our knowledge, no previous study has performed a prospective clinical evaluation of an AI technique for detection of follow-up recommendations.55,56 The scale and complexity of deep-learning methods present challenges to their interpretability relative to simpler models,57 necessitating thorough evaluation throughout the clinical deployment process.58,59 We demonstrate the continued validity of the BiLSTM deep-learning NLP models for the detection of lung-related findings and recommendations when implemented in a clinical setting, although challenges posed by suboptimal adrenal model performance emphasize the importance of prospective validation. Continued evaluation of our updated transformer-based models upon deployment will further characterize the potential of deep-learning advances to translate into clinical benefits.
Moreover, the overwhelming majority of the medical machine-learning literature stops far short of clinical deployment. In our experience, deployment of the Result Management models posed at least as significant a challenge as development of the models themselves, highlighting the difficulties associated with implementing clinical AI tools. Realizing such efforts requires extensive coordination among teams spanning the gamut of health care and IT expertise. Unexpected challenges relating to model performance, EHR integration, workflow implementation, and clinical uptake were not uncommon and have required continued effort from team members and project stakeholders to optimize the Result Management system for clinical needs, a process that has been likened to building the plane while flying it.
Next Steps
As clinical uptake of the Result Management workflow has been limited, continued efforts are needed to evaluate its impact on patient outcomes. Previous work has demonstrated significant interradiologist variation in rates of follow-up recommendation,6 and documentation practices may also vary systematically among institutions. Further evaluation of our models with a multi-institutional data set is needed to assess generalizability across health care systems beyond our single-institution experience. Moreover, because the NLP models were trained on radiology reports generated over a single year, changing trends in imaging findings and their documentation over time may adversely impact model performance. While our prospective validation supports the temporal robustness of the model, periodic model retraining with more recent data will help to improve model performance over time.
We will continue to build and test new and improved NLP models and clinical workflow adjustments. Review of model misclassifications has allowed us to identify radiology reports that may be particularly difficult to classify, and retraining on updated data sets collected as part of this effort will continue to improve performance. The dedicated annotation system and EHR infrastructure in place facilitate streamlined model prototyping, evaluation, and deployment. Moreover, efforts are underway to expand this system to hepatic, thyroid, and ovarian findings requiring follow-up. Finally, as the Result Management system continues to mature and tracks more follow-ups to completion, we aim to further characterize its impact on patient outcomes.
Related Content: Watch Mozziyar Etemadi’s talk delving into this In Depth article.
Notes
Jane Domingo, Galal Galal, Jonathan Huang, Priyanka Soni, Vladislav Mukhin, Camila Altman, Tom Bayer, Thomas Byrd, Stacey Caron, Patrick Creamer, Jewell Gilstrap, Holly Gwardys, Charles Hogue, Kumar Kadiyam, Michael Massa, Paul Salamone, Robert Slavicek, Michael Suna, Benjamin Ware, Stavroula Xinos, Lawrence Yuen, Thomas Moran, Cynthia Barnard, James G. Adams, and Mozziyar Etemadi have nothing to disclose.
Appendix
References
1.
Poon EG, Gandhi TK, Sequist TD, Murff HJ, Karson AS, Bates DW. “I wish I had seen this test result earlier!”: dissatisfaction with test result management systems in primary care. Arch Intern Med 2004;164:2223-8 https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/217621.
2.
Giardina TD, King BJ, Ignaczak AP, et al. Root cause analysis reports help identify common factors in delayed diagnosis and treatment of outpatients. Health Aff (Millwood) 2013;32:1368-75 https://www.healthaffairs.org/doi/10.1377/hlthaff.2013.0130.
3.
Wahls TL, Cram PM. The frequency of missed test results and associated treatment delays in a highly computerized health system. BMC Fam Pract 2007;8:32 https://bmcprimcare.biomedcentral.com/articles/10.1186/1471-2296-8-32.
4.
Roy CL, Poon EG, Karson AS, et al. Patient safety concerns arising from test results that return after hospital discharge. Ann Intern Med 2005;143:121-8 https://www.acpjournals.org/doi/10.7326/0003-4819-143-2-200507190-00011.
5.
Gwal K. The Consequences of Miscommunication Regarding a Possible Artifact. Patient Safety Network. Agency for Healthcare Research and Quality. June 30, 2021. Accessed October 9, 2021. https://psnet.ahrq.gov/web-mm/consequences-miscommunication-regarding-possible-artifact.
6.
O’Leary TJ, Ooi TC. The adrenal incidentaloma. Can J Surg 1986;29:6-8.
7.
Singh H, Spitzmueller C, Petersen NJ, Sawhney MK, Sittig DF. Information overload and missed test results in electronic health record-based settings. JAMA Intern Med 2013;173:702-4 https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/1657753.
8.
Callen J, Georgiou A, Li J, Westbrook JI. The safety implications of missed test results for hospitalised patients: a systematic review. BMJ Qual Saf 2011;20:194-9 https://qualitysafety.bmj.com/content/20/2/194.
9.
Smith-Bindman R, Kwan ML, Marlow EC, et al. Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000-2016. JAMA 2019;322:843-56 https://jamanetwork.com/journals/jama/fullarticle/2749213.
10.
Papanicolas I, Woskie LR, Jha AK. Health care spending in the United States and other high-income countries. JAMA 2018;319:1024-39 https://jamanetwork.com/journals/jama/article-abstract/2674671.
11.
Sistrom CL, Dreyer KJ, Dang PP, et al. Recommendations for additional imaging in radiology reports: multifactorial analysis of 5.9 million examinations. Radiology 2009;253:453-61 https://pubs.rsna.org/doi/10.1148/radiol.2532090200.
12.
Mabotuwana T, Hall CS, Hombal V, et al. Automated tracking of follow-up imaging recommendations. AJR Am J Roentgenol 2019;212:1287-94 https://www.ajronline.org/doi/10.2214/AJR.18.20586.
13.
Kadom N, Doherty G, Solomon A, et al. Safety-net academic hospital experience in following up noncritical yet potentially significant radiologist recommendations. AJR Am J Roentgenol 2017;209:982-6 https://www.ajronline.org/doi/10.2214/AJR.17.18179.
14.
Cochon LR, Kapoor N, Carrodeguas E, et al. Variation in follow-up imaging recommendations in radiology reports: patient, modality, and radiologist predictors. Radiology 2019;291:700-7 https://pubs.rsna.org/doi/10.1148/radiol.2019182826.
15.
Alpert JB, Ko JP. Management of incidental lung nodules: current strategy and rationale. Radiol Clin North Am 2018;56:339-51 https://www.radiologic.theclinics.com/article/S0033-8389(18)30002-2/fulltext.
16.
Hitzeman N, Cotton E. Incidentalomas: initial management. Am Fam Physician 2014;90:784-9 https://www.aafp.org/afp/2014/1201/p784.html.
17.
Mabotuwana T, Hombal V, Dalal S, Hall CS, Gunn M. Determining adherence to follow-up imaging recommendations. J Am Coll Radiol 2018;15(3 Pt A):422-8 https://www.jacr.org/article/S1546-1440(17)31475-8/fulltext.
18.
Feeney T, Talutis S, Janeway M, et al. Evaluation of incidental adrenal masses at a tertiary referral and trauma center. Surgery 2020;167:868-75 https://www.surgjournal.com/article/S0039-6060(19)30586-0/fulltext.
19.
Kwan JL, Yermak D, Markell L, Paul NS, Shojania KG, Cram P. Follow up of incidental high-risk pulmonary nodules on computed tomography pulmonary angiography at care transitions. J Hosp Med 2019;14:349-52 https://shmpublications.onlinelibrary.wiley.com/doi/abs/10.12788/jhm.3128.
20.
Lumbreras B, Donat L, Hernández-Aguado I. Incidental findings in imaging diagnostic tests: a systematic review. Br J Radiol 2010;83:276-89 https://www.birpublications.org/doi/10.1259/bjr/98067945.
21.
Cho JK, Zafar HM, Lalevic D, Cook TS. Patient factor disparities in imaging follow-up rates after incidental abdominal findings. AJR Am J Roentgenol 2019;212:589-95 https://www.ajronline.org/doi/10.2214/AJR.18.20083.
22.
Spruce MW, Bowman JA, Wilson AJ, Galante JM. Improving incidental finding documentation in trauma patients amidst poor access to follow-up care. J Surg Res 2020;248:62-8 https://www.journalofsurgicalresearch.com/article/S0022-4804(19)30811-X/fulltext.
23.
Zafar HM, Bugos EK, Langlotz CP, Frasso R. “Chasing a ghost”: factors that influence primary care physicians to follow up on incidental imaging findings. Radiology 2016;281:567-73 https://pubs.rsna.org/doi/pdf/10.1148/radiol.2016152188.
24.
Sperry JL, Massaro MS, Collage RD, et al. Incidental radiographic findings after injury: dedicated attention results in improved capture, documentation, and management. Surgery 2010;148:618-24 https://linkinghub.elsevier.com/retrieve/pii/S0039606010003855.
25.
Dutta S, Long WJ, Brown DF, Reisner AT. Automated detection using natural language processing of radiologists recommendations for additional imaging of incidental findings. Ann Emerg Med 2013;62:162-9 https://linkinghub.elsevier.com/retrieve/pii/S0196064413001054.
26.
Johnson E, Baughman WC, Ozsoyoglu G. Modeling incidental findings in radiology records. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics. September 22, 2013. Accessed February 22, 2022. https://dl.acm.org/doi/10.1145/2506583.2512367.
27.
Mabotuwana T, Hall CS, Tieder J, Gunn ML. Improving quality of follow-up imaging recommendations in radiology. AMIA Annu Symp Proc 2017;2017:1196-204 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977608/.
28.
Oliveira L, Tellis R, Qian Y, Trovato K, Mankovich G. Follow-up recommendation detection on radiology reports with incidental pulmonary nodules. Stud Health Technol Inform 2015;216:1028 https://ebooks.iospress.nl/publication/40486.
29.
Lou R, Lalevic D, Chambers C, Zafar HM, Cook TS. Automated detection of radiology reports that require follow-up imaging using natural language processing feature engineering and machine learning classification. J Digit Imaging 2020;33:131-6 https://link.springer.com/article/10.1007%2Fs10278-019-00271-7.
30.
Pham AD, Névéol A, Lavergne T, et al. Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings. BMC Bioinformatics 2014;15:266 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-266.
31.
Xu Y, Tsujii J, Chang EI. Named entity recognition of follow-up and time information in 20,000 radiology reports. J Am Med Inform Assoc 2012;19:792-9 https://academic.oup.com/jamia/article/19/5/792/716614.
32.
Bala W, Steinkamp J, Feeney T, et al. A web application for adrenal incidentaloma identification, tracking, and management using machine learning. Appl Clin Inform 2020;11:606-16 https://www.thieme-connect.de/products/ejournals/abstract/10.1055/s-0040-1715892.
33.
Carrodeguas E, Lacson R, Swanson W, Khorasani R. Use of machine learning to identify follow-up recommendations in radiology reports. J Am Coll Radiol 2019;16:336-43 https://linkinghub.elsevier.com/retrieve/pii/S1546144018314042.
34.
Lau W, Payne TH, Uzuner O, Yetisgen M. Extraction and analysis of clinically important follow-up recommendations in a large radiology dataset. AMIA Jt Summits Transl Sci Proc 2020;2020:335-44 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7233090/.
35.
Li H. Deep learning for natural language processing: advantages and challenges. Natl Sci Rev 2018;5:24-6 https://academic.oup.com/nsr/article/5/1/24/4107792.
36.
MacMahon H, Naidich DP, Goo JM, et al. Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017. Radiology 2017;284:228-43 https://pubs.rsna.org/doi/10.1148/radiol.2017161659.
37.
Harris ZS. Distributional structure. WORD 1954;10:146-62 https://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659520.
38.
Juluru K, Shih H-H, Keshava Murthy KN, Elnajjar P. Bag-of-words technique in natural language processing: a primer for radiologists. Radiographics 2021;41:1420-6.
39.
Sahlgren M, Cöster R. Using bag-of-concepts to improve the performance of support vector machines in text categorization. COLING 2004: 20th International Conference on Computational Linguistics. Geneva, Switzerland, August 23-27, 2004 https://aclanthology.org/C04-1070.pdf.
40.
Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 2017;30:3149-57 https://dl.acm.org/doi/10.5555/3294996.3295074.
41.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. KDD ‘16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. August 13, 2016. Accessed February 22, 2022. https://dl.acm.org/doi/10.1145/2939672.2939785.
42.
Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag 2018;13:55-75 https://ieeexplore.ieee.org/document/8416973.
43.
Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 2005;18:602-10 https://www.sciencedirect.com/science/article/abs/pii/S0893608005001206?via%3Dihub.
44.
Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). October 2014. Accessed February 22, 2022. https://aclanthology.org/D14-1162/.
45.
Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data 2019;6:52 https://www.nature.com/articles/s41597-019-0055-0.
46.
Rother A, Niemann U, Hielscher T, Völzke H, Ittermann T, Spiliopoulou M. Assessing the difficulty of annotating medical data in crowdworking with help of experiments. PLoS One 2021;16:e0254764 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0254764.
47.
Klie J-C, Bugert M, Boullosa B, Eckart de Castilho R, Gurevych I. The INCEpTION platform: machine-assisted and knowledge-oriented interactive annotation. Proceedings of System Demonstrations of the 27th International Conference on Computational Linguistics (COLING 2018). Santa Fe, NM, August 20-26, 2018 https://aclanthology.org/C18-2002/.
48.
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017). December 2017. Accessed February 22, 2022. https://arxiv.org/pdf/1706.03762.pdf.
49.
Galassi A, Lippi M, Torroni P. Attention in natural language processing. IEEE Trans Neural Netw Learn Syst 2021;32:4291-308 https://ieeexplore.ieee.org/document/9194070.
50.
Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. Minneapolis, MN, June 2-7, 2019. https://aclanthology.org/N19-1423/.
51.
Alsentzer E, Murphy JR, Boag W, et al. Publicly Available Clinical BERT Embeddings. June 20, 2019. Accessed February 22, 2022. https://arxiv.org/abs/1904.03323.
52.
Clark K, Luong M-T, Le QV, Manning CD. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. March 23, 2020. Accessed February 22, 2022. https://arxiv.org/abs/2003.10555.
53.
Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. July 26, 2019. Accessed February 22, 2022. https://arxiv.org/abs/1907.11692.
54.
Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. March 1, 2020. Accessed February 22, 2022. https://arxiv.org/abs/1910.01108.
55.
Freeman K, Geppert J, Stinton C, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 2021;374:n1872 https://www.bmj.com/content/374/bmj.n1872.
56.
Roberts M, Driggs D, Thorpe M, et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell 2021;3:199-217 https://www.nature.com/articles/s42256-021-00307-0.
57.
Gilpin LH, Bau D, Yuan BZ, Bajwa A, Specter M, Kagal L. Explaining explanations: an overview of interpretability of machine learning. 5th IEEE International Conference on Data Science and Advanced Analytics (DSAA 2018). Turin, Italy, October 1-3, 2018 https://ieeexplore.ieee.org/document/8631448.
58.
Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020;368:m689 https://www.bmj.com/content/368/bmj.m689.
59.
Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019;17:195 https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-019-1426-2.
Published In
NEJM Catalyst Innovations in Care Delivery
Copyright
Copyright © 2022 Massachusetts Medical Society.
History
Published online: March 16, 2022
Published in issue: March 16, 2022