Original Research

DeepCOVID-XR: An Artificial Intelligence Algorithm to Detect COVID-19 on Chest Radiographs Trained and Tested on a Large U.S. Clinical Data Set

Published Online: https://doi.org/10.1148/radiol.2020203511

Abstract

Background

There are characteristic findings of coronavirus disease 2019 (COVID-19) on chest images. An artificial intelligence (AI) algorithm to detect COVID-19 on chest radiographs might be useful for triage or infection control within a hospital setting, but prior reports have been limited by small data sets, poor data quality, or both.

Purpose

To present DeepCOVID-XR, a deep learning AI algorithm for detecting COVID-19 on chest radiographs that was trained and tested on a large clinical data set.

Materials and Methods

DeepCOVID-XR is an ensemble of convolutional neural networks developed to detect COVID-19 on frontal chest radiographs, with reverse-transcription polymerase chain reaction test results as the reference standard. The algorithm was trained and validated on 14 788 images (4253 positive for COVID-19) from sites across the Northwestern Memorial Health Care System from February 2020 to April 2020 and was then tested on 2214 images (1192 positive for COVID-19) from a single hold-out institution. Performance of the algorithm was compared with interpretations from five experienced thoracic radiologists on 300 random test images using the McNemar test for sensitivity and specificity and the DeLong test for the area under the receiver operating characteristic curve (AUC).

Results

A total of 5853 patients (mean age, 58 years ± 19 [standard deviation]; 3101 women) were evaluated across data sets. For the entire test set, accuracy of DeepCOVID-XR was 83%, with an AUC of 0.90. For 300 random test images (134 positive for COVID-19), accuracy of DeepCOVID-XR was 82%, compared with that of individual radiologists (range, 76%–81%) and the consensus of all five radiologists (81%). DeepCOVID-XR had a significantly higher sensitivity (71%) than one radiologist (60%, P < .001) and significantly higher specificity (92%) than two radiologists (75%, P < .001; 84%, P = .009). AUC of DeepCOVID-XR was 0.88 compared with the consensus AUC of 0.85 (P = .13 for comparison). With consensus interpretation as the reference standard, the AUC of DeepCOVID-XR was 0.95 (95% CI: 0.92, 0.98).

Conclusion

DeepCOVID-XR, an artificial intelligence algorithm, detected coronavirus disease 2019 on chest radiographs with a performance similar to that of experienced thoracic radiologists in consensus.

© RSNA, 2020

Supplemental material is available for this article.

See also the editorial by van Ginneken in this issue.

Summary

DeepCOVID-XR, an artificial intelligence algorithm for detecting coronavirus disease 2019 on chest radiographs, demonstrated performance similar to that of the consensus of experienced thoracic radiologists.

Key Results

• DeepCOVID-XR classified 2214 test images (1192 positive for coronavirus disease 2019 [COVID-19]) with an accuracy of 83% and an area under the receiver operating characteristic curve (AUC) of 0.90 compared with the reference standard of reverse-transcription polymerase chain reaction.

• On 300 random test images (134 positive for COVID-19), accuracy of DeepCOVID-XR was 82% (AUC, 0.88), as compared with five individual thoracic radiologists (accuracy, 76%–81%) and the consensus of all five radiologists (accuracy, 81%; AUC, 0.85).

• When using consensus interpretation of the radiologists as the reference standard, AUC of DeepCOVID-XR was 0.95.

Introduction

Coronavirus disease 2019 (COVID-19) is responsible for over 40 million cases and over 1.1 million deaths worldwide as of October 22, 2020 (1) and has strained critical health care resources. Although the reference standard for diagnosis of COVID-19 is a reverse-transcription polymerase chain reaction (RT-PCR) assay for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) viral RNA, there are characteristic findings of COVID-19 on chest CT images or chest radiographs. This has inspired multiple efforts at developing an artificial intelligence (AI) algorithm for automated diagnosis of COVID-19 on chest images. However, imaging findings are neither sensitive nor specific enough to be used as a diagnostic tool for COVID-19, and studies that suggest otherwise are limited by selection bias (2–4). Instead, a potential application of AI for chest imaging analysis is for triage or infection control programs in a hospital or emergency department setting to provide early identification of patients with suspicious findings on chest images for further testing and isolation.

Although there have been promising results using AI for detection of COVID-19 on CT images (5–7), the use of CT for this purpose is limited by concerns regarding cost, time, radiation exposure, and decontamination procedures for equipment (8). In contrast, chest radiography can be performed rapidly and at the bedside, involves trivial radiation exposure, and poses a lower risk for viral spread (9,10). However, previously reported AI algorithms for identification of COVID-19 using chest radiography have been limited by small data sets and the use of publicly available images of variable quality and questionable validity (11–16). Here, we present DeepCOVID-XR, a deep learning algorithm to detect chest radiographs suspicious for COVID-19. DeepCOVID-XR was trained and tested on a large data set of clinical images from a major U.S. health care system (to our knowledge, the largest clinical data set of chest radiographs from the COVID-19 era used to train a published AI platform to date). In this study, we compare the performance of the DeepCOVID-XR algorithm with interpretations by experienced thoracic radiologists.

Materials and Methods

Patients

This retrospective study was approved by the Northwestern institutional review board (STU00212323) and was granted a waiver of Health Insurance Portability and Accountability Act authorization and a waiver of written informed consent. Our study sample included consecutive patients from over 20 sites (including hospitals, stand-alone emergency departments, and urgent care facilities) (Table E1 [online]) across the Northwestern Memorial Health Care System who were tested for COVID-19 from February 2020 to April 2020. Patients included adults aged at least 18 years with a documented RT-PCR test result for SARS-CoV-2 (whether positive or negative); a diagnosis of COVID-19 as defined by the International Classification of Diseases, Tenth Revision (ICD-10) code; or a COVID-19 definitive positive flag in the electronic health record. COVID-19 positivity was defined as any one positive RT-PCR result for SARS-CoV-2 during the associated clinical encounter (eg, a patient with multiple RT-PCR tests and only one positive result would be considered positive); a diagnosis of COVID-19 as defined by the ICD-10 code; or a COVID-19 definitive positive flag in the electronic health record (most patients with a diagnosis indicated only by the ICD-10 code or a COVID-19 definitive positive flag in the electronic health record had a prior documented positive RT-PCR test for SARS-CoV-2 at an institution outside the Northwestern Memorial Health Care System). Patients with only negative documented RT-PCR test results for SARS-CoV-2 during their clinical encounter were labeled as COVID-19 negative.
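
For illustration, the patient-level labeling rule described above can be expressed as a simple function applied to each clinical encounter. The sketch below is not the authors' code; the DataFrame columns (patient_id, rt_pcr_results, icd10_covid, ehr_covid_flag) are hypothetical stand-ins for the corresponding electronic health record fields.

```python
# Illustrative sketch (not the authors' code) of the patient-level labeling rule
# described above. Column names are hypothetical stand-ins for EHR fields.
import pandas as pd

def label_patient(row) -> int:
    """Return 1 (COVID-19 positive) or 0 (COVID-19 negative) for one clinical encounter."""
    # Any single positive RT-PCR result during the encounter marks the patient positive.
    if any(result == "positive" for result in row["rt_pcr_results"]):
        return 1
    # An ICD-10 COVID-19 diagnosis or a definitive-positive EHR flag also counts as positive.
    if row["icd10_covid"] or row["ehr_covid_flag"]:
        return 1
    # Otherwise (only negative documented RT-PCR results), label the encounter negative.
    return 0

encounters = pd.DataFrame({
    "patient_id": ["A", "B", "C"],
    "rt_pcr_results": [["negative", "positive"], ["negative"], ["negative"]],
    "icd10_covid": [False, False, True],
    "ehr_covid_flag": [False, False, False],
})
encounters["covid_positive"] = encounters.apply(label_patient, axis=1)
print(encounters[["patient_id", "covid_positive"]])  # A and C labeled positive, B negative
```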

Image Labeling and Data Set Partitioning

Every chest radiograph obtained during the study period in patients who met inclusion criteria was included, regardless of quality. All chest radiographs acquired during a given clinical encounter were labeled as positive or negative for COVID-19 based on the previously mentioned patient-level criteria, regardless of the timing of chest radiographic findings compared with RT-PCR results. Images were filtered to include only frontal projections (ie, bedside anteroposterior images and only posteroanterior images from posteroanterior or lateral acquisitions).

Images from Northwestern Memorial Health Care’s major academic teaching hospital, Northwestern Memorial Hospital, were combined with those from other sites, with the exception of images from a single community hospital, Lake Forest Hospital, which were held out as a test set that the algorithm was never exposed to during training or validation. Images from Northwestern Memorial Hospital and the other sites were then split into training and validation sets in an 80%-20% fashion (while ensuring no crossover of patients among groups). Figure 1 shows the breakdown of training, validation, and test data sets.
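
One way to implement an 80%-20% split with no crossover of patients among groups is a grouped split keyed on the patient identifier. The following sketch uses scikit-learn's GroupShuffleSplit on an assumed image-level table; it illustrates the constraint rather than reproducing the authors' partitioning code.

```python
# Illustrative grouped 80%/20% split keyed on patient identifier, so that all images
# from a given patient fall in the same partition. The table layout is assumed.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

images = pd.DataFrame({
    "image_path": [f"img_{i}.png" for i in range(10)],
    "patient_id": ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6", "p7", "p8"],
    "label": [1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# GroupShuffleSplit guarantees that no patient_id crosses the train/validation boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(images, groups=images["patient_id"]))
train_df, val_df = images.iloc[train_idx], images.iloc[val_idx]
assert set(train_df["patient_id"]).isdisjoint(set(val_df["patient_id"]))
print(len(train_df), "training images;", len(val_df), "validation images")
```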

Figure 1: Flowchart for patient inclusion in the study and breakdown of training, validation, and hold-out test data sets. COVID-19 = coronavirus disease 2019, CXR = chest radiography, EHR = electronic health record, LFH = Lake Forest Hospital, NMH = Northwestern Memorial Hospital, RT-PCR = reverse-transcription polymerase chain reaction

DeepCOVID-XR: An Ensemble of Deep Neural Networks for COVID-19 Prediction

Details regarding image preprocessing; the architecture of the deep convolutional neural network (CNN) ensemble model; algorithm training, validation, and testing; and saliency heatmap generation are provided in Appendix E1 (online). Briefly, DeepCOVID-XR is a weighted ensemble of deep neural networks (Fig 2). Every image in the data set is first preprocessed to produce four separate images (cropped and uncropped images at resolutions of 224 × 224 pixels and 331 × 331 pixels). Each image is then fed into six previously validated CNN architectures—DenseNet-121 (17), ResNet-50 (18), InceptionV3 (19), Inception-ResNetV2 (20), Xception (21), and EfficientNet-B2 (22)—for a total of 24 individually trained CNNs that served as members of the deep learning model ensemble. The CNNs in this ensemble were pretrained on a large publicly available data set of over 100 000 chest radiographs from the National Institutes of Health (23) and were then fine-tuned on our clinical training set of chest radiographs from the COVID-19 era using transfer learning. The validation data set was used to optimize hyperparameters. The final binary prediction of the neural network architecture was a weighted average of the predictions of these individual CNNs, classifying images as either COVID-19 positive or COVID-19 negative using an output threshold of greater than 0.5 (on a scale from 0 to 1). Gradient class activation mapping (24) was used to produce heatmaps to visualize feature importance for arriving at a prediction of COVID-19 positivity. Our code base, including trained weights for each of the 24 individual neural network architectures and their respective model weights for the weighted ensemble, is provided freely on GitHub at https://github.com/IVPLatNU/deepcovidxr.
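
To make the ensemble structure concrete, the sketch below builds two of the six backbone families as Keras members and combines their predicted probabilities with a weighted average thresholded at 0.5. It is a minimal illustration, not the published implementation (which is available at the GitHub link above): ImageNet weights stand in for the NIH chest radiograph pretraining, the fine-tuning step is omitted, and the ensemble weights shown are placeholders for weights that would be chosen on the validation set.

```python
# Minimal sketch of the weighted CNN ensemble idea (not the published code; see the
# GitHub repository above for the authors' implementation).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, applications

def build_member(backbone_fn, input_size: int) -> tf.keras.Model:
    """One ensemble member: a pretrained backbone with a binary sigmoid head."""
    # The paper pretrains on NIH chest radiographs; ImageNet weights are used here
    # only as a convenient stand-in for the sketch.
    base = backbone_fn(include_top=False, weights="imagenet",
                       input_shape=(input_size, input_size, 3), pooling="avg")
    out = layers.Dense(1, activation="sigmoid")(base.output)
    return models.Model(base.input, out)

# Six backbones x {cropped, uncropped} x {224, 331} = 24 members in the paper;
# two members are shown to keep the sketch short.
members = [
    build_member(applications.DenseNet121, 224),
    build_member(applications.ResNet50, 224),
]
for m in members:
    m.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
    # m.fit(train_ds, validation_data=val_ds, ...)  # fine-tuning step, omitted here

def ensemble_predict(members, weights, batches) -> np.ndarray:
    """Weighted average of member probabilities; averages > 0.5 are called COVID-19 positive."""
    probs = np.stack([m.predict(x, verbose=0).ravel() for m, x in zip(members, batches)])
    avg = np.average(probs, axis=0, weights=weights)
    return (avg > 0.5).astype(int)

# Hypothetical usage: each member receives its own preprocessed version of the same images.
dummy_batches = [np.random.rand(4, 224, 224, 3).astype("float32") for _ in members]
placeholder_weights = [0.5, 0.5]  # the real weights would be chosen on the validation set
print(ensemble_predict(members, placeholder_weights, dummy_batches))
```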

Figure 2: Schematic shows the general architecture of the DeepCOVID-XR deep learning ensemble model. Images are initially preprocessed to crop a square centered on the lungs, and then both uncropped and cropped images are downsampled to two different resolutions (224 × 224 pixels and 331 × 331 pixels) before being fed into each of six different previously validated convolutional neural network architectures (four images × six architectures = 24 models total). The predictions from each individual model are then ensembled using a weighted average to produce a single prediction of coronavirus disease 2019 (COVID-19) positivity or COVID-19 negativity for each image.
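
The saliency heatmaps mentioned above follow the standard gradient class activation mapping (Grad-CAM) recipe (24). The sketch below shows a generic Grad-CAM function for a binary sigmoid-output Keras model; the layer name in the usage comment is an assumption (the last convolutional block of a DenseNet-121 member), not a detail taken from the paper.

```python
# Generic Grad-CAM sketch for heatmaps like those in Figure 5; not the authors'
# exact implementation. "conv_layer_name" must name a convolutional layer of the model.
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray, conv_layer_name: str) -> np.ndarray:
    """Return an H x W heatmap (values in [0, 1]) of importance for the positive class."""
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.input, [conv_layer.output, model.output])
    x = tf.cast(image[np.newaxis, ...], tf.float32)
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)
        score = preds[:, 0]  # sigmoid output for the COVID-19-positive class
    grads = tape.gradient(score, conv_out)
    # Weight each feature map by the spatial mean of its gradient, then combine.
    channel_weights = tf.reduce_mean(grads, axis=(1, 2))
    cam = tf.reduce_sum(conv_out[0] * channel_weights[0], axis=-1)
    cam = tf.nn.relu(cam)                    # keep only regions that push the score up
    cam = cam / (tf.reduce_max(cam) + 1e-8)  # normalize to [0, 1] for display
    return cam.numpy()

# Hypothetical usage with a DenseNet-121 member (layer name is an assumption):
# heatmap = grad_cam(members[0], preprocessed_image, "conv5_block16_concat")
```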

Comparison with Experienced Thoracic Radiologist Interpretations

Three hundred images were selected at random from the hold-out test data set (ensuring only one image per patient and no patient overlap with training or validation sets) for expert interpretation. Expert interpretations were independently provided by five radiologists: four board-certified thoracic radiologists (R.A., N.P., H.S., and B.A.) with 8 years, 6 years, 6 years, and 1 year of posttraining experience, respectively, and one board-certified diagnostic radiologist (G.M.) with 38 years of posttraining experience. Radiologists were blinded to any identifying information or clinical characteristics and had access to the full radiologic study in our picture archiving and communication system (ie, radiologists were able to review lateral images for posteroanterior or lateral studies). Radiologists provided an overall interpretation of positive for COVID-19 or negative for COVID-19 (chest radiographs with abnormalities that were not deemed consistent with COVID-19 were graded as negative for COVID-19) and an associated confidence level with this assessment (graded as low, medium, or high). In this way, we derived a six-point scoring system ranging from −3 (high confidence in COVID-19–negative interpretation) to +3 (high confidence in COVID-19–positive interpretation). In addition to the overall interpretation, this six-point scoring system was used to calculate five separate decision thresholds for each radiologist for comparison to the algorithm. Finally, a consensus interpretation for the five radiologists was determined by taking the majority vote (mode) of the individual interpretations, and receiver operating characteristic curves for the consensus interpretation were produced by calculating an average of the six-point scores for all radiologists on each image.
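
A minimal sketch of this scoring and consensus scheme is shown below, under the assumption that each reader's interpretation is stored as a binary call plus a low/medium/high confidence level; the example data are hypothetical.

```python
# Sketch of the six-point reader score and consensus described above; the reads are
# hypothetical and the data layout (binary call + confidence level) is assumed.
import numpy as np

CONFIDENCE = {"low": 1, "medium": 2, "high": 3}

def six_point_score(call_positive: bool, confidence: str) -> int:
    """Map one reader's interpretation onto the -3..+3 scale (0 is never used)."""
    magnitude = CONFIDENCE[confidence]
    return magnitude if call_positive else -magnitude

# Hypothetical interpretations of one image by five readers.
reads = [(True, "high"), (True, "medium"), (False, "low"), (True, "low"), (False, "medium")]
scores = np.array([six_point_score(call, conf) for call, conf in reads])
binary_calls = (scores > 0).astype(int)

# Consensus label: majority vote of the five binary interpretations.
consensus_label = int(binary_calls.sum() > len(binary_calls) / 2)
# Consensus score: mean of the six-point scores, which can be thresholded to build
# the consensus receiver operating characteristic curve.
consensus_score = scores.mean()
print(consensus_label, consensus_score)  # 1 and 0.6 for this example
```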

Statistical Analysis

For comparison of DeepCOVID-XR and radiologist interpretations, 95% CIs were produced for sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) (using 2000 bootstrap samples). Sensitivity and specificity were compared using the McNemar test for paired samples (25), and AUCs were compared using the DeLong test (26). A two-tailed P value of .05 was considered to indicate statistical significance. Statistical analyses were performed using the packages DTComPair and pROC in R, version 3.6 (R Core Team; R Foundation for Statistical Computing).
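
For readers working in Python rather than R, the sketch below illustrates the same style of comparison on simulated data: a McNemar test on paired sensitivity and a 2000-sample bootstrap 95% CI for the AUC. It does not reproduce the DeLong test or the exact R analysis; the labels and scores are synthetic stand-ins.

```python
# Python illustration (on synthetic data) of the paired comparisons described above;
# the paper's analysis used DTComPair and pROC in R. The DeLong test is not reproduced.
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n = 300
y_true = rng.integers(0, 2, n)                                    # stand-in RT-PCR labels
model_prob = np.clip(y_true * 0.4 + rng.random(n) * 0.6, 0, 1)    # stand-in model scores
model_call = (model_prob > 0.5).astype(int)
reader_call = np.where(rng.random(n) < 0.8, y_true, 1 - y_true)   # stand-in reader calls

# McNemar test on sensitivity: restrict to truly positive cases and cross-tabulate
# correct/incorrect calls by the model against the reader on the same images.
pos = y_true == 1
table = np.array([
    [np.sum((model_call[pos] == 1) & (reader_call[pos] == 1)),
     np.sum((model_call[pos] == 1) & (reader_call[pos] == 0))],
    [np.sum((model_call[pos] == 0) & (reader_call[pos] == 1)),
     np.sum((model_call[pos] == 0) & (reader_call[pos] == 0))],
])
print("McNemar P value:", mcnemar(table, exact=True).pvalue)

# Bootstrap 95% CI for the model AUC (2000 resamples, as in the paper).
aucs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    if len(np.unique(y_true[idx])) == 2:          # both classes must appear in the resample
        aucs.append(roc_auc_score(y_true[idx], model_prob[idx]))
print("AUC:", roc_auc_score(y_true, model_prob), "95% CI:", np.percentile(aucs, [2.5, 97.5]))
```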

Results

Patient Characteristics

A total of 5853 patients (mean age, 58 years ± 19 [standard deviation]; 3101 women; 1782 with positive results for COVID-19) were evaluated across data sets (Table 1). The rate of positivity for COVID-19 among chest radiographs in the hold-out test set (1192 of 2214 [54%]) was higher than in the training (3390 of 11 786 [29%]) and validation (863 of 2992 [29%]) sets. The proportion of patients with positive COVID-19 findings who underwent inpatient treatment was also higher in the test set (237 of 316 [75%]) than in the training (719 of 1142 [63%]) or validation (216 of 324 [67%]) sets. Additionally, there was a higher proportion of anteroposterior images in the test set (97% [2141 of 2214]) than in the training (86% [10 200 of 11 786]) and validation (84% [2502 of 2992]) sets. In the test set, 263 of 1192 (22%) COVID-19–positive images were acquired prior to positive RT-PCR results.

Table 1: Baseline Characteristics of Patients

Performance of DeepCOVID-XR

A performance comparison of individual model architectures and ensemble models is provided in the supplemental materials (Tables E2, E3 [online]). In the hold-out test set of 2214 images, the overall accuracy of DeepCOVID-XR for predicting COVID-19 was 83% (1846 of 2214), with a sensitivity of 75% (898 of 1192), a specificity of 93% (948 of 1022), and an AUC of 0.90 (confusion matrix in Fig 3a, receiver operating characteristic curve in Fig 3b). Notably, 156 of 1192 (13%) COVID-19–positive images were acquired prior to positive RT-PCR results and were accurately labeled by the algorithm as suspicious for COVID-19. Because approximately 5% of patients in the hold-out test set (44 patients contributing 151 images) also underwent chest radiography at one of the institutions in our training or validation sets during the study period, we performed a sensitivity analysis in which these images were dropped from the test set; the results were unchanged (Table E4 [online]). The most representative images from each class of DeepCOVID-XR predictions (true-positive, true-negative, false-positive, and false-negative predictions) are provided in Figure 4. Gradient class activation mapping heatmaps of feature importance for individual chest radiographs are provided in Figure 5. Heatmaps for COVID-19–positive images highlighted areas of abnormality in the lungs (Fig 5a–5c), in contrast to the heatmaps for COVID-19–negative images (Fig 5d).

Figure 3: Performance of DeepCOVID-XR on hold-out test set of 2214 images. (a) Confusion matrix of algorithm predictions and (b) receiver operating characteristic (ROC) curve (red line) show discriminative performance of the algorithm, with an area under the ROC curve (AUC) of 0.90 and a prediction threshold for coronavirus disease 2019 (COVID-19) positivity (red point) with a sensitivity (Sens) of 75% (898 of 1192) and a specificity (Spec) of 93% (948 of 1022). Neg = negative, pos = positive.

Figure 4: Sample of most representative images from different classes of DeepCOVID-XR predictions relative to the reference standard. Images classified as positive for coronavirus disease 2019 (COVID-19) by the algorithm (whether true-positive or false-positive findings) tend to have the typical features of COVID-19 pneumonia described in the literature, including patchy bilateral consolidations and ground-glass opacities with peripheral and lower-lung predominance. Images classified as negative for COVID-19 by the algorithm tend to show clear lungs, concomitant pleural effusions, or both, which are reported to be rare in COVID-19 pneumonia.

Figure 5: Gradient class activation mapping (Grad-CAM) heatmaps of feature importance for prediction of a positive coronavirus disease 2019 (COVID-19) finding. Generated heatmaps appropriately highlight abnormalities in the lungs in (a–c) those images accurately labeled as depicting positive COVID-19 findings, in contrast to (d) images that were accurately labeled as depicting negative COVID-19 findings. The intensity of colors on the heatmap corresponds to features of the image that are important for prediction of COVID-19 positivity.

Comparison with Expert Thoracic Radiologists

A comparison of the performance of DeepCOVID-XR with expert thoracic radiologist interpretations of 300 patients' chest radiographs (134 with positive COVID-19 results) randomly selected from the hold-out test set is provided in Table 2. The overall accuracy of DeepCOVID-XR on this test set was 82% (247 of 300) compared with the reference standard of RT-PCR, whereas the accuracy of individual radiologists ranged from 76% (227 of 300) to 81% (242 of 300) and the accuracy of the consensus interpretation of all five radiologists was 81% (242 of 300). DeepCOVID-XR had a significantly higher specificity (92% [152 of 166]) than two of the radiologists (75% [125 of 166], P < .001; 84% [139 of 166], P = .009) and a significantly higher sensitivity (71% [95 of 134]) than one radiologist (60% [81 of 134]; P < .001). A comparison of the receiver operating characteristic curve for DeepCOVID-XR with overall individual radiologist interpretations is provided in Figure 6a. A comparison of DeepCOVID-XR with individual radiologists at each of the five decision thresholds derived from the six-point scoring system is provided in Table E5 (online) and Figure E1 (online).

Table 2: Performance of DeepCOVID-XR on Random Sample of 300 Images from Test Set Compared with Expert Thoracic Radiologists’ Interpretations and Consensus Radiologist Interpretation

Figure 6: Comparison with expert radiologist interpretations. (a) Comparison of the performance of DeepCOVID-XR with individual expert radiologist interpretations on a random sample of 300 images from the test set. For DeepCOVID-XR, the receiver operating characteristic (ROC) curve (red line) and decision threshold for overall interpretation of coronavirus disease 2019 (COVID-19) positivity or negativity (red point) are plotted with a 95% CI (red shaded area). For each radiologist, the overall interpretation sensitivity (Sens) and specificity (Spec) are plotted with 95% CIs (dashed lines). Radiologist 1 (Rad1) = blue square, radiologist 2 (Rad2) = gray downward arrow, radiologist 3 (Rad3) = green upward arrow, radiologist 4 (Rad4) = cyan diamond, radiologist 5 (Rad5) = magenta X. (b) Comparison of DeepCOVID-XR with the consensus interpretation of all five radiologists: ROC curves (lines) and decision thresholds (points) for DeepCOVID-XR (red) and the consensus interpretation (purple) (area under the ROC curve [AUC], 0.88 vs 0.85; P = .13). (c) ROC curve shows performance of DeepCOVID-XR (red line) using the consensus interpretation of all five radiologists rather than reverse-transcription polymerase chain reaction (RT-PCR) as the reference standard.

The AUC for DeepCOVID-XR was 0.88 (95% CI: 0.84, 0.92) compared with 0.85 (95% CI: 0.80, 0.89) for the consensus interpretation of all five radiologists (P = .13 for comparison) (Fig 6b). When using the consensus interpretation rather than the RT-PCR assay as the reference standard, the AUC for DeepCOVID-XR was 0.95 (95% CI: 0.92, 0.98) (Fig 6c). The time to analyze this subset of 300 images with DeepCOVID-XR on one NVIDIA Titan V graphics processing unit was approximately 18 minutes, compared with approximately 2.5–3.5 hours for each expert radiologist.

Discussion

In this study, we present DeepCOVID-XR, an ensemble deep learning artificial intelligence (AI) algorithm to detect coronavirus disease 2019 (COVID-19) on chest radiographs. DeepCOVID-XR was trained and tested on, to our knowledge, the largest clinical data set of chest radiographs from the COVID-19 era used to train any published AI platform to date, including images from multiple institutions across a large U.S. health care system (17 002 images from 5853 patients in total). Of note, study patients were representative of a real-world population of patients presenting for emergency or inpatient care in the COVID-19 era; a proportion of patients with negative COVID-19 test results likely had a spectrum of non–COVID-19–related abnormalities on chest radiographs (including confounding conditions such as non–COVID-19 viral pneumonia) that one would expect in this patient population. On a hold-out test data set of 2214 images (1192 positive for COVID-19) from a single institution that the algorithm was not exposed to during model development, DeepCOVID-XR detected COVID-19 with an overall accuracy of 83% (sensitivity, 75%; specificity, 93%) and an area under the receiver operating characteristic curve (AUC) of 0.90. Additionally, on a random sample of 300 test images, the accuracy of DeepCOVID-XR was 82% compared with 76%–81% for individual experienced thoracic radiologists and 81% for the consensus interpretation of all five radiologists. Finally, the AUC for DeepCOVID-XR was 0.88 compared with 0.85 for the consensus interpretation (P = .13). When using the consensus radiologist interpretation rather than reverse-transcription polymerase chain reaction as the reference standard, the AUC for DeepCOVID-XR was 0.95, suggesting that the discriminative ability of our algorithm is similar to that of a consensus of experts.

Errors made by the algorithm were explainable. Images categorized as demonstrating positivity (whether true or false positivity) often depicted characteristic features of COVID-19 viral pneumonia that have previously been reported in the literature, including bilateral consolidations and ground-glass opacities with lower lung and peripheral predominance (27,28). In contrast, images categorized as demonstrating negativity by the algorithm (whether true or false negativity) often depicted clear lungs, concomitant pleural effusions, or both; interestingly, pleural effusions have been previously found to be quite rare (3%) in COVID-19–related pneumonia (27). Visualization of feature importance using gradient class activation mapping heatmaps revealed abnormalities in the lungs to be highly predictive of COVID-19 on chest radiographs, as expected. This serves as an important sanity check to reinforce confidence in algorithm predictions.

The explainable errors in algorithm predictions likely represent limitations of chest imaging in the radiologic diagnosis of COVID-19 rather than limitations of the algorithm itself. Prior clinical studies showed that COVID-19 pneumonia produces characteristic features on chest images, but up to 56% of symptomatic patients can have normal findings on chest images, especially early in their disease course (9,27,29–31). Thus, imaging is inappropriate to rule out disease. In addition, many of the COVID-19 imaging findings are nonspecific and may overlap with findings characteristic of other conditions, particularly other viral pneumonias (32). Chest imaging should therefore not be used as a diagnostic tool for COVID-19 but could play an important role in earlier identification of patients likely to have the disease to aid in triage and infection control. Interestingly, Wong et al (27) found that approximately 9% of patients had abnormal imaging findings before they received a positive RT-PCR result, a proportion similar to the 13% (156 of 1192) of COVID-19–positive images in our study obtained prior to the patient’s positive RT-PCR result and flagged as depicting COVID-19 positivity by our algorithm.

A number of groups from industry and academic sectors have published studies and non–peer-reviewed preprint reports with claims of extremely high sensitivity and specificity of AI algorithms in the detection of COVID-19 on chest radiographs (11–14). However, most of these studies have been limited by small sample sizes or have relied on images from publicly available data sets containing a mix of images from research articles and clinical reports of variable quality and questionable image-label accuracy (15). These data sets are subject to significant bias (33) and are simply not sufficient to train an algorithm ultimately intended for clinical use.

Murphy et al (16) presented a deep learning algorithm for detection of COVID-19 on chest radiographs that included both pre–COVID-19 era images for model pretraining and a data set composed of 606 clinical images for training and 468 clinical images for testing from patients at two Dutch hospitals during the COVID-19 era. The authors used a commercial patch-based CNN called CAD4COVID–x-ray, which had an AUC of 0.81 for predicting COVID-19 on a hold-out test set from one institution. By contrast, our model was trained and tested on more than 15 times the number of clinical images from the COVID-19 era. The discriminative performance of DeepCOVID-XR on an independent hold-out test set was superior to that reported for CAD4COVID–x-ray (AUC, 0.90 vs 0.81). Although differences in patient populations may partly account for this difference in performance, the AUC of our algorithm (ie, 0.95) when compared with a consensus of radiologists was far higher than that reported by Murphy et al (AUC range, 0.81–0.86), suggesting that DeepCOVID-XR more reliably produces predictions in line with the ground-truth radiologic diagnosis as determined by a consensus of experts. Finally, although the authors made their software available for use through a cloud-based interface, no details regarding algorithm development or code were made available, given the proprietary nature of their platform. We are freely providing all our code and pretrained neural network and model ensemble weights for open-source use in an effort to move toward a democratized approach to model development (https://github.com/IVPLatNU/deepcovidxr).

Our study had some limitations. First, the algorithm was evaluated on only those patients who were tested for COVID-19; thus, there was likely some degree of selection bias. Second, the performance of our algorithm was compared with that of RT-PCR assays as a reference standard, which has somewhat limited sensitivity itself because of sampling error or viral mutation (34). Finally, it is unclear how well the algorithm performs when COVID-19 is not the dominant viral pneumonia, as the study was performed at a time of considerable case load in our health care system.

In conclusion, DeepCOVID-XR is a deep learning artificial intelligence (AI) algorithm that detects coronavirus disease 2019 (COVID-19) on chest radiographs. The algorithm was trained and tested on a large U.S. clinical data set with performance similar to that of the consensus interpretation of experienced thoracic radiologists. We feel that this algorithm has the potential to benefit health care systems in mitigating unnecessary exposure to the virus by serving as an automated tool to rapidly flag patients with suspicious chest imaging findings for isolation and further testing. Planned future studies include a prospective evaluation of the algorithm (including in those patients not suspected of having COVID-19), a necessity for any AI algorithm prior to clinical implementation. In addition, we plan to incorporate other clinical data (eg, demographics, vital signs, laboratory data) into the algorithm to further boost the performance and adapt the algorithm for risk prediction of clinically meaningful outcomes in patients with confirmed COVID-19. By providing the DeepCOVID-XR algorithm code base as an open-source resource, we hope investigators around the world will further improve, fine-tune, and test the algorithm using clinical images from their own institutions.

Disclosures of Conflicts of Interest: R.M.W. disclosed no relevant relationships. J.S. disclosed no relevant relationships. S.D. disclosed no relevant relationships. S.C. disclosed no relevant relationships. A.D. disclosed no relevant relationships. S.B. disclosed no relevant relationships. Y.W. disclosed no relevant relationships. D.R.C. disclosed no relevant relationships. N.X. disclosed no relevant relationships. B.D.A. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: received personal fees from Tempus Labs. Other relationships: disclosed no relevant relationships. G.A.M. disclosed no relevant relationships. H.S. disclosed no relevant relationships. R.A. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is a speaker for Boehringer Ingelheim. Other relationships: disclosed no relevant relationships. N.P. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is a speaker for Boehringer Ingelheim. Other relationships: disclosed no relevant relationships. A.K.K. disclosed no relevant relationships.

Acknowledgments

The authors acknowledge Scott M. Leonard for his assistance with transferring deidentified studies to our picture archiving and communications system for review by radiologists.

Author Contributions

Author contributions: Guarantors of integrity of entire study, R.M.W., Y.W., D.R.C., N.P., A.K.K.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, R.M.W., J.S., S.D., S.C., A.D., S.B., Y.W., N.X., H.S., N.P., A.K.K.; clinical studies, R.M.W., S.C., Y.W., D.R.C., N.X., B.D.A., H.S., N.P.; statistical analysis, R.M.W., J.S., S.D., S.C., A.D., S.B., Y.W., D.R.C., A.K.K.; and manuscript editing, R.M.W., J.S., S.D., S.C., A.D., S.B., Y.W., D.R.C., N.X., H.S., N.P., A.K.K.

References

1. Gardner L, Ensheng D. Johns Hopkins Center for Systems Science and Engineering (CSSE) COVID-19 dashboard. Johns Hopkins University Web site. https://coronavirus.jhu.edu/map.html. Published 2020. Accessed August 11, 2020.
2. Weinstock MB, Echenique A, Russell JW, et al. Chest x-ray findings in 636 ambulatory patients with COVID-19 presenting to an urgent care center: a normal chest x-ray is no guarantee. J Urgent Care Med 2020;14(7):13–18.
3. Hope MD, Raptis CA, Shah A, Hammer MM, Henry TS; Six Signatories. A role for CT in COVID-19? What data really tell us so far. Lancet 2020;395(10231):1189–1190.
4. Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal. BMJ 2020;369:m1328. [Published correction appears in BMJ 2020;369:m2204.]
5. Shi F, Wang J, Shi J, et al. Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. IEEE Rev Biomed Eng 2020. https://doi.org/10.1109/RBME.2020.2987975. Published online April 16, 2020. Accessed July 14, 2020.
6. Li L, Qin L, Xu Z, et al. Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy. Radiology 2020;296(2):E65–E71.
7. Huang L, Han R, Ai T, et al. Serial quantitative chest CT assessment of COVID-19: deep-learning approach. Radiol Cardiothorac Imaging 2020;2(2):e200075.
8. ACR recommendations for the use of chest radiography and computed tomography (CT) for suspected COVID-19 infection. American College of Radiology Web site. https://www.acr.org/Advocacy-and-Economics/ACR-Position-Statements/Recommendations-for-Chest-Radiography-and-CT-for-Suspected-COVID19-Infection. Updated March 22, 2020. Accessed August 11, 2020.
9. Cleverley J, Piper J, Jones MM. The role of chest radiography in confirming COVID-19 pneumonia. BMJ 2020;370:m2426.
10. Jacobi A, Chung M, Bernheim A, Eber C. Portable chest x-ray in coronavirus disease-19 (COVID-19): a pictorial review. Clin Imaging 2020;64:35–42.
11. Yi PH, Kim TK, Lin CT. Generalizability of deep learning tuberculosis classifier to COVID-19 chest radiographs: new tricks for an old algorithm? J Thorac Imaging 2020;35(4):W102–W104.
12. Oh Y, Park S, Ye JC. Deep learning COVID-19 features on CXR using limited training data sets. IEEE Trans Med Imaging 2020;39(8):2688–2700.
13. Castiglioni I, Ippolito D, Interlenghi M, et al. Artificial intelligence applied on chest x-ray can aid in the diagnosis of COVID-19 infection: a first experience from Lombardy, Italy. medRxiv 2020.
14. Wang L, Zhong QL, Wong A. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest x-ray images. arXiv 2003.09871 [preprint]. https://arxiv.org/abs/2003.09871. Posted March 22, 2020. Accessed July 29, 2020.
15. Cohen JP, Morrison P, Dao L. COVID-19 image data collection. arXiv 2003.11597 [preprint]. https://arxiv.org/abs/2003.11597. Posted March 25, 2020. Accessed July 19, 2020.
16. Murphy K, Smits H, Knoops AJG, et al. COVID-19 on chest radiographs: a multireader evaluation of an artificial intelligence system. Radiology 2020;296(3):E166–E172.
17. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2017; 2261–2269.
18. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2016; 770–778. https://doi.org/10.1109/CVPR.2016.90.
19. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2016; 2818–2826. https://doi.org/10.1109/CVPR.2016.308.
20. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. Menlo Park, CA: Association for the Advancement of Artificial Intelligence, 2017; 4278–4284.
21. Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2017; 1800–1807.
22. Tan M, Le QV. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning. Cambridge, MA: PMLR, 2019;97:6105–6114.
23. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2017; 3462–3471. https://doi.org/10.1109/CVPR.2017.369.
24. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2017; 618–626. https://doi.org/10.1109/ICCV.2017.74.
25. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947;12(2):153–157.
26. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
27. Wong HYF, Lam HYS, Fong AHT, et al. Frequency and distribution of chest radiographic findings in patients positive for COVID-19. Radiology 2020;296(2):E72–E78.
28. Ng MY, Lee EY, Yang J, et al. Imaging profile of the COVID-19 infection: radiologic findings and literature review. Radiol Cardiothorac Imaging 2020;2(1):e200034.
29. Vancheri SG, Savietto G, Ballati F, et al. Radiographic findings in 240 patients with COVID-19 pneumonia: time-dependence after the onset of symptoms. Eur Radiol 2020;30(11):6161–6169.
30. Salehi S, Abedi A, Balakrishnan S, Gholamrezanezhad A. Coronavirus disease 2019 (COVID-19): a systematic review of imaging findings in 919 patients. AJR Am J Roentgenol 2020;215(1):87–93.
31. Bernheim A, Mei X, Huang M, et al. Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection. Radiology 2020;295(3):200463.
32. Bai HX, Hsieh B, Xiong Z, et al. Performance of radiologists in differentiating COVID-19 from non-COVID-19 viral pneumonia at chest CT. Radiology 2020;296(2):E46–E54.
33. DeGrave AJ, Janizek JD, Lee SI. AI for radiographic COVID-19 detection selects shortcuts over signal. medRxiv 2020.09.13.20193565 [preprint]. https://doi.org/10.1101/2020.09.13.20193565. Posted October 8, 2020. Accessed September 20, 2020.
34. Watson J, Whiting PF, Brush JE. Interpreting a COVID-19 test result. BMJ 2020;369:m1808.

Article History

Received: Aug 21 2020
Revision requested: Sept 30 2020
Revision received: Oct 26 2020
Accepted: Oct 30 2020
Published online: Nov 24 2020
Published in print: Apr 2021