Introduction

The coronavirus disease-2019 (COVID-19) outbreak is currently straining numerous health systems throughout the world. Timely identification of SARS-CoV-2 positive patients is proving to be the key to containing the spread of the virus effectively. Reverse transcriptase-polymerase chain reaction (RT-PCR) is considered the reference standard for the diagnosis of COVID-19. However, a non-negligible number of initial false-negative results have been reported [1,2,3], and many areas throughout the world are facing a relative shortage or unavailability of RT-PCR tests [4].

On the other hand, the role of CT in the workup of patients with suspected COVID-19 is still under debate. Whereas the early evidence from China reported a high sensitivity [5], suggesting that CT could play a pivotal role in the rapid detection of infected patients, many scientific societies have recommended not to use CT for disease screening, especially in asymptomatic patients, due to its low specificity in differentiating COVID-19 from other viral infections that could present with overlapping imaging features [6, 7]. Moreover, as underlined in two recently published meta-analyses, the high variability of CT findings considered to define a CT scan consistent with SARS-CoV-2 pneumonia introduced a relevant bias hampering the applicability of results and leading to an underestimation of the real specificity of CT [1, 8].

In this scenario, the introduction of standardized reporting systems for patients with suspected COVID-19 has been extensively advocated with the aim to improve communication with referring physicians and make CT diagnostic accuracy data comparable across different institutions to gather scientific evidence [9, 10]. Trying to accomplish this task, the Dutch Radiological Society has recently developed the COVID-19 Reporting and Data System (CO-RADS) and tested it on a sample of symptomatic patients, demonstrating a satisfying diagnostic yield for predicting COVID-19 with substantial interreader agreement [11].

However, to promote its introduction into clinical practice, a more extensive validation of the CO-RADS should be provided, especially considering a larger patient sample and radiologists with different levels of experience as readers.

Therefore, the purpose of our study was to investigate, in a large sample of consecutive patients and through the analysis of multiple readers with different levels of expertise, whether the CO-RADS score is a valid tool for identifying lung involvement in patients with suspected COVID-19.

Materials and methods

This retrospective, single-center study was approved by our institutional review board, and written informed consent was waived. Reporting was done in accordance with the Standards for Reporting Diagnostic Accuracy Initiative (STARD) recommendations [12].

Study participants

We retrospectively reviewed medical records from our university hospital archives to search for consecutive patients who underwent chest CT and RT-PCR testing for suspected COVID-19, from March 9, 2020, to May 3, 2020. Chest CT and RT-PCR testing were performed for clinical suspicion of COVID-19 based on the presence of at least one of the following respiratory tract infection symptoms: (I) fever > 37.5 °C, (II) cough, and (III) clinically relevant dyspnea, with or without a history suggestive of exposure to SARS-CoV-2 including (a) close relationship with a confirmed positive individual, (b) travel or residential history in areas with high prevalence of disease, or (c) contact with individuals with fever or respiratory symptoms from those areas within 14 days prior to CT scan.

Exclusion criteria were as follows: (I) lack of RT-PCR testing results, (II) time interval between CT scan and RT-PCR longer than 7 days, and (III) uninterpretable CT scans due to motion artifacts or incomplete scanning.

CT Technique and image analysis

All CT scans were performed using a single 64-slice CT scanner (LightSpeed VCT, GE Healthcare). All patients were scanned in the supine position during a single deep-inspiration breath-hold. No contrast medium was administered. Scanning parameters were as follows: tube voltage of 100 or 120 kV according to the patient’s body size, variable tube current with automatic mAs modulation (Smart mA, GE Healthcare), 0.6-mm section thickness and a pitch of 1.388, and iterative reconstruction techniques (ASIR) at 40%.

All CT scans were retrieved from the Picture Archiving and Communication Systems, anonymized and uploaded onto a dedicated workstation (SuiteEstensa 2.0, EBIT - Esaote Group Company) for image analysis.

Each CT scan was independently analyzed by twelve readers, stratified into four different groups according to their experience as follows: high-experience group (R1, R2, and R3 [D.B., U.D., and E.D.], board-certified radiologists with more than 10 years of experience in thoracic imaging and more than 100 COVID-19 positive CTs reported); intermediate-experience group (R4, R5, and R6 [S.M., M.G.C., and M.I.], board-certified radiologists with more than 50 and less than 100 COVID-19-positive CTs reported); low-experience group (R7, R8, and R9 [M.D.I., F.G., and E.O.], radiologists in-training with less than 50 COVID-19-positive CTs reported); and group of radiographers (R10, R11, and R12 [S.P., V.C., and C.G.], all with a background of more than 50 CTs performed on COVID-19-positive patients). A training set of 30 CTs, in which findings corresponding to each CO-RADS category were equally distributed, was provided to each reader. Furthermore, all readers had a general familiarity with CO-RADS, having adopted it at our institution since its introduction, approximately a month before the start of our study.

All readers scored each CT scan assigning a CO-RADS category reflecting their overall suspicion of COVID-19 lung involvement as follows: CO-RADS 1, very low probability; CO-RADS 2, low probability; CO-RADS 3, equivocal/unsure probability; CO-RADS 4, high probability; and CO-RADS 5, very high probability. For a detailed description of all the CT findings associated with each CO-RADS category, please refer to the original paper by Prokop et al [11]. All readers were blinded to the RT-PCR results, to the clinical information and radiological reports of individual patients, and to the disease prevalence in the study sample.

Reference standard

RT-PCR testing performed on respiratory specimens obtained by nasopharyngeal and throat swabs served as a reference standard for the diagnosis of COVID-19. Clinical information and index test results were not available to the assessors of the reference standard. As per our institution guidelines, patients with initial negative RT-PCR, but CT findings suggestive of COVID-19, underwent repeated RT-PCR testing up to a maximum of three times within 7 days after CT scan. Patients who showed at least one positive RT-PCR were considered to be positive for COVID-19; otherwise, they were considered negative. Conversely, patients with initial negative RT-PCR and negative CT findings underwent a 14-day follow-up and were considered to be negative if no worsening of symptoms or laboratory findings consistent with COVID-19 occurred.

Statistical analysis

Categorical variables were expressed as frequencies or percentages. Continuous variables were expressed as means ± standard deviations (SD). The χ² test was used to assess differences in sex, symptoms, and number of RT-PCR tests between COVID+ and COVID− participants. The Mann-Whitney U test was performed to assess differences in age between the two groups.

Fleiss’ kappa statistics were used to evaluate interreader agreement for CO-RADS rating both among all readers and within each group of readers. The following interpretation thresholds were applied: κ ≤ 0.20, slight agreement; κ = 0.21–0.40, fair agreement; κ = 0.41–0.60, moderate agreement; κ = 0.61–0.80, substantial agreement; and κ = 0.81–1.0, almost perfect agreement [13].
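As an illustration of the agreement statistic described above, the following is a minimal sketch of Fleiss' kappa together with the interpretation scale listed in the text. It is not the MedCalc implementation used in the study; the function names `fleiss_kappa` and `interpret_kappa` and the toy ratings are assumptions made for this example only.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of subjects, each rated by the same number of raters.

    ratings: list of lists, one inner list of category labels per subject.
    categories: all possible category labels (e.g. CO-RADS 1 to 5).
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who assigned subject i to category j
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally.get(cat, 0) for cat in categories])

    # Share of all assignments that fell in each category
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]

    # Observed agreement per subject, then averaged over subjects
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Agreement expected by chance
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement labels listed in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data (invented): 4 scans, 3 readers, two categories only
toy = [[1, 1, 2], [2, 2, 2], [1, 2, 2], [1, 1, 1]]
k = fleiss_kappa(toy, categories=[1, 2])
print(round(k, 3), interpret_kappa(k))
```

In the study setting, each inner list would hold the twelve CO-RADS scores assigned to one CT scan, and `categories` would be the five CO-RADS levels.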

For each reader, the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) were calculated by using the DeLong et al method [14], to assess the CO-RADS diagnostic performance. Mean AUC across observers from the four different readers’ groups and their corresponding 95% confidence intervals (95% CI) were computed, and a pairwise comparison of AUCs from all readers was performed by means of the DeLong et al method [14].
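For an ordinal score such as CO-RADS, the nonparametric AUC point estimate equals the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes only that point estimate; it does not reproduce the confidence intervals or pairwise tests of the DeLong method, and the function name `empirical_auc` and the toy scores are assumptions for illustration.

```python
def empirical_auc(scores_pos, scores_neg):
    """Nonparametric AUC: P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties are frequent with a 5-level ordinal score
    return wins / (len(scores_pos) * len(scores_neg))


# Toy CO-RADS scores (invented) for three PCR+ and three PCR- patients
auc = empirical_auc(scores_pos=[5, 4, 3], scores_neg=[1, 2, 3])
print(round(auc, 3))
```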

For each reader, the highest Youden index (J = sensitivity + specificity − 1) was calculated to select the optimal threshold to discriminate between COVID+ and COVID− participants, and the corresponding sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were computed. Inconclusive results (i.e., CO-RADS 3) were included in the analysis of the diagnostic performance, and whether to treat them as positive or negative results depended on the ROC curve and Youden’s index analysis results.
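The threshold selection described above can be sketched as follows: each candidate CO-RADS cut-off is applied as "score ≥ threshold means positive", and the cut-off with the highest J is retained. This is a minimal sketch under stated assumptions, not the software routine used in the study; the function names and the toy data are hypothetical.

```python
def confusion_at_threshold(scores, labels, threshold):
    """Count TP, FP, TN, FN when 'score >= threshold' is called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        called_positive = score >= threshold
        if called_positive and label == 1:
            tp += 1
        elif called_positive and label == 0:
            fp += 1
        elif not called_positive and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from a 2x2 table."""
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,
        "npv": tn / (tn + fn) if tn + fn else nan,
    }


def best_youden_threshold(scores, labels, thresholds=(2, 3, 4, 5)):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best = None
    for t in thresholds:
        m = diagnostic_metrics(*confusion_at_threshold(scores, labels, t))
        j = m["sensitivity"] + m["specificity"] - 1
        if best is None or j > best[1]:
            best = (t, j)
    return best


# Toy data (invented): CO-RADS scores and RT-PCR status (1 = positive)
scores = [1, 1, 2, 3, 3, 4, 2, 4, 4, 5, 5]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
threshold, j = best_youden_threshold(scores, labels)
print(threshold, round(j, 3))
print(diagnostic_metrics(*confusion_at_threshold(scores, labels, threshold)))
```

Because CO-RADS 3 is simply one of the score levels, whether it counts as positive or negative follows from the selected cut-off: with a threshold of ≥ 4 it is treated as negative.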

In addition, the number and the percentage of readings assigned to each CO-RADS category were determined for both COVID+ and COVID− participants. False-positive CO-RADS 4 and 5 patients and false-negative CO-RADS 1 and 2 patients were subsequently investigated to clarify the reason for erroneous classification.

In all cases, p < 0.05 was considered the threshold for assessing statistical significance. All statistical analyses were performed with commercially available software (MedCalc Statistical Software version 19.2.5, MedCalc Software Ltd).

Results

Study participants

Figure 1 portrays the study participants’ STARD flowchart. From 789 consecutive patients initially screened from our database, we included in the final study sample 572 patients (mean age, 63 years ± 20 standard deviation [SD]; range, 8–96 years; including 332 men [mean age, 62 years ± 19 SD; range, 8–96 years] and 240 women [mean age, 64 years ± 21 SD; range, 12–96 years]). Two hundred and seventeen patients were excluded due to the lack of reference standard (n = 25), interval between CT scan and reference standard longer than 1 week (n = 187), or uninterpretable CT scans (n = 5). Detailed baseline characteristics of the patient study groups are reported in Table 1.

Fig. 1

Study flowchart for the inclusion and exclusion criteria of the patient sample and CO-RADS ratings. Note: CO-RADS, COVID-19 Reporting and Data System; COVID-19, coronavirus disease-2019; RT-PCR, reverse transcriptase-polymerase chain reaction; PCR+, single or multiple RT-PCR testing positive for SARS-CoV-2 infection; mPCR−, multiple negative RT-PCR testing; sPCR-/FU-, single negative RT-PCR testing and negative follow-up during the 14 days following the CT scan

Table 1 Patient characteristics

Reference standard

Three hundred and fifty-five participants underwent a single RT-PCR test, whereas 217 participants underwent multiple RT-PCR tests, including 85 participants who were tested twice and 132 participants who underwent three consecutive RT-PCR tests. The median time interval between CT scan and reference standard was 2 days (range 0–7).

One hundred and forty-two patients had at least one RT-PCR test positive for SARS-CoV-2 infection, resulting in a disease prevalence in our study sample of 24.8%. Four hundred and thirty patients tested negative for SARS-CoV-2 infection, including 91 patients with multiple negative RT-PCR tests and 339 (78.8%) patients with initial negative RT-PCR and negative follow-up during the 14 days following the CT scan.
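The cohort counts reported in this and the preceding subsection can be cross-checked with simple arithmetic. The snippet below is only a bookkeeping sketch; all numbers are copied from the text, and the variable names are chosen for this example.

```python
# Counts copied from the text
screened = 789
excluded = {
    "no reference standard": 25,
    "CT to RT-PCR interval > 7 days": 187,
    "uninterpretable CT": 5,
}
included = screened - sum(excluded.values())

# Number of patients by number of RT-PCR tests performed
tests_per_patient = {1: 355, 2: 85, 3: 132}

pcr_positive = 142
multiple_negative = 91
single_negative_with_follow_up = 339
pcr_negative = multiple_negative + single_negative_with_follow_up

prevalence = pcr_positive / included
follow_up_share = single_negative_with_follow_up / pcr_negative

print(included, sum(tests_per_patient.values()), pcr_positive + pcr_negative)
print(f"prevalence: {100 * prevalence:.1f}%")
print(f"single negative RT-PCR with follow-up: {100 * follow_up_share:.1f}% of negatives")
```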

Interrater agreement

There was a moderate agreement for overall CO-RADS rating among all observers (Fleiss’ K = 0.43 [95% CI 0.42–0.44]). Regarding each individual CO-RADS category, there was a substantial agreement for CO-RADS 1 category (Fleiss’ K = 0.61 [95% CI 0.60–0.62]), moderate agreement for CO-RADS 5 category (Fleiss’ K = 0.60 [95% CI 0.58–0.61]), fair agreement for CO-RADS categories 2 and 4 (Fleiss’ K = 0.28 [95% CI 0.27–0.29] and 0.23 [95% CI 0.22–0.24], respectively), and slight agreement for CO-RADS 3 category (Fleiss’ K = 0.16 [95% CI 0.15–0.17]). Fleiss’ K values among each group of readers were as follows: high-experience group: 0.54 (95% CI 0.52–0.57), intermediate-experience group: 0.38 (95% CI 0.36–0.41), low-experience group: 0.66 (95% CI 0.64–0.69), and group of radiographers: 0.30 (95% CI 0.27–0.32).

CO-RADS diagnostic performance

Receiver operating characteristic analysis identified a CO-RADS score ≥ 4 as the optimal threshold to discriminate patients with PCR+ from those with PCR−. Considering the high-experience reader group and a CO-RADS score ≥ 4 as the threshold, the average area under the curve was 0.74 (95% CI 0.67–0.79) (Fig. 2), with a sensitivity of 59.1% (95% CI 52–67) and a specificity of 84.1% (95% CI 81–87).

Fig. 2

Receiver operating characteristic curve of high-experience group (a), intermediate-experience group (b), low-experience group (c), and group of radiographers (d) for predicting lung involvement by coronavirus disease-2019 using the COVID-19 Reporting and Data System (CO-RADS). Note: AUC, area under the curve

The average value of area under the curve for intermediate-experience readers was 0.70 (95% CI 0.66–0.76) (Fig. 2), with a sensitivity of 62% (95% CI 54.2–69.9) and a specificity of 78.7% (95% CI 74.5–82.5). The average value of area under the curve for low-experience readers was 0.73 (95% CI 0.67–0.79) (Fig. 2), with a sensitivity of 62.7% (95% CI 53.2–70.1) and a specificity of 79.3% (95% CI 75.1–83). Averaging data from radiographers, we obtained an area under the curve of 0.68 (95% CI 0.62–0.75) (Fig. 2), a sensitivity of 40.8% (95% CI 33–49), and a specificity of 87.3% (95% CI 83.8–90.3).

The diagnostic accuracy values for all readers are supplied in Table 2. When the datasets from all radiologists were pooled, the optimal diagnostic threshold of ≥ 4 yielded an area under the curve of 0.72 (95% CI 0.66–0.78) with a sensitivity of 61% (95% CI 52–69%) and a specificity of 81% (95% CI 77–84%). Detailed results from pairwise comparison of the different readers’ AUC are displayed in Table E1 of the supplemental material; briefly, there was no difference in diagnostic accuracy between readers with the highest AUC from the three different radiologists’ groups (p > 0.05 for all comparisons).
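To illustrate how sensitivity, specificity, and prevalence combine into predictive values, the sketch below applies Bayes' rule to the pooled radiologist estimates (sensitivity 61%, specificity 81%) and the study prevalence (142/572). The resulting figures are derived arithmetic for illustration only and are not the per-reader values reported in Table 2; the function name `predictive_values` is an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Pooled estimates and prevalence taken from the text
ppv, npv = predictive_values(sensitivity=0.61, specificity=0.81, prevalence=142 / 572)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Under these inputs the positive predictive value comes out at roughly one half and the negative predictive value at roughly 0.86, which shows how strongly both depend on the prevalence in the tested population.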

Table 2 Readers diagnostic performance for lung involvement by COVID-19 when CO-RADS ≥ 4 was used as a positive threshold

We recorded a total of 6864 CO-RADS scorings from twelve readers (Fig. 3). Among them, 911 (8.9%) readings in 185 patients negative for SARS-CoV-2 infection were scored as CO-RADS 4 or 5 and resulted in false-positive chest CT. The post hoc analysis of those cases showed that 37.8% (70/185) actually displayed a CO-RADS 4 or 5 pattern, whereas 62.2% (115/185) were erroneous evaluations by readers. False-negative CO-RADS 1 and CO-RADS 2 scores were found in 586 (8.5%) readings (96 patients). Among them, 44% (42/96) were correctly categorized as CO-RADS 1 or 2, while 56% (54/96) were erroneous evaluations. Representative CT images for each CO-RADS category are illustrated in Fig. 4.

Fig. 3

Distribution of final diagnosis among each cumulative CO-RADS score category. Red columns show the percentage of patients with positive RT-PCR (PCR+), dark green columns show the percentage of patients with multiple negative RT-PCR (mPCR−), and light green columns show the percentage of patients with initial negative RT-PCR and negative clinical follow-up the 14 days after CT scan (sPCR-/FU-). Note: CO-RADS, COVID-19 Reporting and Data System

Fig. 4

Pictorial overview portraying axial chest CT images from our study population illustrating imaging findings characteristics of the CO-RADS 1 (a, b), CO-RADS 2 (c, d), CO-RADS 3 (e, f), CO-RADS 4 (g, h), and CO-RADS 5 (i, j, k, l) scores and their corresponding descriptions. Note: CO-RADS, COVID-19 Reporting and Data System

Discussion

Since the early stage of the coronavirus disease-2019 (COVID-19) outbreak, the value of CT in the workup of patients with suspected SARS-CoV-2 infection has been controversial [15, 16]. The adoption of standardized reporting systems for lung involvement in COVID-19 has been encouraged in an effort to enhance communication with referring physicians and multicentric gathering of CT diagnostic accuracy data [9,10,11]. In this retrospective multireader diagnostic accuracy study, we aimed to validate the proposed COVID-19 Reporting and Data System (CO-RADS). Our results show that, when a threshold of CO-RADS ≥ 4 was used, readers with different levels of expertise were able to discriminate in approximately 30 s between patients with positive and negative reverse transcriptase-polymerase chain reaction (RT-PCR) testing, with a sensitivity of 61% (95% CI 52–69%), a specificity of 81% (95% CI 76–84%), and an AUC of 0.72 (95% CI 0.67–0.75).

Our findings are in contrast with those reported by the earlier literature on CT diagnostic performance for COVID-19. In detail, a recent meta-analysis estimated a pooled sensitivity and specificity for chest CT of 94% and 37%, respectively [8]. However, it has been underlined that most of the earlier studies investigating the diagnostic performance of CT for COVID-19 were flawed by interpretation bias [15, 17, 18]. Indeed, most of the studies published to date did not provide pre-specified criteria to differentiate between positive and negative CT scans. Therefore, it could be hypothesized that applying unclear CT signs to define a positive case may have resulted in an optimistic overestimation of CT sensitivity at the expense of a reduced specificity. On the other hand, our results suggest that the use of a standardized threshold for evaluating CT findings, as provided by reporting systems such as CO-RADS, could more realistically reflect the diagnostic value of CT in daily clinical practice, characterized by a lower sensitivity and an improved specificity for COVID-19.

CO-RADS yielded an overall moderate interreader agreement for the assessment of lung involvement in COVID-19 (Fleiss’ K = 0.43 [95% CI 0.42–0.44]). This is comparable to what is reported by Prokop et al, who found an overall moderate interrater agreement among the 8 observers included (Fleiss’ K = 0.47 [95% CI 0.45–0.49]) [11].

In the current study, CO-RADS yielded a lower diagnostic accuracy compared with what was reported by Prokop et al, who found a mean AUC of 0.91 (95% CI 0.85–0.97) for positive RT-PCR testing. We believe that this discrepancy could be related to the higher degree of expertise and familiarity with the CO-RADS scoring of the readers involved in the study by Prokop et al. However, we observed that CO-RADS diagnostic accuracy was not significantly affected by readers’ experience when CT scans were read by radiologists with different levels of expertise, including radiologists in-training. These results suggest that the adoption of the CO-RADS reporting scheme could represent an effective strategy to facilitate and speed up the learning curve of radiologists in-training in the evaluation of chest CT scans of patients with suspected lung involvement by COVID-19. This latter aspect is noteworthy as, at many institutions, radiologists in-training are the first to assess CT scans, and they should be able to rapidly recognize typical CT findings of COVID-19. Indeed, the finding of CT patterns consistent with COVID-19 even in asymptomatic patients is not uncommon, and their recognition by radiologists in-training at the moment of CT scanning might limit the risk of spreading the infection, allowing suspicious cases to be directed promptly to dedicated care pathways.

This study has several limitations that should be addressed. First, this is a single-center, retrospective study; further multicenter studies with a prospective design are warranted to confirm our findings. Second, multiple RT-PCR testing to rule out SARS-CoV-2 infection was available only for 91 (21%) of the participants who tested negative at the initial RT-PCR; this could raise concerns about the reference standard reliability in this subset of participants, since a considerable rate of initial false-negative RT-PCR has been reported [1,2,3]. However, all patients with initial negative RT-PCR underwent clinical and laboratory monitoring for at least 14 days to confirm their negative status in accordance with guidelines from our institution. Third, our study sample is composed only of symptomatic patients admitted to the Emergency Department with suspected COVID-19, without including asymptomatic patients. We recognize that this may have biased patient selection toward participants with a more severe disease spectrum, thereby affecting the CO-RADS accuracy estimate. However, our study sample reflects the actual population for whom chest CT scan is recommended as a problem-solving tool [6, 7]. Moreover, the CO-RADS scheme has been developed specifically to be used in patients with moderate to severe symptoms [11]. Fourth, a final diagnosis was not available for patients with false-positive CT findings. Therefore, we were not able to assess the prevalence of co-morbidities or other respiratory tract infections, whose imaging findings may overlap with those typically observed in COVID-19. Additional studies addressing this issue are advisable to validate further the CO-RADS algorithm, since its performance may vary according to the different prevalence of conditions with overlapping CT characteristics.

In conclusion, the CO-RADS score, tested on a large sample of symptomatic patients, should be considered a valid tool for the identification of lung involvement in patients with suspected COVID-19, showing high diagnostic accuracy across readers with different levels of expertise. Specificity, in particular, is higher than previously thought, which could prompt a reconsideration of the role of CT in this clinical setting.