1 Introduction

As of January 9, 2023 [1], the COVID-19 pandemic had caused about 6.71 million fatalities and 668 million infections, with new variants emerging periodically [2]. Business, economic, and social dynamics were affected on a worldwide scale. Governments throughout the world adopted flight restrictions, social isolation, and heightened hygiene awareness. COVID-19 is easily mistaken for other viral infections, which makes detection challenging [3, 4]. Reverse-transcription polymerase chain reaction (RT-PCR) assays are the primary approved diagnostic approach for COVID-19 identification [5,6,7]. Contamination of or damage to the sample, or viral changes to the COVID-19 genome, may hinder detection performance [8,9,10]. Moreover, despite the need for prompt diagnosis, this test can take up to six hours from sample collection and may require multiple tests to distinguish between false-negative and false-positive results [7, 11]. False detections due to contamination in RT-PCR testing and the resulting delays have motivated several studies [12,13,14] that demonstrated chest computed tomography (CT) imaging as a non-invasive alternative. In addition, multiple publications [12, 13, 15] have advised a CT scan as a follow-up test for patients with COVID-19 symptoms and negative RT-PCR findings, because CT is non-invasive and provides detailed images that can aid diagnosis. However, CT has poor sensitivity in early cases of COVID-19 [16], and image acquisition is time-consuming, carries a risk of infection transmission (since it is performed in close proximity to patients) [17], and is expensive [18]. Chest X-ray (CXR) imaging, on the other hand, is less expensive, faster, and more widely available than CT, and it exposes the body to less radiation [19]. Chest X-rays have recently been widely used as a COVID-19 screening alternative, and their predictive value has been established [20].
Early COVID-19 cases showed bilateral, multifocal ground-glass opacities (GGO) with posterior or peripheral distribution, predominantly in the lower lung lobes, which progressed to pulmonary consolidation [21, 22]. Many lung diseases present with similar findings, so doctors struggle to differentiate COVID-19 infection from other forms of viral pneumonia. These similarities may result in misdiagnosis, delayed treatment, or even death. Therefore, there is an urgent need for technology to assist physicians in their analysis.

Significant advances in deep learning in recent years have led to state-of-the-art performance in a variety of computer vision applications, including image classification, object recognition, and image segmentation. Because of this development, solutions based on deep learning are now applied in a wider range of fields. Since the advent of deep Convolutional Neural Networks (CNNs), the application of CNNs to CXR images has been the topic of substantial research and broad adoption. Rajpurkar et al. [23] proposed the CheXNet network by retraining DenseNet121 on one of the largest Chest X-ray datasets [24], consisting of one hundred thousand X-ray images covering fourteen distinct diseases. Rahman et al. [25] trained networks to detect pulmonary tuberculosis (TB) using a dataset of 3500 TB-infected and 3500 normal CXRs. In addition, they retrained the DenseNet201 network with the TB and normal datasets, attaining a TB diagnosis sensitivity of 99.57%. Khuzani et al. [26] proposed that a set of CXR image features could be built using a dimensionality reduction method to create an effective machine learning classifier capable of distinguishing COVID-19 cases from non-COVID-19 cases with high accuracy and sensitivity. Mathew et al. [27] developed a Siamese neural network-based severity score to automatically quantify radiographic COVID-19 pulmonary disease severity. This score was validated against pulmonary X-ray severity (PXS) scores from two thoracic radiologists and one radiologist-in-training. Kim et al. [28] suggested a fully automated triage pipeline that analyzes chest radiographs for the presence, severity, and progression of COVID-19 pneumonia with 79.9% accuracy. In [29], Maguolo and Nanni questioned the efficacy of COVID-19 detection from X-rays reported in the literature and suggested that such studies should incorporate larger and more diverse X-ray datasets to eliminate biases. Robert et al. [30] reached a similar conclusion through a comprehensive literature review and proposed the use of a large and diverse dataset for COVID-19 detection from Chest X-rays. The authors of this study were also pioneers in presenting a cutting-edge deep learning model for detecting pneumonia [31] and COVID-19 [32] from chest X-rays. However, until recently, lung segmentation was used as the first step in their detection technique [33, 34], which helped to localize the decision-making area for machine learning networks. They generated 704 X-ray images for Normal and TB patients using the well-known Montgomery [35] and Shenzhen [36] CXR lung mask databases. In severe COVID-19 cases, where the lungs are severely deformed, or where images are of low resolution, segmentation performance can degrade. Using an effective human–machine collaboration technique to annotate ground-truth lung segmentation masks, another study built the largest benchmark dataset, with 33,920 CXR images including 11,956 COVID-19 samples [37]. To the best of the authors' knowledge, this is the largest CXR lung segmentation dataset, which can aid in the development of CXR-related computer-aided diagnostic tools employing deep learning techniques. In this study, the lung areas were segmented from the CXR images using the model trained on this dataset. In a previous study [38], we examined the effect of image enhancement techniques on segmented lungs for COVID-19 prediction, confirming that gamma correction enhancement provided an F1-score of approximately 90% using a dataset of 18,479 Chest X-ray images (8851 normal, 6012 non-COVID other lung diseases, and 3616 COVID-19) and their ground truth lung masks. Huang and Liao [39] proposed a lightweight CNN-based network (LightEfficientNetV2) for COVID-19 detection with the help of segmented lung images. The network achieved 98.33% accuracy in distinguishing COVID-19 from pneumonia and normal cases using 21,000 images. They reported that the same network achieved 97.48% accuracy on CT images. Despite the benefits of radiological imaging being non-invasive and machine learning speeding up diagnosis, several studies have favored accurate blood biomarkers, since variations in them can also help determine the severity and progression of an abnormality [40, 41].

Recent research indicates that biomarkers can play a significant role in providing vital information about an individual's health and in recognizing COVID-19. In addition, they can be used to assess severity and progression and to forecast mortality. Sarah et al. [42] introduced the Kuwait Progression Indicator (KPI) score as a predictor of the severity of COVID-19 progression. The KPI model was based on laboratory variables, which are objectively measurable, as opposed to grading systems that rely on self-reported symptoms and other subjective features. Patients were classified as low risk if their KPI score was less than −7 and as high risk if it was greater than 16; however, those with a score between −6 and 15 had an unknown likelihood of progression. This restricts its applicability to a broad range of patient populations. Weng et al. [43] presented the ANDC early prediction score to predict the mortality risk of COVID-19 patients. This model was constructed using information from 301 adult individuals with laboratory-confirmed COVID-19. Age, neutrophil-to-lymphocyte ratio, D-dimer, and C-reactive protein were identified as major predictors of mortality for COVID-19 patients by LASSO regression. Area under the curve (AUC) values of 0.921 and 0.975 for the derivation and validation cohorts, respectively, indicate that the nomogram was well-calibrated and discriminative. Patients were separated into three groups based on ANDC cutoff values of 59 and 101. The low-risk group (ANDC < 59) had a mortality probability below 5%, the moderate-risk group (59 < ANDC < 101) had a mortality probability between 5 and 50%, and the high-risk group (ANDC > 101) had a mortality probability greater than 50%. Using a dataset of 444 patients, Xie et al. [44] created a predictive model that integrates age, lactate dehydrogenase (LDH), lymphocyte count, and SpO2 as independent predictors of death. The model performed well in both internal (c = 0.89) and external (c = 0.98) validations. However, the model over-predicted risk for low-risk individuals while under-predicting it for high-risk individuals. Such severity scores can help allocate resources efficiently to patients predicted to be at high risk. Intensive care units (ICUs) are essential for keeping severely ill COVID-19 patients alive because they provide oxygen, 24-h monitoring, care, and, when necessary, assisted ventilation. In regions with a high incidence of COVID-19 infection, ICU beds are therefore a valuable resource [45,46,47]. Within the first hour of a hospital visit, routinely collected healthcare data such as blood tests and vital sign assessments are typically available. These data capture the patterns of change in COVID-19 patients observed in various retrospective observational investigations [48,49,50]. The results of these studies indicate that alanine aminotransferase (ALT), lymphocyte count, D-dimer, C-reactive protein (CRP), and bilirubin concentrations are significant clinical markers. Islam et al. [51] developed a generic and reliable model that predicts ICU admission for COVID-19 patients with an accuracy of 85.35%, using the optimal feature combination from patient data at admission collected in the pulmonology department of Moscow City State Hospital. Significant risk variables for ICU admission were identified as C-reactive protein (CRP), chest computed tomography (CT), lung tissue damage (%), age, hospital admission, and fibrinogen parameters at hospital admission. Consequently, clinical biomarkers can be used to construct a highly accurate prognosis model with traditional and deep learning methods.

Convolutional Neural Networks (CNNs) can be trained to classify diseases based on radiographic and other images but cannot consistently identify the underlying medical cause. The underlying etiology and severity can be diagnosed using a combination of patient symptoms, physical exam findings, laboratory data, and radiologic imaging findings. Consequently, machine learning algorithms that combine information from Chest X-rays with other clinical data from the electronic health record (EHR) should be able to predict the patient's severity more precisely. However, attempts to combine EHR and imaging data for machine learning applications in healthcare have not been widely studied. Few studies have used a combination of radiographic imaging, clinical biomarker data, and artificial intelligence to predict the prognosis of COVID-19 patients. Jiao et al. [52], using patient data from hospitals in the USA, developed a machine learning model based on clinical data and CXR images to predict the severity and progression of COVID-19 with an AUC of 82%. Chieregato et al. [53] proposed a multimodal approach based on CT images and clinical parameters, which were supplied to the Boruta feature selection algorithm with SHAP (SHapley Additive exPlanations) values; the CatBoost gradient boosting classifier then achieved an AUC of 0.949 with the reduced features on the holdout test set. With a probability score based on the significance of SHAP features, the model aimed to provide clinical decision support to medical doctors. However, the published research has either yielded unsatisfactory results, employed small datasets that limit the generalizability of the models, or employed CT, which has drawbacks as a technology. Notably, there has been research employing a DNA-based approach, but such data are not publicly accessible, and genome sequence-based investigations are computationally expensive [54].

As can be seen, the pandemic has triggered much research on the early detection of pulmonary abnormalities using clinical imaging such as CT scans and Chest X-rays, but imaging alone has limitations that can only be addressed by clinical examination of blood biomarkers [55,56,57]. Advances in machine learning have further enabled automatic early detection with high reliability and without the need for a medical expert's opinion, but reliability can be improved further by combining imaging and blood biomarker information. The authors of this study were motivated by the aforementioned pitfalls to create a multimodal system that uses CXR images and clinical biomarkers to stratify the severity of COVID-19 patients and their risk of death. Although most clinicians agree that there is a great need for a multimodal system, the main challenge in developing one is the availability of a suitable multimodal dataset. Even though a plethora of publications appeared in the early and later stages of the pandemic, very few works proposed a multimodal system to make the model reliable and explainable to clinicians. This is one of the first studies to develop a COVID-19 severity prediction model using both CXR and biomarkers. The paper proposes a comprehensive, dependable, and novel approach that supplements previous work in this domain. The method is applicable not only to COVID-19 detection and severity classification but also to any other lung abnormality-related complications. The following details help to explain the approach's novelty and utility:

  • To segment the lungs from Chest X-rays, the authors used a robust segmentation network (which they proposed in their previous work [37]). This type of segmentation will aid the machine learning network in determining the region of interest.

  • To extract features from the segmented X-rays, the cutting-edge machine learning network CheXNet [58, 59] (which was developed using the largest Chest X-ray dataset) was used.

  • A multimodal technique based on Chest X-rays and Common Blood Count features was used. This will allow the network to perform more accurately when the severity cannot be determined solely by X-rays.

  • Applied the stacking method to improve classification and severity performance.

  • Created a nomogram scoring technique that clinicians can use to predict the severity of COVID-19 patients.

  • The entire solution was implemented as an easy-to-use app for clinicians.

The rest of the article is organized as follows: Sect. 2 describes the study’s methodology, which includes dataset descriptions, preprocessing stages, machine learning and stacking techniques, and the development of a nomogram-based scoring system. Section 3 presents the experimental results and reports on the performance of the scoring technique, while Sect. 4 explains the results. Section 5 concludes the article by making future recommendations.

2 Methodology

This study included two major investigations. The first study used a multimodal stacking model-based approach combining CXR images and clinical data to predict the severity risk of COVID-19 patients, while in the second study, CXR images and clinical biomarkers-based combined features were used to predict the death outcome in high-risk patients using a nomogram-based scoring system. The method is organized with the following sub-sections: System Architecture, Dataset Description, Statistical Analysis, Data Preprocessing, Experiments and Performance Metrics.

2.1 System architecture of the proposed system

First, CXR images are preprocessed, the lung area is segmented, and the segmented images are fed to a pre-trained deep CNN model to extract image features, which are then reduced in dimensionality using principal component analysis (PCA). Clinical data are processed in parallel, and clinical features are ranked using a feature selection algorithm. Finally, the PCA components and top-ranked clinical features are combined to build a stacking ensemble model that predicts whether patients are low or high risk. The combined reduced-dimensionality features of the high-risk patients are then used to develop another stacking model. Furthermore, we developed a nomogram-based scoring technique using the stacking model for the early prediction of death outcomes. The methodology is depicted schematically in Fig. 1.

Fig. 1 Overview of the proposed system architecture

The study proposed a stacking-based approach and compared its performance with conventional ML classifiers. This approach consists of two levels of learners: base learners and a meta-learner. The three best-performing ML classifiers were selected as base learner models in the stacking architecture, and logistic regression was used as the meta-learner model (\(M_{f}\)) in the second phase, which produced the final prediction. Figure 2 shows the architecture of the proposed stacking model, which combines the n best-performing classifiers \(m_{1} , \ldots ,m_{n}\) using an input dataset D consisting of feature vectors (\({\varvec{x}}_{{\varvec{i}}}\)) and corresponding labels (\({\varvec{y}}_{{\varvec{i}}}\)). In the first step, the n base-level ML classifiers produce the prediction probabilities \(y_{1} , \ldots ,y_{p}\). Finally, the prediction probabilities of the best-performing base learners are fed to the logistic regression-based meta-learner classifier (\(M_{f}\)) for the final prediction.

Fig. 2 Proposed stacking model architecture

Each ML classifier in the base learner predicts a probability distribution according to the output class values. Therefore, a probability distribution is created for the input x using the predictions of the base-level classifier set m in Eq. 1:

$$ {\text{PD}}^{m} \left( x \right) = \left( {{\text{PD}}^{m} \left( {l_{1} |x} \right),{\text{PD}}^{m} \left( {l_{2} |x} \right), \ldots ,{\text{PD}}^{m} \left( {l_{r} |x} \right)} \right) $$
(1)

where \((l_{1} ,l_{2} , \ldots ,l_{r} )\) are the original class values, and \({\text{PD}}^{m} \left( {l_{i} {|}x} \right)\) denotes the probability that x belongs to class \(l_{i}\) as estimated by classifier m. Classifier m predicts the class \(l_{k}\) with the highest class probability \({\text{PD}}^{m} \left( {l_{k} {|}x} \right)\). The meta-learner attributes are the probabilities produced for each class by each of the base-level classifiers, i.e., \({\text{PD}}^{{m_{k} }} \left( {l_{i} |x} \right)\) for \(i = 1, \ldots , r\) and \(k = 1, \ldots , n\), where r and n are the number of classes and the number of base-level classifiers, respectively.
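The two-level arrangement described above can be sketched with scikit-learn's `StackingClassifier`. This is only an illustration under stated assumptions: the data are synthetic, the three base learners are arbitrary stand-ins for whichever three classifiers perform best, and the hyperparameters are not taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the combined feature matrix (assumption, not study data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative base learners m_1..m_3 (hypothetical choice of "top three").
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# The logistic-regression meta-learner M_f is trained on the class
# probabilities that the base learners produce out-of-fold (cv=5).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)

# Meta-learner inputs: for a binary problem scikit-learn keeps one
# probability column per base learner (the other is redundant).
meta_features = stack.transform(X_test)
proba = stack.predict_proba(X_test)
```

With two classes, each base learner contributes a single probability column to the meta-learner input, since the second column is its complement.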

2.2 Dataset description

The study utilized a dataset from the first wave of COVID-19, collected between March and June of 2020, which contained both CXRs and clinical data collected from six Italian hospitals at the time of admission for symptomatic COVID-19 patients [60]. This dataset includes a Posterior Anterior (PA) or Anterior Posterior (AP) view of 930 X-ray images and clinical data for COVID-19-positive patients [60]. Each patient tested positive for COVID-19 using RT-PCR. This data collection includes 396 (42.6%) low-risk individuals and 534 (57.4%) high-risk patients. In addition, 364 (68.2%) of the 534 high-risk patients survived, while 170 (31.8%) perished. Figure 3 depicts CXR image samples from the dataset.

Fig. 3 Chest X-ray sample images for COVID-19. A Low-risk patients, B High-risk patients with survival outcomes, and C High-risk patients with death outcomes

2.3 Statistical characteristics

A statistical analysis of the patients' demographics, signs and symptoms, clinical data, comorbidities, and outcomes was performed using Stata/MP 13.0 software. The dataset includes gender, age, and twenty-three signs and symptoms, comorbidities, and clinical biomarkers. Table 1 shows the statistical characteristics of the 25 parameters (age, gender, signs and symptoms, comorbidities, clinical biomarkers). Gender is reported as counts and percentages. For the remaining variables, the number of missing data (N), presence and absence of signs and symptoms, mean (M), and standard deviation (SD) are reported. Gender was analyzed with a univariate Chi-square test, while the other variables were analyzed with Wilcoxon's rank tests. At the 95% confidence level, a p value below 0.05 was considered significant.
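The study ran these tests in Stata; the following sketch shows the same kind of comparison with SciPy. It assumes that the Wilcoxon test is the rank-sum variant for two independent groups, and all numbers are hypothetical placeholders rather than values from Table 1.

```python
import numpy as np
from scipy.stats import chi2_contingency, ranksums

rng = np.random.default_rng(0)

# Hypothetical 2x2 contingency table: rows = gender, columns = low / high risk.
gender_by_risk = np.array([[150, 250],
                           [246, 284]])
chi2, p_gender, dof, expected = chi2_contingency(gender_by_risk)

# Hypothetical continuous biomarker values for the two risk groups.
biomarker_low = rng.normal(280, 60, size=396)
biomarker_high = rng.normal(380, 90, size=534)
stat, p_biomarker = ranksums(biomarker_low, biomarker_high)

# Significance threshold stated in the text.
alpha = 0.05
gender_significant = p_gender < alpha
biomarker_significant = p_biomarker < alpha
```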

Table 1 Summary of statistical characteristics of the study patients

2.4 Data preprocessing

This section discusses the data preprocessing steps for both data modalities in detail.

2.4.1 Chest X-ray image preprocessing

A. Gamma correction

Image enhancement is a common image-processing technique that emphasizes important information in an image while reducing or removing other information to improve identification quality. As demonstrated in our previous work [38], applying gamma correction to CXRs improves COVID-19 detection performance by improving image quality. Image normalization frequently relies on linear operations such as pixel-wise scalar multiplication, addition, and subtraction, whereas gamma correction is a nonlinear operation performed on the pixels of the source image. Gamma correction maps each input pixel value to an output value through a power-law relationship governed by the gamma value. The pixel value here can range from 0 to 255. Figure 4 shows a sample X-ray image before and after applying gamma correction. If G is the grayscale value, then the gamma-corrected output pixel s(G) can be written as in Eq. (2):

$$ s\left( G \right) = 255\left( \frac{G}{255} \right)^{1/\gamma \left( G \right)} $$
(2)

where γ(G) represents the gamma value.
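As a minimal NumPy sketch of Eq. (2), the function below applies the mapping to an 8-bit grayscale array. It assumes a single constant gamma for the whole image, whereas the equation allows γ to depend on G; the function name and the value γ = 2 are illustrative only.

```python
import numpy as np


def gamma_correct(image: np.ndarray, gamma: float) -> np.ndarray:
    """Apply s(G) = 255 * (G / 255) ** (1 / gamma) to an 8-bit grayscale image."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    normalized = image.astype(np.float64) / 255.0           # scale to [0, 1]
    corrected = 255.0 * np.power(normalized, 1.0 / gamma)   # power-law mapping
    return np.clip(np.rint(corrected), 0, 255).astype(np.uint8)


# Tiny illustrative image: the extremes 0 and 255 are fixed points of the mapping.
image = np.array([[0, 64], [128, 255]], dtype=np.uint8)
brighter = gamma_correct(image, gamma=2.0)
unchanged = gamma_correct(image, gamma=1.0)
```

With this form of the equation, γ > 1 brightens mid-tones and γ < 1 darkens them, while γ = 1 leaves the image unchanged.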

Fig. 4 Sample X-ray image before (a) and after (b) applying gamma correction

B. Lung segmentation

As previously discussed, it is critical to localize the region of interest for machine learning networks, in this case the lungs in the Chest X-rays. In our previous work on CXR lung segmentation [37], three segmentation architectures with different encoder backbones were investigated: U-Net [62], U-Net++ [63], and Feature Pyramid Networks (FPN) [61]. The FPN segmentation network with the DenseNet121 [59] encoder as a backbone outperformed the other conventional segmentation networks and segmented the lung area very accurately, which was confirmed by experienced radiologists. Figure 5 depicts some of the X-ray images and their corresponding masks.

Fig. 5 Sample X-ray images from the study dataset (A), masks generated by the best-performing DenseNet121 FPN model (B), and corresponding segmented lungs (C)

C. Feature extraction

To extract important features from segmented Chest X-rays, a CheXNet CNN model based on the DenseNet-121 [59] architecture was used. CheXNet is a DenseNet variant that was trained on a large Chest X-ray dataset, and the pre-trained model is publicly available. As demonstrated in our previous work [38], it performed exceptionally well in the COVID-19 classification task. The features were taken from the output of the CheXNet model's last layer ('AvgPool') before the Softmax layer, computed on the segmented lung area of the CXR images.

D. PCA for dimensionality reduction

To reduce the dimensionality of the feature space produced by the CheXNet model, Principal Component Analysis (PCA) was used. PCA converts high-dimensional data into a new lower-dimensional representation with the least amount of reconstruction error. The reduced set contains no linearly redundant information because all the principal components are orthogonal to one another. PCA was computed with whitening, which can improve accuracy by making the data satisfy certain assumptions of downstream estimators.
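A hedged sketch of this step with scikit-learn's `PCA` is shown below. The random matrix stands in for the extracted image features (DenseNet-121 pooled features are 1024-dimensional), and the number of retained components (20) is an assumption, since the text does not state it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for per-image feature vectors (200 images x 1024 features).
features = rng.normal(size=(200, 1024))

# whiten=True rescales each retained component to unit variance.
pca = PCA(n_components=20, whiten=True, random_state=0)
reduced = pca.fit_transform(features)

# Fraction of the total variance kept by the retained components.
retained_variance = pca.explained_variance_ratio_.sum()
```

In a cross-validated pipeline, the PCA would normally be fitted on the training folds only and then applied to the held-out fold.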

2.4.2 Clinical data preprocessing

A. Data imputation and normalization

Missing data imputation is the most critical stage in clinical data preprocessing for machine learning model construction. Many blood biomarkers were obtained for each patient, and some of them were missing for some patients. Rather than removing the missing data for the various variables, various imputation techniques were investigated. Deleting variables with missing values may result in the loss of critical and contextual information and may affect how well the dataset represents the population [35]. Missing value imputation methods based on machine learning (ML) have grown in popularity. This technique, on the other hand, necessitates the creation of a separate model for each data column with missing values. In this study, a popular data imputation technique known as multivariate imputation by chained equations (MICE) was used to deal with missing data. According to the literature and previous works of the authors [64,65,66,67], the MICE technique outperforms other imputation techniques for clinical data [68].

The generalization performance of machine learning models depends heavily on the quality of the input data. The process of scaling or changing data so that each feature contributes equally to the training process is referred to as "data normalization." Numerous studies have shown that normalization improves the performance of machine learning models [29]. Z-score normalization was used in this study: the mean of each feature was subtracted from its values, and the result was divided by the feature's standard deviation.
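The two preprocessing steps can be sketched with scikit-learn as follows. `IterativeImputer` is a MICE-inspired imputer and is used here as an assumption; the study does not name its implementation. The three synthetic columns are placeholders for clinical variables.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for three clinical variables (hypothetical scales).
clinical = rng.normal(loc=[60.0, 300.0, 8.0], scale=[15.0, 80.0, 3.0], size=(100, 3))

# Knock out roughly 10% of the entries to mimic missing biomarkers.
missing_mask = rng.random(clinical.shape) < 0.1
clinical_missing = clinical.copy()
clinical_missing[missing_mask] = np.nan

# Chained-equation imputation: each column with gaps is regressed on the others.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = imputer.fit_transform(clinical_missing)

# Z-score normalization: subtract the column mean, divide by the column SD.
scaler = StandardScaler()
normalized = scaler.fit_transform(imputed)
```

To avoid information leakage, both the imputer and the scaler would normally be fitted on the training portion of each fold and only applied to the test portion.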

B. Top-ranked features

The feature selection technique selects the features that have the greatest influence on the output prediction. It helps to reduce overfitting, typically improves accuracy, and drastically reduces training time. Some of the feature selection methods used in the literature include univariate selection, principal component analysis (PCA), recursive feature elimination (RFE), bagged decision trees (e.g., random forest), and boosted trees (e.g., Extreme Gradient Boosting). Random forest frequently outperforms other methods due to its ability to handle datasets with many predictor variables [69]. As a result, a random forest-based feature selection technique was used in this study to rank the 25 variables (age, gender, signs and symptoms, comorbidities, and clinical biomarkers) for risk prediction.
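One common way to obtain such a ranking is the impurity-based importance of a fitted random forest, sketched below. The data and feature names are synthetic placeholders, and whether the study used impurity-based or permutation importance is not stated, so this is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 25 candidate variables (hypothetical names).
feature_names = [f"feature_{i}" for i in range(25)]
X, y = make_classification(
    n_samples=400, n_features=25, n_informative=5, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X, y)

# Rank variables by impurity-based importance, highest first.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
ranked = [(feature_names[i], float(importances[i])) for i in order]
top5 = [name for name, _ in ranked[:5]]
```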

2.5 Experiments

All experiments in this study were carried out in Python 3.7. The PyTorch library was used for feature extraction with a deep neural network, and the machine learning algorithms were taken from the Scikit-learn library. All investigations were run on an Intel® Xeon® CPU E5-2697 v4 @ 2.30 GHz with 64 GB RAM and a 16 GB NVIDIA GeForce GTX 1080 GPU.

As stated in the earlier section, two different types of investigations were carried out: risk classification and outcome prediction for high-risk patients. Fivefold cross-validation was performed in this study. Therefore, 80% of the data were used for training and 20% for testing in each fold. Finally, a weighted average of the five folds was calculated. The number of training and test Chest X-ray images, and clinical data used in the two experiments are listed in Table 2.

Table 2 Details of the dataset used for training, validation, and testing

2.5.1 Development and internal validation of stacking classification model

The PCA-reduced CXR image features and the top-ranked clinical features were used individually and in combination for risk and death prediction with eight machine learning classifiers, namely Random Forest [55], Support Vector Machine (SVM) [56], K-nearest neighbor (KNN) [57], XGBoost [58], AdaBoost [59], Gradient Boosting, linear discriminant analysis (LDA) [60], and Logistic Regression [61]. This study used fivefold stratified cross-validation, in which four folds form the training set for the classifiers and the remaining fold is held out for validation. The best-performing three models were used to train, validate, and test the stacking model (as described earlier).
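The model-selection step can be sketched as a fivefold stratified comparison that keeps the three classifiers with the highest mean accuracy. The sketch uses synthetic data and default hyperparameters, and it omits XGBoost so that it depends on scikit-learn only; ranking by accuracy is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the combined feature matrix (assumption).
X, y = make_classification(n_samples=300, n_features=15, n_informative=6, random_state=0)

# Seven of the eight classifier families named in the text (XGBoost omitted).
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(random_state=0),
    "knn": KNeighborsClassifier(),
    "adaboost": AdaBoostClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "lda": LinearDiscriminantAnalysis(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Each split trains on four folds (about 80%) and validates on one (about 20%).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Keep the three best performers as base learners for the stacking model.
top3 = sorted(mean_accuracy, key=mean_accuracy.get, reverse=True)[:3]
```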

2.5.2 Experiment-01: risk stratification using CXR Image and clinical data

In this experiment, three investigations were conducted to predict the risk of COVID-19 patients. The first used CXR image features, the second used clinical features, and the third used the combined features from both modalities to stratify the risk.

A. Binary classification (low vs high risk) using CXR images

The CheXNet model was used to extract features from CXR and then PCA was used to reduce the dimensionality of the CXR features. Then, using reduced feature components and fivefold cross-validation, eight alternative ML classifiers were developed to determine which models performed well in classifying low and high-risk patients. The stacking model was built using the top three base models and a meta-model, and the performance of the stacking technique for the CXR image alone is reported.

B. Binary classification (low vs high risk) using clinical data

Using fivefold cross-validation, Top-5 features (LDH, O2 percentage, Age, WBC, and CRP) identified in the previous stage were tested on eight different ML classifiers to determine which models performed best in classifying low and high-risk patients. A stacking model was trained using the top-performing three algorithms as base models to train a meta learner and the performance of the meta learner and base models are reported.

C. Binary classification (low vs high risk) using CXR images and clinical data

This experiment evaluated how well the reduced CXR feature components combined with the top-ranked clinical variables categorize low- and high-risk patients, using different ML classifiers with fivefold cross-validation. It demonstrates the efficacy of the multimodal method presented in this work in comparison to the hundreds of approaches published on CXR alone and the tens of approaches published on clinical data alone.

2.5.3 Experiment-02: death probability prediction for high-risk patients

Three investigations were conducted to predict the death outcome of high-risk COVID-19 patients, following the same design as Experiment-01. The first used CXR image features, the second used clinical features, and the third used the combined features from both modalities to separate patients who died from those who survived.

A. Binary classification (survival vs death) using CXR images

The features extracted from the CXR images using CheXNet were reduced in dimensionality using PCA and used to train eight different ML classifiers with fivefold cross-validation to see which models performed well in predicting the mortality outcome of high-risk patients. Among the eight models, the best-performing three were used to train the stacking model, and the results of the base and stacking models are reported.

B. Binary classification (survival vs death) using clinical data

Top-5 clinical features (LDH, O2 percentage, Age, WBC, and CRP) were tested on eight different ML classifiers to determine which models performed best in predicting the mortality outcome among high-risk patients. A stacking model was trained using the top-performing three algorithms as a base model to train a meta learner and the performance of the meta learner and base models are reported.

C. Binary classification (survival vs death) using CXR images and clinical data

As a multimodal approach, we investigated the efficacy of the reduced CXR features combined with the top-ranked clinical features in predicting the mortality outcome of high-risk patients, using fivefold cross-validation and the same eight models. The three best-performing models were then used to train the stacking ML model, and the results for the base models and the stacking model are reported.
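The image-only variant of the procedure described above (PCA on extracted features, three base learners, logistic regression meta-learner, fivefold cross-validation) can be sketched as follows. This is a minimal illustration, assuming scikit-learn is available; the synthetic data, the choice of base learners, the tree counts, and the helper names `build_stacking_pipeline`, `out_of_fold_predictions`, and `make_demo_data` are assumptions for illustration and are not the authors' code or tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline


def build_stacking_pipeline(pca_variance=0.95, seed=0):
    """PCA followed by a stacking classifier (hypothetical settings)."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=25, random_state=seed)),
        ("et", ExtraTreesClassifier(n_estimators=25, random_state=seed)),
        ("gb", GradientBoostingClassifier(n_estimators=25, random_state=seed)),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,                          # inner folds that feed the meta-learner
        stack_method="predict_proba",  # meta-learner sees probability scores
    )
    # A float n_components keeps enough components to explain that
    # fraction of the variance; this requires the full SVD solver.
    pca = PCA(n_components=pca_variance, svd_solver="full")
    return make_pipeline(pca, stack)


def out_of_fold_predictions(X, y, seed=0):
    """Concatenate the predictions of the five held-out test folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_predict(build_stacking_pipeline(seed=seed), X, y, cv=cv)


def make_demo_data(seed=0):
    """Synthetic stand-in for extracted image features (not real data)."""
    return make_classification(
        n_samples=150, n_features=40, n_informative=8, random_state=seed
    )
```

Placing PCA inside the pipeline means it is refitted on each training fold, which avoids leaking test-fold information into the projection; whether the original study fitted PCA this way is not stated in the text.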

2.5.4 Development and validation of logistic regression-based nomogram

Nomograms are a popular graphical scoring technique for converting statistical models into an estimate of the probability of a single event [70]. This can be accomplished using various ML classifiers, such as the logistic regression classifier. Logistic regression uses multiple independent predictors (x) to predict linearly related outcomes (y). From the linear prediction, the event probability (Pr) can be computed and reported. A logistic regression-based nomogram was developed for high-risk patients to stratify their survival and mortality probabilities. The nomogram was created with logistic regression using the integrated features from CXR and clinical data as well as the base learners' predictions. In addition, calibration curves for model development and validation were plotted to compare the predicted and actual death probability of high-risk patients. Finally, decision curve analysis was used to determine the threshold probability range within which the nomogram is clinically useful.

2.6 Performance metrics

Recall/Sensitivity (R), Precision (P), Accuracy (A), Specificity (S), and F1-Score (F1) were used to evaluate the performance of the different classifiers, as in the literature [59]. Because fivefold cross-validation was used, the reported results cover the full dataset (the five test folds concatenated). Since the number of instances in each class varies, we report class-weighted values of the metrics together with the overall accuracy. The area under the curve (AUC) was also used to judge model performance. Equations (3)–(7) give the mathematical expressions of the five evaluation metrics:

$$ A = \frac{{\text{TP}}_{{\text{class}}_{i}} + {\text{TN}}_{{\text{class}}_{i}}}{{\text{TP}}_{{\text{class}}_{i}} + {\text{TN}}_{{\text{class}}_{i}} + {\text{FP}}_{{\text{class}}_{i}} + {\text{FN}}_{{\text{class}}_{i}}} $$
(3)
$$ P = \frac{{\text{TP}}_{{\text{class}}_{i}}}{{\text{TP}}_{{\text{class}}_{i}} + {\text{FP}}_{{\text{class}}_{i}}} $$
(4)
$$ R = \frac{{\text{TP}}_{{\text{class}}_{i}}}{{\text{TP}}_{{\text{class}}_{i}} + {\text{FN}}_{{\text{class}}_{i}}} $$
(5)
$$ F1 = 2\,\frac{P \times R}{P + R} $$
(6)
$$ S = \frac{{\text{TN}}_{{\text{class}}_{i}}}{{\text{TN}}_{{\text{class}}_{i}} + {\text{FP}}_{{\text{class}}_{i}}} $$
(7)
$$ {\text{where}}\;{\text{class}}_{i} = {\text{Mild}}\;{\text{and}}\;{\text{severe}}\;{\text{or}}\;{\text{survived}}\;{\text{and}}\;{\text{death}} $$

Here, \({\text{TP}}_{{\text{class}}_{i}}\) is the number of true positives (samples of the actual class that are correctly detected), \({\text{TN}}_{{\text{class}}_{i}}\) is the number of true negatives (samples of the other class that are correctly detected), \({\text{FP}}_{{\text{class}}_{i}}\) is the number of false positives (samples of the other class that are incorrectly detected as the actual class), and \({\text{FN}}_{{\text{class}}_{i}}\) is the number of false negatives (samples of the actual class that are incorrectly detected as the other class).
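To make the definitions above concrete, the following sketch computes the five metrics per class from label lists and then averages them weighted by class support. It is a minimal illustration in plain Python; the function names and the support-weighted averaging rule are assumptions about how the "weighted values" were obtained, not the authors' code.

```python
from typing import Dict, Sequence, Tuple


def per_class_counts(y_true: Sequence, y_pred: Sequence, label) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) when `label` is treated as the actual class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    tn = sum(1 for t, p in pairs if t != label and p != label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, tn, fp, fn


def safe_div(num: float, den: float) -> float:
    """Division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def class_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and specificity for one class."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


def weighted_metrics(y_true: Sequence, y_pred: Sequence) -> Dict[str, float]:
    """Average the per-class metrics, weighting each class by its support."""
    n = len(y_true)
    totals = {k: 0.0 for k in ("accuracy", "precision", "recall", "f1", "specificity")}
    for label in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == label)
        metrics = class_metrics(*per_class_counts(y_true, y_pred, label))
        for name, value in metrics.items():
            totals[name] += value * support / n
    return totals
```

Under support weighting, the weighted sensitivity always equals the overall accuracy, which is one possible reason the two values coincide in several of the reported results.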

3 Results

3.1 Best features and their combination selection

The random forest feature ranking technique was used to select the ten top-ranked features from 25 statistically significant features (Fig. 6). A few hyperparameters of the Random Forest feature selection technique were tuned using the Optuna optimizer [71], and the algorithm was trained with ‘n_estimators’ = 75, ‘criterion’ = ‘entropy’, and ‘max_depth’ = 50. Table 3 shows the results of testing these ten top-ranked features with multiple classifiers to determine the best-performing feature combination. Among the Top-10 features, it was critical to determine the most appropriate parameters for the early prediction of high-risk COVID-19 patients. When only the Top-5 features (LDH, O2 percentage, WBC, Age, and CRP) are used, the Gradient Boosting classifier outperforms the other classifiers in binary classification (low- vs. high-risk), producing overall accuracy, weighted sensitivity, precision, specificity, and F1 scores of 82.91%, 82.91%, 82.87%, 82.91%, and 82.87%, respectively.

Fig. 6 Top ten features selected using the random forest feature selection technique

Table 3 Summary of the performance metrics for Top 1 to 10 clinical features

3.2 Risk prediction of COVID-19 patients

This section reports the results of three experiments to predict whether COVID-19 patients are low- or high-risk. The performance of the different ML models is reported for CXR images alone, for clinical data alone, and for the two in combination. All results are based on fivefold cross-validation.

3.2.1 Performance analysis using CXR images

The gradient boosting classifier was the best-performing classifier for stratifying low- and high-risk COVID-19 patients, achieving precision, sensitivity, and F1 scores of 78.41%, 78.48%, and 78.41%, respectively. The stacking model was built using the top three classifiers (Random Forest, KNN, and Gradient Boosting). It produces slightly better performance, with precision, sensitivity, and F1 scores of 79.5%, 79.53%, and 79.54%, respectively.

3.2.2 Performance analysis using clinical data

The gradient boosting classifier outperforms the other classifiers in binary classification, with precision, sensitivity, and F1 scores of 82.81%, 82.8%, and 82.81%, respectively. The stacking model was trained using the top three algorithms (Random Forest, Gradient Boosting, and XGBoost). With a logistic regression meta-learner, it outperformed the base models, with precision, sensitivity, and F1 scores of 83.01%, 83.87%, and 83.01%, respectively.

3.2.3 Performance analysis using both CXR images and clinical data

The gradient boosting classifier outperforms other classifiers with precision, sensitivity, and F1 scores of 88.81%, 88.81%, and 88.81%, respectively, using combined CXR characteristics and clinical data. The stacking model was built using the top three algorithms (Gradient Boosting, LDA, and Random Forest) and it outperforms the base models and produces precision, sensitivity, and F1 scores of 89.03%, 90.44%, and 89.03%, respectively. Using a combination of CXR and top-ranked clinical characteristics, the stacking model revealed around a 6% improvement. Table 4 compares, with a 95% confidence interval, the prediction of low- or high-risk patients using CXR characteristics and clinical data alone and in combination with different classifiers employing distinct metrics.

Table 4 Comparison of performance metrics for risk prediction using different ML models and approaches (single mode and multimode)

Fig. 7 shows that the combination of CXR image features and top-ranked clinical features outperformed each individual modality, with an AUC of 91.5%. With the stacking model, the CXR image features and the top-ranked clinical features individually produced AUC values of 82.3% and 85%, respectively.

Fig. 7 ROC curves for risk prediction of COVID-19 patients with single and multi-modal data using the stacking ML model

3.3 Death probability prediction for high-risk patients

This section reports the results of three experiments to predict the probability of death among high-risk COVID-19 patients. The fivefold cross-validation performance of the different ML models is reported for CXR images, for clinical data, and for their combination.

3.3.1 Performance analysis with CXR images

Random Forest classifier outperforms the other 7 classifiers in classifying the dead and survived COVID-19 patients with precision, sensitivity, and F1 scores of 84.83%, 85.02%, and 84.83%, respectively. The stacking model was built using the top three methods (Random Forest, Extra Tree, and Gradient Boosting) and produces precision, sensitivity, and F1 scores of 86.35%, 83.22%, and 86.35%, respectively.

3.3.2 Performance analysis with clinical data

The gradient boosting model outperforms the other seven classifiers in stratifying the survived and dead patients, with precision, sensitivity, and F1 scores of 89.14%, 89.86%, and 89.14%, respectively. The stacking model was trained using the top three models (Random Forest, XGBoost, and Extra Tree). It outperformed the base models, achieving 91.2% precision, 91.25% sensitivity, and a 91.2% F1 score.

3.3.3 Performance analysis using both CXR images and clinical data

The Random Forest classifier outperforms the other models with precision, sensitivity, and F1 scores of 91.76%, 91.86%, and 91.76%, respectively. The stacking machine learning model was trained using Random Forest, Extra Tree, and Gradient Boosting, and it outperforms the base models with precision, sensitivity, and F1 scores of 92.88%, 93.37%, and 92.88%, respectively. Across all performance metrics, the performance of the stacking model improved by ~ 6% when both the reduced CXR features and the top clinical features were used (Table 5). The fine-tuned parameters for the best-performing classifiers are shown in Supplementary Table 1.

Table 5 Comparison of performance metrics for death prediction using different ML models and approaches (single mode and multimode)

Fig. 8 likewise shows that the combination of CXR image features and top-ranked clinical features outperformed the individual modalities, with an AUC of 92.8%. With the stacking model, the reduced CXR image features and the top-ranked clinical features individually produce AUC values of 88.4% and 91.1%, respectively. In this study, the main parameter that helped to improve the results was the PCA variance. PCA was used to reduce the dimensionality of the features extracted from the images by the CNN encoder, and the retained variance was varied from 70 to 95% to find the best performance. The performance of the stacking model for single and multimodal data with different PCA variances is shown in Supplementary Tables 2 and 3 for study 1 and 3, respectively.

Fig. 8 ROC curves for outcome prediction of high-risk patients with single and multi-modal data using the stacking ML model

3.3.4 Stacking ML-based nomogram

Owing to the superior performance of the logistic regression meta-learner in classifying survived and dead patients, a Nomogram was created using the probability scores of the three best models (Random Forest (M1), Extra Tree (M2), and Gradient Boosting (M3)) to estimate the survival and death probabilities of the high-risk group. Multivariate logistic regression was used to explore the relationship between the probability scores of these base learners and the likelihood of death in high-risk patients (Table 6). A common way of detecting relevant variables is the z-value, which is computed from the regression coefficient and its standard error. A high absolute z-value indicates that the independent variable is significant.

Table 6 Summary of logistic regression analysis

Table 6 demonstrates that Extra Tree (M2) is not a particularly strong predictor of the outcome of high-risk COVID-19 patients, whereas Random Forest (M1) and Gradient Boosting (M3) are strong predictors. The p value can also be used to identify significant variables: if p < 0.05, the X-variable has a statistically significant relationship with the Y-variable. The p value likewise shows that the Extra Tree model is a weak predictor. The linear prediction (LP) and the probability of death in high-risk patients (Prob) are calculated using Eqs. 8 and 9.
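The z-value and p value screening discussed in this subsection follows the usual Wald test for a logistic regression coefficient. The sketch below shows the arithmetic in plain Python; the function names are illustrative, and the coefficient and standard error in the usage note are hypothetical numbers, not values from Table 6.

```python
import math


def wald_z(coefficient: float, std_error: float) -> float:
    """Wald z-value: regression coefficient divided by its standard error."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    return coefficient / std_error


def two_sided_p_value(z: float) -> float:
    """Two-sided p value P(|Z| > |z|) under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_significant(coefficient: float, std_error: float, alpha: float = 0.05) -> bool:
    """True when the predictor passes the p < alpha screen."""
    return two_sided_p_value(wald_z(coefficient, std_error)) < alpha
```

For example, a hypothetical coefficient of -2.0 with a standard error of 0.5 gives z = -4.0 and passes the p < 0.05 screen, whereas the same coefficient with a standard error of 2.0 gives z = -1.0 and does not.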

Nonetheless, the model performance was marginally diminished when two models were stacked instead of three. Therefore, all three models are used to produce the Nomogram. As seen in Fig. 9, the nomogram comprises six rows, of which rows 1 to 3 represent the variables included. The “Points” axis gives a score for each variable in the high-risk death or survival category. The total score is obtained by adding the points from the three variables (row 4) and is displayed in row 6. To determine a patient's mortality risk, a line is drawn from the “Total Score” axis to the “Prob” axis (row 5).

Alternatively, the following formula can be used to calculate the nomogram score:

$$ {\text{LP}} = 11.23907 - 14.85299 \times M1 - 5.028269 \times M2 - 1.788734 \times M3 $$
(8)
$$ {\text{Prob}} = 1/ \left( {1 + \exp \left( { - {\text{LP}}} \right)} \right) $$
(9)

Figure 9 also depicts the Nomogram scores for both the survived and the deceased classes. It was discovered that 50% classification probability cutoffs correspond to a Nomogram score of 4.8 or a probability of 0.5, stratifying the classes.
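Equations (8) and (9) can be evaluated directly, as in the sketch below. The coefficients are copied from Eq. (8); the function names are illustrative, and the inputs `m1`, `m2`, and `m3` are assumed to be the base-learner probability scores on the 0 to 1 scale used by the authors.

```python
import math

# Coefficients taken from Eq. (8).
INTERCEPT = 11.23907
COEF_M1 = -14.85299   # Random Forest score (M1)
COEF_M2 = -5.028269   # Extra Tree score (M2)
COEF_M3 = -1.788734   # Gradient Boosting score (M3)


def linear_predictor(m1: float, m2: float, m3: float) -> float:
    """Linear prediction LP of Eq. (8)."""
    return INTERCEPT + COEF_M1 * m1 + COEF_M2 * m2 + COEF_M3 * m3


def death_probability(m1: float, m2: float, m3: float) -> float:
    """Probability Prob of Eq. (9), the logistic function of LP."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(m1, m2, m3)))


def predicted_class(m1: float, m2: float, m3: float, cutoff: float = 0.5) -> str:
    """Apply the 0.5 probability cutoff mentioned in the text."""
    return "death" if death_probability(m1, m2, m3) >= cutoff else "survived"
```

A probability of 0.5 corresponds to LP = 0, so the 0.5 cutoff is equivalent to checking the sign of the linear prediction.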

Fig. 9 A Nomogram for prediction of death in COVID-19 severe patients was created using Random Forest (\(M1\)), ExtraTree (\(M2\)), and Gradient boosting (\(M3\))

Figure 10 depicts both the internal and external validation calibration plots. It demonstrates that each calibration curve is extremely near to the diagonal line, indicating a valid model. The AUC values for internal and external validation are 98.1% and 93.8%, respectively, which demonstrates the model’s exceptional performance.

Fig. 10 A Internal validation calibration plot, B External validation calibration plot

Figure 11 illustrates that the net benefit of each predictor model was positive (threshold 0.95), showing that each predictor contributed to the prediction of the outcome. Particularly, the whole model produced the most accurate results, necessitating the employment of three base models as predictors in the Stacking model.
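For readers unfamiliar with decision curve analysis, the net benefit plotted in such curves is conventionally defined as the true-positive rate per patient minus the false-positive rate per patient weighted by the odds of the threshold probability. The sketch below implements that standard definition in plain Python; it is an illustration of the general technique, and the text does not confirm that the authors used exactly this formulation. The function names and the toy data in the usage note are assumptions.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of treating patients whose predicted risk >= threshold.

    y_true holds 1 for the event (death) and 0 otherwise.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference curve: every patient is treated as high risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

With the toy labels `[1, 1, 0, 0]`, predicted risks `[0.9, 0.6, 0.7, 0.2]`, and a threshold of 0.5, the model's net benefit is 0.25 while the treat-all strategy yields 0.0; a model is considered useful at a threshold when its curve lies above both the treat-all and the treat-none (zero) references.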

Fig. 11 Decision curves analysis comparing different models to predict the death probability of patients with high-risk COVID-19

3.3.5 Performance evaluation of the model

Using the Nomogram score, we compared the actual death rate with the predicted death rate among high-risk patients. Table 7(A) shows that the proportions of death outcomes in the training set were 91.9% (125/136) for the death group and 8.1% (11/136) for the survived group, while the proportions in the test set were 91.18% (31/34) for the death group and 8.82% (3/34) for the survived group (Table 7(B)). The actual mortality rates differed considerably between the two groups (p < 0.001). Consequently, this scoring method can be used to predict patient outcomes.

Table 7 Performance evaluation of the model in the training cohort (A) and testing cohort (B) using Fisher’s exact probability test

3.3.6 Web application with back-end server

As an extension of this work, we developed an online application (https://qu-mlg.com/projects/covid-severity-grading-AI) that allows clinicians to input demographic and clinical data (LDH, O2 percentage, WBC, age, and CRP) as well as CXR images. BIO-CXRNET is a Google Cloud-based AI application that analyzes data to determine whether a user is a low-risk or high-risk patient. Our model identifies the patient's death risk probability if the patient is in the extreme risk group.

The backend application is written in Python using the Flask framework. The cloud application is deployed on an Apache 2.0 HTTP server running on an Ubuntu 20.04 LTS Google Compute Engine (GCE) instance. To reduce server costs, a GCE instance with a minimal configuration is used: a 4-core Intel Xeon processor, 8 GB of DDR4 memory, and 100 GB of balanced persistent storage. To handle the computation-intensive ML models in this resource-constrained context, the operating system kernel configuration is adjusted. These adjustments include enabling non-threaded pre-forking for the Apache web server so that TensorFlow processes have access to more RAM. The online application was developed with Flutter, a UI framework built on Google’s Dart programming language.

In the prototype system (screenshots are shown in Supplementary Fig. 1), radiologists, clinicians, or other users submit demographic information before being prompted to upload a CXR image file and four biomarkers: Lactate Dehydrogenase (LDH) (U/L), Oxygen Saturation (%), White Blood Cell count (WBC) (10^9/L), and C-Reactive Protein (CRP) (mg/dL). The CXR image file is a chest X-ray image in .png or .jpg format. It is uploaded to the server, where it is pre-processed and passed to the BIO-CXRNET model to assess whether the user is a low- or high-risk patient (Fig. 12). The data are processed by the AI backend, and the screen displays the response. The application displays the results and stores them in a local SQLite database. In conclusion, the application can reduce the load on the healthcare system by quickly assessing the severity risk of COVID-19 patients using a minimal number of blood biomarkers.
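The input-checking and result-storage steps described above can be sketched with the Python standard library alone. The table name `predictions`, its columns, the field names of `PatientInput`, and the validation ranges are all hypothetical choices for illustration; the actual schema and checks used by the application are not described in the text.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class PatientInput:
    """The five clinical inputs named in the text (field names are hypothetical)."""
    age_years: float
    ldh_u_per_l: float
    o2_saturation_pct: float
    wbc_1e9_per_l: float
    crp_mg_per_dl: float


def validate(patient: PatientInput) -> None:
    """Reject obviously invalid values before they reach the model."""
    if any(value < 0 for value in astuple(patient)):
        raise ValueError("clinical values must be non-negative")
    if patient.o2_saturation_pct > 100:
        raise ValueError("oxygen saturation is a percentage (0-100)")


def init_db(conn: sqlite3.Connection) -> None:
    """Create the hypothetical results table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS predictions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               age_years REAL, ldh_u_per_l REAL, o2_saturation_pct REAL,
               wbc_1e9_per_l REAL, crp_mg_per_dl REAL,
               risk_label TEXT NOT NULL,
               death_probability REAL
           )"""
    )


def store_prediction(conn: sqlite3.Connection, patient: PatientInput,
                     risk_label: str, death_probability: Optional[float] = None) -> int:
    """Validate the inputs, store one result row, and return its row id."""
    validate(patient)
    cursor = conn.execute(
        "INSERT INTO predictions (age_years, ldh_u_per_l, o2_saturation_pct, "
        "wbc_1e9_per_l, crp_mg_per_dl, risk_label, death_probability) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (*astuple(patient), risk_label, death_probability),
    )
    conn.commit()
    return cursor.lastrowid
```

The death probability column is nullable because, per the description above, it is only produced for patients assigned to the high-risk group.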

Fig. 12 COVID-19 severity risk detection tool using web application framework

4 Discussion

This study proposes a multimodal system for forecasting the risk of COVID-19-positive individuals and, as a result, stratifying the probable outcome of patients at high risk. Using CXR images and clinical data, the performances of both of the tests were examined separately and in conjunction. Both experiments demonstrated that the multimodal strategy outperformed the single modality. CXR and clinical features combined demonstrated an accuracy of 89.03% for risk group stratification among COVID-19 patients, compared to 80.11% and 86.01% for CXR and clinical features, respectively. Moreover, in the instance of outcome prediction for high-risk patients, the multimodal technique exceeded individual modality with a 92.3% accuracy, whereas CXR pictures and clinical data alone achieved an accuracy of 89.5% and 90.11%, respectively. As indicated in Table 8, the performance of the results provided in this work is superior to several state-of-the-art performances published in the literature.

Table 8 Comparison with state-of-the-art works in the literature

In our previous studies [75] on severe acute respiratory syndrome (SARS) [76], the Middle East respiratory syndrome (MERS) [77], and COVID-19 [78], we discovered that greater age predicted poor outcomes in COVID-19 patients. Since LDH signals tissue/cell death, it is a common indicator of tissue/cell damage. Serum LDH has been recognized as a critical biomarker for the activity and severity of idiopathic pulmonary fibrosis. According to Yan et al. [72], the increase in LDH is one of the most significant prognostic markers of lung injury in patients with the severe pulmonary interstitial illness. The increase in LDH levels in seriously ill COVID-19 patients suggests a worsening of lung injury.

According to studies conducted by Lu et al. [79], CRP testing upon admission is associated with the prediction of short-term mortality in COVID-19-related disease. Hepatocytes manufacture CRP when stimulated by cytokines originating from active leukocytes, such as those produced by infections, inflammations, or tissue injury. Our study indicated that elevated CRP levels upon admission were related to an increased risk of mortality among COVID-19 patients. These data indicate that these patients had developed significant inflammation or possibly a secondary infection, and that antibiotic treatment may be necessary. Increased CRP, a significant indicator of poor prognosis in acute respiratory distress syndrome, suggests a chronic inflammatory state [80, 81]. As a result of this continuous inflammatory response, COVID-19 patients develop massive gray-white lesions [82].

Based on prior research, the five biomarkers found in our study are connected with inflammation, immunology, and coagulation function, all of which may play a role in COVID-19 etiology. We hypothesized that the inflammatory response to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection is fundamental to COVID-19 pathogenesis and that dysregulation of the immune and/or coagulation systems results in severe clinical outcomes, such as acute respiratory distress syndrome (ARDS), coagulopathy, and septic shock, among others. Patients who died showed lower WBC and O2 percentages, as well as higher age, CRP, and LDH values, than survivors. High-mortality-risk COVID-19 patients may benefit from early treatment based on a comprehensive evaluation of the inflammatory response, immunologic dysfunction, and coagulopathy. As anticipated, the combination of clinical information and chest X-ray images aids in the accurate assessment of COVID-19 severity and mortality risk.

Additionally, our nomogram is applicable in a range of therapeutic contexts. To our knowledge, it outperforms other models proposed in the literature. In addition, the nomogram score served as a quantitative tool for identifying patients with a high risk of mortality upon admission and for guiding clinical management. Based on hospital admission information, COVID-19 patients were assigned to risk groups. Low-risk cases should be isolated and treated in isolation centers. For comprehensive care, survivors from high-risk categories should be admitted to a hospital with an isolation unit. The high-risk group is referred to the intensive care unit (ICU) for intensive treatment and assistance.

5 Conclusion

The current gold standard for the identification of coronavirus disease (COVID-19) is the reverse transcription-polymerase chain reaction (RT-PCR) test, despite its drawbacks, which include a longer turnaround time, false-negative rates of 20–25%, and more expensive equipment. In addition to RT-PCR, the detection of COVID-19 involves physical examinations, radiographic imaging, and blood testing. The severity of COVID-19 has been determined using clinical data, CT radiographic imaging, and signs and symptoms. There has been research employing DNA-based methods; however, such data are not readily available, and genome sequencing investigations are computationally costly. The objective of this study is to develop a multimodal system that combines chest X-ray (CXR) images and clinical data to predict the severity of COVID-19 infection in patients. The described severity classification method employs commonly available and less expensive radiological imaging (chest X-rays rather than CT) and a small number of biomarkers that can be easily obtained from Complete Blood Count tests, which makes it a potentially life-saving and cost-effective option. In addition, the technique improves the accuracy and dependability of the diagnosis. The proposed architecture makes use of CXR images and only five parameters (LDH, O2%, Age, WBC, and CRP) and demonstrates strong results for recognizing low- and high-risk COVID-19-positive individuals with high sensitivity. Moreover, the proposed nomogram-based technique accurately predicts the likelihood of death among high-risk patients. Our prognostic nomogram for COVID-19 patients displayed excellent discrimination and calibration based on several risk markers. Since the model uses both CXR images and clinical factors, it can address physicians' concerns about relying on radiographic images alone for prognostic purposes.
This approach can determine a patient's probable risk upon admission, which can considerably improve hospital resource management. Although the study used data from initial variants, the clinical biomarkers identified in this work are supported by a large pool of clinical studies conducted on other variants; as a result, we anticipate that this model will be equally applicable to Omicron and other future variants that may emerge in the upcoming winter. As a result, physicians could use this technique to make a swift and objective determination to enhance patient stratification management and possibly reduce death rates. However, this quantitative tool should be tested in large-scale prospective multicenter and multi-country trials to verify its clinical utility.