1 Introduction

To date, there have been 524,878,064 confirmed cases of COVID-19 (COronaVIrus Disease 2019), including 6,283,119 deaths, reported by the World Health Organization (WHO) [1], and cases are still increasing worldwide. Therefore, rapid and self-administered testing tools for COVID-19 are more important than ever, not only to confirm cases but also to prevent the spread of the virus by enabling precautionary steps such as quarantine and self-isolation.

The main COVID-19 diagnostic tool is presently the RT-PCR test (i.e. the swab test), yet this is an expensive and invasive test that usually requires taking the patient to a test centre, which may not be feasible in many cases because of the severity of the illness, the patient's limited mobility, and similar constraints.

For these reasons, there have been several attempts to develop rapid, portable and self-applicable COVID-19 diagnostic tools. However, this is a challenging task, since the symptoms of COVID-19 overlap with those of influenza and other respiratory diseases and cannot be distinguished easily, especially in the winter season (when influenza and other respiratory diseases circulate more). The primary symptoms of COVID-19 are loss of smell, loss of taste, continuous cough, and fever (high temperature) [2]. Vaccination, mask wearing and social distancing are the primary measures for controlling the spread of the disease [3]. Hence, clinicians are in urgent need of novel tools to diagnose and confirm COVID-19 cases.

For this reason, machine learning–based methods have recently received great attention for COVID-19 diagnosis. Among them, the analysis of cough audio signals is of particular importance, as coughing is a predominant manifestation of COVID-19 and carries characteristics unique to the disease [2, 4, 5].

In this manuscript, we propose a deep learning–based COVID-19 diagnostic/confirmation tool that uses only the coughing audio signal. Our aim is to increase detection accuracy so that the transmission of the virus can be diminished by taking the necessary steps. The proposed method uses a multi-branch neural network that combines features extracted from different domains of the coughing audio signal. Technical details of each part of the proposed approach are presented in Section 3 and validated on several datasets (see Sections 4.1 and 5) against the state-of-the-art deep learning–based architectures described in Section 5.1.

2 Related works

RT-PCR tests are the standard means of confirming COVID-19 infection. Another diagnostic tool for COVID-19 is medical imaging, such as chest X-ray and CT, which can additionally be used to confirm cases misclassified by RT-PCR. However, imaging is even more expensive and, due to the emitted radiation, cannot be used on some vulnerable patient groups (such as pregnant women). The need for computerized analysis for fast and accurate diagnosis has come to the fore during this pandemic. Several works using deep learning algorithms on CT scans [6,7,8,9,10] and machine learning algorithms on cough sounds [11,12,13,14,15,16,17,18,19,20,21,22] have been proposed in the literature. The works on CT scans [6,7,8,9,10] provide information about the severity of the individual's lung damage. A recent survey [23] examined numerous studies and open-source CT datasets and reported that open cough-based COVID-19 datasets are few and small in size.

In [11], the authors combined handcrafted features with Visual Geometry Group (VGG) features and reached an Area Under the Curve (AUC) of 0.82 using only 86 cough samples. Feeding MFCC spectrograms into a Convolutional Neural Network (CNN) architecture, an accuracy of 92.85% was reported on 543 cough sounds (70 of which were COVID-19) in the work of [12]. In [13], an ensemble of CNN classifiers was employed and an AUC of 0.77 was reached on 1502 recordings. Using a CNN-based deep learning model fusing voice, coughing and breathing information, an AUC of 0.71 was reached on 1486 samples in [14]. On a dataset of 1273 samples, the work of [15] employed a cough-specific CNN, a pre-trained Residual Network (ResNet) model and a gender-specific pre-trained ResNet model, achieving AUCs of 0.62, 0.70 and 0.71, respectively. In [16], using MFCC features as input and a Support Vector Machine (SVM) classifier aided by a speech enhancement technique, accuracies of 74.1% and 85.7% were reached on two separate datasets; however, the highest AUC reached on cough sounds was only 0.5144. Using MFCCs as input to an ensemble CNN model based on ResNet50, the authors of [17] achieved an AUC of 0.97 on their private dataset of 5320 subjects. Employing a LeNet-1-like architecture, an accuracy of 97.5% was reached on a small test set of 18 samples in [18]. In [19], an AUC of 0.846 was reached on 517 samples using breath and cough information; however, using only the cough recordings of 53 subjects, a ResNet-based model achieved an AUC of just 0.57. In [20], using the cough sounds of 76 post-COVID-19 and 40 healthy subjects, an accuracy of 67% was obtained with a VGG19 CNN model. In [21], an AUC of 0.771 was reached on a total of 2883 cough sounds using MFCCs and additional features such as the presence of respiratory diseases, fever, and muscle pain. A detailed comparison of the related works is given in Table 1.

Among the studies using crowdsourced data, the CNN-based study with the largest number of publicly available samples is [21]. We therefore use the work of [21] as the baseline comparison and refer to it as the Baseline Model throughout the manuscript. We also note that the existing studies do not validate whether their deep learning models generalize, as they do not test on unseen datasets.

Table 1 A detailed comparison of the related works on cough-based (top part) and image-based (bottom part) approaches for COVID-19 detection

3 Proposed models

In this paper, we develop four alternative deep learning–based COVID-19 detection models, namely the MFCC-based, Spectrogram-based and Chromagram-based models and the ensemble MSCCov19Net model, after extensive investigation, analysis and trials. We then test and compare their performance through successive experiments.

3.1 MFCC-based model

MFCCs are well-known hand-crafted attributes that have proven to be among the most useful features in audio signal processing [24,25,26]. They are derived from the mel-frequency cepstrum (MFC), a short-term characterization of the power spectrum of an audio waveform based on a discrete cosine transform of the log power spectrum. For this work, we extract 39 MFC coefficients from each coughing audio signal using the Python librosa audio signal processing package [27]. More precisely, the coughing audio waveform is first resampled to 22.5 kHz, then the feature extraction function is applied with a hop length of 23 ms, a window length of 93 ms, and a Hann window. The output MFCC features are then averaged along the time axis, yielding a 1D vector of 39 coefficients.
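For illustration, a minimal librosa sketch of this extraction step is given below; the keyword arguments mirror the parameters stated above, while the conversions from milliseconds to samples are our own.

```python
import librosa

# Sketch of the MFCC extraction described above: load/resample the cough
# recording, compute 39 MFCCs with ~23 ms hop and ~93 ms Hann window,
# then average over time to obtain a 1D feature vector.
def extract_mfcc(path, sr=22500, n_mfcc=39):
    y, _ = librosa.load(path, sr=sr)          # resample on load
    hop = int(0.023 * sr)                     # ~23 ms hop length
    win = int(0.093 * sr)                     # ~93 ms window length
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop, n_fft=win,
                                win_length=win, window="hann")
    return mfcc.mean(axis=1)                  # shape: (39,)
```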

For this model, we create a small Multi-Layer Perceptron (MLP) network (see Fig. 1), which consists of 4 fully connected (FC) layers and a single output layer. The FC layers contain 1024, 2048, 512 and 512 nodes respectively, each with a Rectified Linear Unit (ReLU) activation function and a dropout layer. The last dense layer is a single node with a Sigmoid activation that gives the probability that the given cough signal is COVID-19 positive.

Fig. 1 The complete architecture of the MFCC-based model
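A minimal Keras sketch of this MLP is shown below; the layer widths and activations follow the description above, while the dropout rate is an assumption (the rates actually searched are listed in Section 4.2).

```python
from tensorflow.keras import layers, models

# Sketch of the MFCC-based MLP: 4 dense layers with ReLU and dropout,
# followed by a single sigmoid output node.
def build_mfcc_mlp(input_dim=39, dropout=0.5):
    model = models.Sequential([layers.Input(shape=(input_dim,))])
    for units in (1024, 2048, 512, 512):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))  # P(COVID-19 positive)
    return model
```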

3.2 Spectrogram-based model

A spectrogram is a visual representation of the range of frequencies in a waveform as it changes over time; in other words, it can be thought of as a 2D signal that shows the relation between time and frequency. Recently, spectrograms have been used as input to many CNN architectures for various tasks, including speech recognition [28, 29], speaker verification [30, 31] and speech enhancement [32, 33]. Inspired by these works, we use spectrograms to extract meaningful information from cough audio signals. Spectrograms are extracted via the librosa library from the previously obtained MFC coefficients, then rescaled to 128 × 40 and normalized to [0, 1] before being fed to the proposed network.

As spectrograms are 2D signals, we create a small Convolutional Deep Neural Network (CDNN) architecture (see Fig. 2), inspired by the seminal VGG network [34,35,36]. The network contains 3 convolutional blocks, a flatten layer, 3 FC layers and a single output layer for classification. Each convolutional block is a composite function consisting of a convolutional layer (with 32, 64 and 64 filters respectively), a ReLU activation, a 2 × 2 max pooling layer with stride 2 and a Batch Normalization layer. The FC layers contain 256, 64 and 64 nodes respectively, each with a ReLU activation function and a dropout layer. The last dense layer is again a single node with a Sigmoid function.

Fig. 2 The complete architecture of the spectrogram-based approach
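A minimal Keras sketch of this CDNN is given below; the filter counts, pooling and batch normalization follow the description above, while the 3 × 3 kernel size and the dropout rate are assumptions.

```python
from tensorflow.keras import layers, models

# Sketch of the spectrogram-based CDNN: three conv blocks
# (conv -> ReLU -> 2x2 max-pool -> batch norm), then three
# dense layers with dropout and a sigmoid output.
def build_spectrogram_cnn(input_shape=(128, 40, 1), dropout=0.5):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 64):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu",
                                padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
        model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    for units in (256, 64, 64):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```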

3.3 Chromagram-based model

A chromagram, also known as a Harmonic Pitch Class Profile, contains the energy distribution of an audio wave across the pitch classes [37]. Chroma-based features are widely used on audio signals to analyze their pitch content [38,39,40]. For this work, we extract 12-element 1D chroma features (via librosa) from each coughing audio signal and use them as input to the proposed network.
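A sketch of this step with librosa is given below; since the text does not specify the chroma variant or its parameters, we assume the standard STFT-based chroma with default settings.

```python
import librosa

# Sketch of the 12-bin chroma feature extraction: compute an STFT-based
# chromagram and average it over time into a 12-element vector.
def extract_chroma(path, sr=22500):
    y, _ = librosa.load(path, sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # shape: (12, T)
    return chroma.mean(axis=1)                        # shape: (12,)
```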

For this model, we use an MLP architecture similar to the MFCC-based model. The model consists of 4 FC layers and a single output layer. The FC layers contain 1024, 2048, 512 and 512 nodes respectively, each with a ReLU activation function and a dropout layer. The last dense layer is a single node with a Sigmoid function. The model is shown in Fig. 3.

Fig. 3 The complete architecture of the chromagram-based approach

3.4 Ensemble MSCCov19Net model

We finally propose a multi-branch CNN architecture called MSCCov19Net to detect COVID-19 from a coughing audio signal alone. The proposed architecture combines the previously described neural features extracted from three domains of the coughing audio signal: MFCC features, spectrogram images and chroma features (chromagram).

The overall architecture of the proposed ensemble neural network is shown in Fig. 4. MSCCov19Net consists of three branches; each branch extracts distinct and informative neural features from one of the aforementioned sources, and these neural features are then concatenated and passed to the classification network.

Fig. 4 The complete architecture of the proposed MSCCov19Net network

The first branch extracts neural features \({F_{n}^{1}} \in \mathbb {R}^{C_{1}^{\prime }}\), where \(C_{1}^{\prime }=256\), from the MFCCs using 2 dense layers. The first dense layer consists of 512 nodes with a ReLU activation and a Dropout layer; the second contains 256 nodes, again with a ReLU activation and a Dropout layer. The Dropout layers are used to mitigate overfitting.

The second branch extracts neural features \({F_{n}^{2}} \in \mathbb {R}^{C_{2}^{\prime }}\), where \(C_{2}^{\prime }=256\), from spectrogram images of size 128 × 40. This branch contains 3 composite-function blocks, a flatten layer, a dense layer of 256 nodes and a Dropout layer. Each composite function consists of a convolutional layer (with 32, 64 and 64 filters respectively), a ReLU activation, a 2 × 2 max pooling layer with stride 2 and a Batch Normalization layer.

The third branch, which extracts neural features \({F_{n}^{3}} \in \mathbb {R}^{C_{3}^{\prime }}\), where \(C_{3}^{\prime }=256\), from the chroma-based features, is similar in architecture to the MFCC branch: 2 dense layers with 512 and 256 nodes respectively, each with a ReLU activation function and a Dropout layer.

Finally, extracted neural features are combined to create a composite neural feature vector as follows:

$$ \mathcal{F} = \left[ {F_{n}^{1}}; {F_{n}^{2}}; {F_{n}^{3}}\right ], $$
(1)

where [;] denotes the concatenation operation, \({F_{n}^{1}} \in \mathbb {R}^{256} \), \({F_{n}^{2}} \in \mathbb {R}^{256}\) and \({F_{n}^{3}} \in \mathbb {R}^{256}\) are the neural features extracted from the MFCCs, spectrogram images and chroma-based features respectively, and \(\mathcal {F} \in \mathbb {R}^{768}\) is the composite neural feature vector. \(\mathcal {F}\) is then passed to the classification network, a shallow network of fully connected layers: 2 dense layers of 64 nodes each, with ReLU activations and Dropout layers. The last layer is a single node with a Sigmoid function, which gives the probability that a given coughing audio signal is COVID-19 positive.
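Putting the three branches together, a minimal Keras functional-API sketch of MSCCov19Net is shown below; the branch widths and the 768-dimensional fusion follow the description above, while the kernel size and dropout rate are assumptions.

```python
from tensorflow.keras import layers, Model

# Sketch of MSCCov19Net: three feature branches (MFCC, spectrogram,
# chroma), concatenated into a 768-dim vector and classified by a
# shallow fully connected head.
def build_msccov19net(dropout=0.5):
    # Branch 1: MFCC (39,) -> 512 -> 256
    mfcc_in = layers.Input(shape=(39,), name="mfcc")
    f1 = layers.Dropout(dropout)(layers.Dense(512, activation="relu")(mfcc_in))
    f1 = layers.Dropout(dropout)(layers.Dense(256, activation="relu")(f1))

    # Branch 2: spectrogram (128, 40, 1) -> 3 conv blocks -> 256
    spec_in = layers.Input(shape=(128, 40, 1), name="spectrogram")
    f2 = spec_in
    for filters in (32, 64, 64):
        f2 = layers.Conv2D(filters, (3, 3), activation="relu",
                           padding="same")(f2)
        f2 = layers.MaxPooling2D((2, 2), strides=2)(f2)
        f2 = layers.BatchNormalization()(f2)
    f2 = layers.Flatten()(f2)
    f2 = layers.Dropout(dropout)(layers.Dense(256, activation="relu")(f2))

    # Branch 3: chroma (12,) -> 512 -> 256
    chroma_in = layers.Input(shape=(12,), name="chroma")
    f3 = layers.Dropout(dropout)(layers.Dense(512, activation="relu")(chroma_in))
    f3 = layers.Dropout(dropout)(layers.Dense(256, activation="relu")(f3))

    # Fusion (Eq. 1): F = [F1; F2; F3] in R^768, then the classifier head.
    fused = layers.Concatenate()([f1, f2, f3])
    h = layers.Dropout(dropout)(layers.Dense(64, activation="relu")(fused))
    h = layers.Dropout(dropout)(layers.Dense(64, activation="relu")(h))
    out = layers.Dense(1, activation="sigmoid")(h)
    return Model([mfcc_in, spec_in, chroma_in], out)
```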

4 Experimental setup

4.1 Datasets

Aiming to train and validate the proposed methods, we employ several publicly available datasets: Coughvid [41], Coswara [42], Virufy [21] and NoCoCoDa [43].

Coughvid is a crowdsourced dataset containing 20,072 audio recordings: 1010 labeled COVID-19, 8562 labeled healthy, 1742 labeled symptomatic, and 8758 unlabeled. Some of the files include non-cough sounds and environmental noise. In order to have clean data for training, 651 COVID-19-labeled and 660 healthy-labeled audio files were manually selected; symptomatic-labeled and unlabeled files were excluded from this study.

Coswara contains data from 1503 subjects, each providing deep breathing, shallow breathing, heavy cough, shallow cough, counting from zero to twenty (slowly and quickly), and vowel phonation for the letters “a”, “e” and “o”. For this paper, we used the heavy and shallow cough sounds. In the experiments, data labeled “positive_asymp”, “positive_mild” and “positive_moderate” were used for the COVID-19 class, and the remainder for the healthy class.

Virufy is a clinical dataset acquired in a clinical environment from 16 patients, 7 labeled positive and 9 labeled negative for COVID-19, validated against PCR test results.

NoCoCoDa is a non-clinical dataset of 73 annotated cough events obtained from 10 patients; it contains only COVID-19-positive reflex cough sounds. The cough segments were annotated from online media interviews, and background noise such as talking or music is present in some of the recordings. In this study, no pre-processing was applied to alleviate this noise; the original recordings were used as-is.

The sound files are in .webm and .ogg format in Coughvid, .wav in Coswara, .mp3 in Virufy, and .wav in NoCoCoDa. We converted all files to .wav format without applying noise reduction.

For training purposes, the Coswara and Coughvid datasets are combined and divided into training, validation and test groups with an 80%-10%-10% split. The Virufy and NoCoCoDa datasets, on the other hand, are used only for inference, to cross-validate the proposed model; in other words, none of the data from Virufy and NoCoCoDa is used in the training step, only in the testing step.

We extract cough segments using the code provided with the Coswara/Coughvid datasets. The total numbers of extracted segments are 2960/370/370 for training/validation/testing, respectively.

4.2 Implementation details and training procedure

The proposed deep learning model is implemented using the TensorFlow 2.3.0 Python library. A binary cross-entropy loss function and a Stochastic Gradient Descent (SGD) optimizer are used for training. We use an adaptive learning rate strategy, starting at 0.1 and dividing by 10 every 100 epochs, with a batch size of 8. The network is trained for 1000 epochs.
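This setup can be sketched in Keras as follows; `model`, `x_train`, `y_train`, `x_val` and `y_val` are placeholders assumed from the earlier steps.

```python
import tensorflow as tf

# Step learning-rate schedule: start at 0.1, divide by 10 every 100 epochs.
def lr_schedule(epoch, lr):
    return 0.1 * (0.1 ** (epoch // 100))

# `model` is any of the networks sketched above; data arrays are placeholders.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=1000, batch_size=8,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```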

In order to obtain the optimum performance from the proposed model, we search over combinations of the following hyper-parameters:

  • Optimizers: Adadelta, Adam, Adamax, RMSprop, SGD

  • Activation Functions: ReLU, Sigmoid, Softmax, Softplus, Tanh

  • Dropout Rate: 0.0, 0.3, 0.5, 0.8, 0.9

The optimal hyper-parameters (optimizer, activation function, dropout rate, etc.) are chosen from the options above via a grid search strategy.
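A hypothetical sketch of this grid search is shown below; `train_and_evaluate` is an assumed helper (not part of the original work) that builds and trains the model with the given configuration and returns its validation AUC.

```python
from itertools import product

# Grid search over the hyper-parameter options listed above.
optimizers  = ["adadelta", "adam", "adamax", "rmsprop", "sgd"]
activations = ["relu", "sigmoid", "softmax", "softplus", "tanh"]
dropouts    = [0.0, 0.3, 0.5, 0.8, 0.9]

best_cfg, best_auc = None, 0.0
for opt, act, rate in product(optimizers, activations, dropouts):
    auc = train_and_evaluate(optimizer=opt, activation=act, dropout=rate)
    if auc > best_auc:
        best_cfg, best_auc = (opt, act, rate), auc
```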

In order to avoid overfitting and provide better generalization, we use data augmentation before, and regularization during, the training phase. Data augmentation is conducted by applying “Pitch Shifting” and “Noise Addition” to the cough audio signals in the training dataset, and regularization is conducted via a composite term \({\mathscr{L}}_{\mathcal {R}}\) on the classification network, defined as

$$ \mathcal{L}_{\mathcal{R}} = \lambda_{1} L_{1} + \lambda_{2} L_{2} $$
(2)

where λ1 and λ2 are the weighting coefficients of the L1 and L2 regularization terms, both set to 0.01 empirically.
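The two augmentations and the composite penalty can be sketched as follows; the pitch-shift range and noise amplitude are our assumptions (not values reported above), while the L1/L2 weights follow the λ1 = λ2 = 0.01 setting of Eq. (2).

```python
import numpy as np
import librosa
import tensorflow as tf

# Pitch shifting and additive-noise augmentation of a cough waveform.
def augment(y, sr=22500):
    shifted = librosa.effects.pitch_shift(y=y, sr=sr,
                                          n_steps=np.random.uniform(-2, 2))
    noisy = y + 0.005 * np.random.randn(len(y))
    return shifted, noisy

# Composite L1/L2 penalty (Eq. 2) attached to a classifier dense layer.
regularized_dense = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01))
```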

5 Experimental results

To evaluate the proposed approaches quantitatively, we report accuracy (Acc) scores along with Area Under the Curve (AUC) scores on the Coswara/Coughvid, Virufy and NoCoCoDa test sets, with the model trained only on the Coswara/Coughvid dataset.

Firstly, we discuss the individual performance of each proposed model. The results on the Coswara/Coughvid dataset are presented in Table 2. The quantitative results validate the robustness of the proposed multi-branch network, which obtains a 4.5% increase in classification accuracy and a 2.9% increase in AUC over the second best approach (MFCC-based). The table also shows that the chroma-based model yields the lowest performance, probably due to the environmental noise (speech and music) in the crowdsourced datasets; this behaviour (sensitivity to noise) has been reported before in [44, 45]. The multi-branch model, on the other hand, provides a significant increase in performance thanks to the multiple perspectives and diverse domain knowledge obtained from the 1D and 2D features. As, to the best of our knowledge, there is currently no approach that uses chroma-based features for this specific task, we re-train the network with and without the chromagram branch to see the effect of chroma features on the multi-branch MSCCov19Net. The table shows a significant improvement in performance when chroma-based features are included. Although we do not have a definitive explanation for this boosted performance, which contrasts with the low performance of their standalone usage, we believe that combining multiple perspectives and diverse domain knowledge in the multi-branch network mitigates the noise sensitivity of the chroma-based features.

Table 2 Comparison of the proposed model with the individual branches tested on the Coswara/Coughvid dataset (trained on Coswara/Coughvid)

To further assess the proposed approach, we perform a cross-validation test on the Virufy (clinical and clean) dataset; that is, we train the proposed method on the Coswara/Coughvid dataset and test it only on Virufy.

The cross-validation results on the Virufy dataset are presented in Table 3. The quantitative results confirm the robustness of the proposed multi-branch method, which obtains a 3.4% increase in classification accuracy over the second best approach (MFCC-based) and an 8.4% increase in AUC over the second best approach (Chromagram-based). Since the Virufy dataset was collected in a clinical, controlled environment with minimal environmental noise, we can speculate that this is why the chroma branch achieves a noticeably higher AUC score than in the previous experiment.

Table 3 Comparison of the proposed model with the individual branches tested on the Virufy dataset (trained on Coswara/Coughvid)

5.1 Performance comparison with the state-of-the-art architectures

In this part, we compare the performance of MSCCov19Net with diverse deep CNN architectures: a basic CNN model [36], ResNet50 [46], EfficientNetB0 [47], MobileNetV2 [48], Xception [49] and the baseline model proposed in [21].

ResNets learn residual functions with reference to the layer inputs, instead of learning unreferenced functions as in traditional CNN layers. EfficientNet uses a compound scaling approach that jointly adjusts input resolution, depth and width with a set of combined coefficients. MobileNetV2 is a 53-layer-deep CNN designed to operate well on mobile resources. Xception is a 71-layer-deep CNN that relies solely on depthwise separable convolution layers. The methods used for comparison are trained with hyper-parameter settings similar to the proposed method.

We train the ResNet50, EfficientNetB0, MobileNetV2 and Xception models using the Adam optimizer with a learning rate of 0.001, binary cross-entropy loss and a batch size of 32. The basic CNN architecture consists of three convolutional layers with 128, 256 and 256 filters and kernel sizes of 3, followed by three FC layers with 256 nodes each and an output layer with a single node. A ReLU activation function and a dropout rate of 0.5 are used on all layers except the last, which uses a Sigmoid function. For this network, an SGD optimizer with a learning rate of 0.01 is used.
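For completeness, a minimal Keras sketch of this basic CNN is given below; the input shape (taken to match the spectrogram setup) and the use of 'same' padding are assumptions on our part.

```python
from tensorflow.keras import layers, models

# Sketch of the basic CNN comparison model: three conv layers
# (128/256/256 filters, kernel size 3) with ReLU and 0.5 dropout,
# three dense layers of 256 nodes, and a sigmoid output node.
def build_basic_cnn(input_shape=(128, 40, 1)):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (128, 256, 256):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu",
                                padding="same"))
        model.add(layers.Dropout(0.5))
    model.add(layers.Flatten())
    for _ in range(3):
        model.add(layers.Dense(256, activation="relu"))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```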

We first present the results on the Coswara/Coughvid dataset in Table 4. As can be seen from the table, our approach reaches the highest accuracy and AUC scores among all compared approaches. Furthermore, we show the receiver operating characteristic (ROC) curves in Fig. 5 (left), from which it is clear that the proposed method yields a significant overall improvement on both metrics.

Table 4 Comparison with the state-of-the-art methods tested on the Coswara/Coughvid datasets (trained on Coswara/Coughvid)
Fig. 5 ROC curves of each individual approach trained and tested on the Coswara/Coughvid dataset (left, cf. Table 4) and trained on Coswara/Coughvid and tested on Virufy (right, cf. Table 5)

Evaluation scores of the state-of-the-art models on the Virufy test set are given in Table 5. Likewise, the quantitative outcomes confirm the robustness of the proposed model, which yields a 1.7% increase in classification accuracy over the second best approach (Baseline Model), and increases in AUC of 4.2% and 7.8% over the second best (Basic CNN) and third best (Baseline Model) approaches, respectively. The corresponding ROC curves are shown in Fig. 5 (right).

Table 5 Comparison with the state-of-the-art methods tested on the Virufy dataset (trained on Coswara/Coughvid)

Moreover, to assess the generalization ability of the proposed approach, another unseen test dataset, NoCoCoDa (non-clinical), is employed. The classification accuracies of the compared approaches are given in Table 6. The proposed model outperforms the Baseline Model by 21.9% in accuracy, demonstrating superior generalization ability; its accuracy is also 13.7% higher than that of the basic CNN. It can be concluded from Table 6 that MSCCov19Net yields superior and promising results on an unseen test set compared with previous studies on cough-based COVID-19 detection. Since the NoCoCoDa dataset includes only COVID-19-positive subjects, AUC scores are not reported.

Table 6 Comparison with the state-of-the-art approaches tested on the NoCoCoDa dataset (trained on Coswara/Coughvid)

The inference and training time analysis of the proposed method, along with the state-of-the-art approaches, is given in Table 7. Inference time is the average classification time for a single cough audio signal, while training time is the average training time per epoch. The inference times indicate that the proposed method is suitable for remote, real-time operation. Experiments were conducted with an i7-7700K 4.20 GHz processor, 16 GB RAM, and a GTX1060 6 GB GPU.

Table 7 Comparison of the inference time (in milliseconds) and training time (in seconds) of the proposed method along with the state-of-the-art approaches

6 Discussion

The RT-PCR technique, applied to pharyngeal swabs, is the baseline indicator for COVID-19 detection. However, it suffers from a high false-negative rate, limited detection capability and long turnaround times for test results [50,51,52,53].

Cases have been reported in which RT-PCR results were negative while CT findings showed the ground-glass opacities characteristic of COVID-19 [54]. Therefore, in suspected cases, CT confirmation after RT-PCR is recommended in the literature [51, 52]. Although CT offers higher sensitivity than RT-PCR [51, 55], it is not practical owing to radiation exposure, waiting time and cost. Moreover, a patient with a positive RT-PCR result may have a normal CT before the onset of symptoms, as reported in [56], and lower specificity is one of the main drawbacks of CT-based studies [57]. Both RT-PCR- and CT-based diagnosis require a clinical visit, which breaches isolation and social-distancing rules; the same applies to COVID-19 detection based on invasively obtained blood tests used as clinical data [58, 59]. In contrast, machine learning–based remote cough sound analysis is a promising candidate for medical decision support systems that determine COVID-19 status while minimizing clinical visits.

Owing to the scarcity of publicly available datasets compared to CT studies [23, 53], COVID-19 detection studies on cough sounds are fewer than CT studies, even though cough is one of the major symptoms. There are also several limitations on artificial intelligence–based COVID-19 detection. First, crowdsourced data may include noisy and mislabeled samples, reducing the performance of classification models. Second, datasets may be imbalanced, with an insufficient number of COVID-19-positive samples. Finally, environmental factors may introduce bias when cough sounds are recorded; furthermore, if the training, validation and test sets are not disjoint, the reported performance may be biased [60]. These challenges make cross-dataset validation necessary to obtain robust and generalizable results. Therefore, since crowdsourced datasets include plenty of samples, we trained on crowdsourced data and cross-validated the performance of the model on a clinical (controlled) dataset and a non-clinical dataset with few samples, taking the above issues into account.

This study has some limitations arising from the available datasets. Working with noisy data collected from different mobile devices negatively affects the performance of the proposed model. Furthermore, audio transmitted over VoIP or cellular networks is subject to compression, which can degrade audio quality and may likewise reduce the performance of the proposed method.

In the literature, there are few works on cough-based COVID-19 detection, and most of them report results either on small, controlled datasets or without cross-validating the performance on different datasets. As a remedy, we employed commonly used CNN architectures for comparison, not only training/testing on the Coswara/Coughvid dataset but also cross-validating on the Virufy and NoCoCoDa datasets. To the best of our knowledge, this is the first model to cross-validate crowdsourced cough data against clinically validated and non-clinical cough data, using four separate datasets, for cough-based COVID-19 detection.

7 Conclusion

In this paper, we have presented a supervised deep neural network–based cough sound analysis method for COVID-19 detection, which provides state-of-the-art performance on the benchmark datasets in metrics such as accuracy and AUC. The proposed multi-branch network MSCCov19Net has better generalization capability than recent neural network models.

As a future perspective, we intend to explore the long-term effects of COVID-19 by applying sound analysis approaches to lung sound data acquired with electronic stethoscopes/phones. Such a setup could respect social-distancing rules during sample collection through smart applications or sheltered cabins; in fact, lung sounds are usually collected from the patient's back, which might minimize healthcare workers' exposure to virus-carrying saliva droplets. Ideally, once a specific and acceptable accuracy is reached, these algorithms may help validate the predictions of RT-PCR tests (as repeated tests are sometimes required due to misclassifications [50]) in a remote and non-invasive way. Additionally, in recent years, studies on remote Parkinson's detection from voice [61] have shown promising results even over the standard telephone network [62]. Provided that a sufficiently diverse and labeled voice dataset can be collected with mobile applications, the proposed deep learning–based system can be embedded within a cloud-based infrastructure, allowing fast, large-scale screening. Such a system could also be trained, with simultaneous CT acquisition for confirmation, to give an idea of the degree of lung involvement, thereby reducing individuals' exposure to harmful X-rays.

Since the COVID-19 cough sound literature has only just begun to develop, it will be extremely important to collect follow-up data from COVID-19 patients for tracking the progression [63] and grading of the disease on a person-specific basis. Recently, Internet of Things–based wireless approaches have been proposed for remote health data monitoring [64, 65]; these may be adapted for the rapid detection of other future diseases that affect the lungs.