Introduction
Depression is a disease that threatens the mental health of modern people and is widely recognized as a problem that needs to be solved, yet there is still a lack of understanding of, and agreement on, its proper treatment. Depression leads to a decline in daily functioning, with its main symptoms being loss of motivation and feelings of sadness or unhappiness.
1 The World Health Organization (WHO) expects depression to be the most burdensome disease for humans in 2030, with more than 264 million people in the world suffering from it.
2 WHO also notes that, globally, depression is a major cause of disability and contributes to the burden of disease on people.
2 According to the (US) National Institute of Mental Health (NIMH), 17.3 million people, accounting for 7.1% of the adult population in the United States, have had at least one major depressive episode.
3 Additionally, 3.2 million American teenagers, which account for 13.3% of the population between the ages of 12 and 17 years, suffer from the same symptoms.
3 What is more concerning is that 35% of adults and 60% of adolescents in the United States who suffer from depression are not receiving proper treatment, even though depression occurs in various age groups.
3 As such, the mental health status of modern people has emerged as a major social problem and has begun to be perceived as a problem that can no longer be ignored.
Despite these concerns, mental health services remain underused in many countries. In 2016, mental health service utilization rates among those diagnosed with mental illness were 43.1% in the United States, 46.5% in Canada, 34.9% in Australia, 35.5% in Spain, and 22.2% in Korea,
4 indicating that mental health service utilization is significantly lower than 50% worldwide. According to Andrade et al.,
5 the three main reasons for the low utilization of mental health services are low perceived need, structural barriers, and attitudinal barriers. Low perceived need refers to a lack of awareness of mental health issues: the patients themselves believe no help is needed. Structural barriers include concerns about money, lack of time, poor accessibility, and limited insurance coverage. Attitudinal barriers include the belief that a mental disorder will improve on its own, prejudice against mental health services, and distrust of treatment effects, all of which lead patients not to use the services. Unlike with physical diseases, patients suffering from mental illness often do not understand the extent of their disease, often do not receive treatment due to low motivation, and often do not know that their problems can be improved by using mental health services. In general, mental illness is highly treatable with early intervention, but the later the treatment, the more serious the disorder can become,
6 so proper awareness and early detection of mental illness are important and necessary steps in treating the disease. Furthermore, recognizing the disease and knowing its exact name increases the probability of early detection and enhances positive treatment effects.
7
One of the important steps in treating depression is correct self-awareness of this condition. Self-diagnosis of depression allows people to check their degree of depression themselves, and there are many self-diagnosis tests for depression. Examples of self-diagnosis instruments for depression include the Beck Depression Inventory (BDI),
8 the Center for Epidemiologic Studies Depression Scale (CES-D),
9 the Patient Health Questionnaire-9 (PHQ-9),
10 and the Geriatric Depression Scale (GDS).
11 There are various self-diagnosis tables for depression, but it is difficult for people with mental disabilities to identify their condition through this method for the same reasons as the low utilization rate of mental health services. Thus, an alternative could be a system that automatically (without specific patient involvement) identifies depression levels in such patients.
In fact, there have been many attempts to predict or detect depression through various techniques.
12–18 In recent years, the expansion of social media such as Twitter and Facebook has raised interest in automatic depression detection techniques.
12 As social media has become an integral part of modern life, vast amounts of user-generated text are produced, meaning considerable textual data are available for mental health analysis. This is a valuable resource for assessing depression and mental disorders through text, the direction our research pursues. Existing studies on text analysis for depression or mental disorders
12–16 were conducted by establishing a classifier to determine whether a text is related to symptoms of depression and, further, to assess the degree of concern for depression. Sentence classification has been carried out with techniques such as naïve Bayes classification (NBC), latent Dirichlet allocation (LDA), support vector machines (SVM), and logistic regression, using vocabularies built with relevant experts. In particular, in a study conducted by Yazdavar et al.,
13 by whose work we were most inspired, the PHQ-9 was used as the text classification criterion. In our study, the performance of sentence classification was improved by applying more advanced natural language processing (NLP) methods than in that preceding study, and the resulting sentence classification was extended into a model that can judge depression from the classification results. Therefore, the purpose of our study was to associate textual data with the nine symptoms of depression in the PHQ-9 through NLP techniques, and to identify users' depression based on those associations.
The remaining sections are organized as follows. The Related research section introduces the diagnosis of depression and various depression self-diagnosis instruments; we also analyze applications of NLP in health care and review prior studies on online depression detection. The Depression classification model section introduces our depression classification model and describes its details and mechanisms. The Experiments section describes the experiments used to evaluate our model and analyzes their results. The academic and practical significance of this study is described in the Discussion section, and the Conclusion section closes with a summary of key findings and proposals for future research.
Depression classification model
The development process of a new depression classification model
We herein propose a three-step process for detecting and analyzing social media users' depression. As shown in
Figure 1, the three stages are grouped into a training part, in which the models used for depression detection are trained, and a prediction part, which predicts depression using the trained models. The "sentence classifier training phase" (SCT) and the "depression classifier training phase" (DCT) make up the training part, and the "user's depression classification phase" (UDC) is the prediction part. First, we train two sentence classifiers in the SCT phase: the Y/N classifier and the 0–9 classifier, both of which classify sentences based on the symptoms of depression in the PHQ-9. The Y/N classifier determines whether or not a sentence is related to depression, and the 0–9 classifier determines which question(s) of the PHQ-9 a sentence is related to. The 0–9 classifier assigns a sentence to one of 10 categories, where classes 1–9 correspond to the nine PHQ-9 symptoms and class 0 covers depression-related sentences that match none of them.
Next, in the DCT phase, a logistic regression classifier is trained to determine whether or not a user is depressed. Training this classifier requires a set of user-generated social media text data and each sample user's PHQ-9 score. The target variable of the logistic regression classifier is the likelihood that the user is depressed: 1 if the user's PHQ-9 score is ≥5, and 0 otherwise. Finally, the UDC phase uses the previously trained sentence classifiers and the depression classifier to predict a given user's depression.
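The target construction described above can be sketched as follows (a minimal illustration; the function name is ours):

```python
def depression_label(phq9_score: int) -> int:
    """Binary target for the depression classifier:
    1 if the PHQ-9 total score is >= 5 (at least mild
    depression, per the study's cutoff), else 0."""
    return 1 if phq9_score >= 5 else 0
```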
Sentence classifier training phase
To train the sentence classifiers, social media texts describing daily life were collected from the Internet. The collected text data were separated into sentences, which were then preprocessed by removing stop words and spell checking. Three people who had received data-labeling training then read each sentence independently and assigned a Y/N label: "Y" when the sentence was related to depression and "N" when it was not. For each sentence related to depression ("Y"), labels from "0" through "9" were assigned according to the PHQ-9 symptom(s) it reflected. If the assigned labels differed among the three people, the final labels were determined through discussion among them.
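The stop-word removal step might look like the following minimal sketch (the `stopwords` set is an assumed input; the study's actual pipeline also included Korean spell checking, which is omitted here):

```python
import re

def preprocess(sentence: str, stopwords: set) -> str:
    """Minimal preprocessing sketch: lowercase, keep word
    characters only, and drop stop words. Python's \\w is
    Unicode-aware, so this also tokenizes Korean text."""
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(t for t in tokens if t not in stopwords)
```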
To train the sentence classifiers, BERT, Word2Vec, and Unicode word embedding methods were used, and the applied classifiers were NBC, SVM, RNN, LSTM, CNN, and BERT. Finally, the model with the highest accuracy was selected as the final model.
Depression classifier training phase
To train the depression classifier, each user's PHQ-9 score and 2 weeks of social media text data were necessary. We collected social media text data from 30 adults who, based on their PHQ-9 scores, were judged to have depression and 30 adults who were not. The collected textual data were preprocessed in the same way as in the SCT phase and then classified using the trained Y/N classifier and 0–9 classifier. The ratio of each user's classified sentences was calculated from the number of sentences classified by the Y/N and 0–9 classifiers and the total number of sentences the user had written. A logistic regression classifier was then trained with depression as the dependent variable and the ratio of each label (Y/N and 0–9) as independent variables. To determine the final logistic classifier, statistically significant coefficients were selected stepwise based on the variance inflation factor (VIF) and p-value of each variable. Due to the small number of users (60 adults), we performed fivefold cross-validation to improve the reliability of the results.
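The ratio features described above can be sketched as follows (a minimal illustration; the function and key names are ours, following the paper's Ratio_D/Ratio_n notation):

```python
def label_ratios(sentence_labels, num_classes=10):
    """Compute per-user ratio features for the depression
    classifier. `sentence_labels` holds the 0-9 classifier
    output for each sentence, or None for sentences the Y/N
    classifier marked as unrelated to depression (N).
    Ratio_D = depression-related sentences / total sentences;
    Ratio_n = sentences classified as symptom n / total."""
    total = len(sentence_labels)
    ratio_d = sum(1 for l in sentence_labels if l is not None) / total
    features = {"S": total, "Ratio_D": ratio_d}
    for n in range(num_classes):
        features[f"Ratio_{n}"] = (
            sum(1 for l in sentence_labels if l == n) / total
        )
    return features
```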
User’s depression classification phase
Each user's social media text data for 2 weeks were used as input. These data were preprocessed in the same way as in the SCT phase and then classified using the trained Y/N classifier and 0–9 classifier. Based on those classification results, the input variables of the logistic regression classifier were calculated, and through the logistic classifier the user was categorized as depressed or not.
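As a minimal sketch of this final step, assuming the fitted model is supplied as a dictionary of coefficients over the selected ratio features (all names and values in the usage below are illustrative, not the study's fitted ones):

```python
import math

def predict_depressed(features, coefs, intercept, threshold=0.5):
    """Apply a trained logistic regression classifier to one
    user's ratio features. `coefs` maps the selected feature
    names (e.g. "Ratio_1") to their fitted coefficients."""
    z = intercept + sum(coefs[name] * features[name] for name in coefs)
    p = 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
    return p >= threshold
```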
Experiments
Experimental designs
The experiment was approved by Hanyang University Institutional Review Board (IRB) and the approval number was HYUIRB-202008-001. Participants were informed of the detailed experimental purpose and procedure, and written consent was obtained.
Training the sentence classifier based on depression symptoms: The sentences to be labeled were collected from the representative Korean blog sites Naver Blog, Naver Cafe, and Daum Cafe. Sentences not related to depression were collected from everyday Naver Blog posts, and sentences deemed related to depression were collected from Naver's and Daum's depression-related cafes.
For each user posting, we collected the user ID, URL, upload time, title, and content. We collected 23,115 documents and separated them into sentences using the Python library Korean Sentence Splitter (KSS), yielding 249,103 sentences in total. The collected data were labeled in two steps. First, each sentence was labeled according to whether it was related to depression (Y) or not (N). When a sentence was labeled Y, a second label indicated which of the nine PHQ-9 symptoms the sentence corresponded to; a 0 label was added for Y sentences that corresponded to none of the PHQ-9's 1–9 symptoms. Following these rules, three workers with basic knowledge of NLP and the PHQ-9 labeled each sentence independently: each worker read the given sentence, assigned Y/N according to whether it was related to depression, and, if Y, assigned a category between 0 and 9 according to the PHQ-9 criteria. When the three workers' labels for a sentence were inconsistent, the ground truth was set by majority vote; if all three workers disagreed, the ground truth was determined through discussion among them.
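The label-resolution rule can be sketched as follows (a minimal illustration; a three-way disagreement returns None, standing in for the discussion step):

```python
from collections import Counter

def resolve_label(labels):
    """Resolve three annotators' labels for one sentence.
    Returns the majority label when at least two annotators
    agree, or None when all three disagree (such cases were
    settled by discussion in the study)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None
```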
Once the collected data were labeled, we found a significant data imbalance: there were far fewer sentences reflecting depression than sentences that did not. The imbalance was even more severe across the PHQ-9 symptoms: among the sentences tagged Y for depression, very few were labeled with symptoms other than 0, 1, or 9. To resolve these imbalances, under-sampling was performed on the sentences not related to depression (tagged N). The details of the final dataset after under-sampling are shown in
Table 1.
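The under-sampling step might look like the following minimal sketch (random under-sampling of the majority class down to the minority-class size is an assumption; the paper does not specify the exact sampling scheme or target class sizes):

```python
import random

def undersample(sentences, labels, majority_label="N", seed=42):
    """Random under-sampling sketch: reduce the majority class
    (sentences unrelated to depression, label "N") to the size
    of the minority class, keeping all minority sentences."""
    rng = random.Random(seed)
    majority = [i for i, l in enumerate(labels) if l == majority_label]
    minority = [i for i, l in enumerate(labels) if l != majority_label]
    keep = rng.sample(majority, min(len(majority), len(minority)))
    idx = sorted(minority + keep)
    return [sentences[i] for i in idx], [labels[i] for i in idx]
```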
Training the depression classifier: For this experiment, we recruited blog users who had written daily online articles during the 2-week selection period. A total of 60 adults over the age of 19 were selected: 30 who were considered depressed and 30 who were not. Whether a person was currently experiencing depression was determined from the PHQ-9 test administered at the application stage; since the PHQ-9 classifies a score of 5 or more as at least mild depression, we used a cutoff of 5 points (≥5 = depressed). To ensure the reliability of the PHQ-9 results, we administered a second PHQ-9 test three days after the first one. Because the PHQ-9 diagnoses depression based on symptoms over the preceding 2 weeks, we collected all textual data from the users' blogs over the 2 weeks prior to the PHQ-9 test date. The collected data were divided into sentence units, preprocessed, and organized by the experimenters.
These data were used to train and evaluate the logistic regression classifier with fivefold cross-validation. First, a baseline depression classifier using only the Y/N classifier (without the 0–9 classifier) was created for comparison with the proposed logistic regression classifier. Then, after training each logistic regression classifier, the accuracies of the two models were compared.
We used fivefold cross-validation here because the dataset was small; cross-validation helps ensure the reliability of the model's performance estimates. We also chose fivefold rather than the commonly used 10-fold cross-validation to secure a sufficiently large test set: with 10-fold cross-validation each test set would have contained only 6 users, whereas fivefold cross-validation yields a relatively sufficient size of 12. The main experiment was conducted with fivefold cross-validation, and experiments with 10-fold and threefold cross-validation were also conducted; their results are presented in the Appendix.
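The fold construction can be sketched as follows (a minimal illustration of k-fold index splitting; the study's actual folds may have been stratified or built with a library routine):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split n sample indices into k cross-validation folds.
    Returns a list of (train_indices, test_indices) pairs.
    With n = 60 users and k = 5, each test fold holds 12 users
    and each training set holds the remaining 48."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint slices
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```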
Experimental results
Performance of the sentence classifier: To find the best-performing sentence classifiers, we conducted experiments with various embeddings and classification algorithms. NBC, SVM, RNN, LSTM, BiRNN, and BiLSTM classifiers were trained with Word2Vec embedding, and CNN 1D and CNN 2D were trained with Unicode embedding. In the case of BERT, the classifier was implemented by adding a linear layer to the last layer of KoBERT, a Korean BERT released by SKT. As shown in
Figure 2, the experimental results showed that BERT classifiers were the best for both Y/N and 0–9 sentence classification.
In
Table 2, the accuracy of the BERT-based Y/N sentence classifier was 93.68%. The precision of N was 96%, which was greater than the precision of Y, while the recall of Y was 96%, which was greater than that of N (91%).
In
Table 3, the accuracy of the BERT-based 0–9 sentence classifier was 83.29%. However, because the accuracy of the upstream Y/N sentence classifier was 93.68%, the effective combined accuracy was 93.68% × 83.29% = 78.02%.
In addition, as shown in
Figure 3, the F1-scores differed among the symptoms, reflecting how distinct each symptom's textual features are. Class 0, which gathers depression-related sentences corresponding to none of symptoms 1 through 9, mixes various expressions and thus has no clear characteristics, resulting in the lowest F1-score. Symptom 1 (depressed mood), symptom 2 (reduced interest), and symptom 5 (psychomotor agitation or retardation) also had low F1-scores because they were difficult to extract from textual data. On the other hand, F1-scores were high for symptom 3 (significant weight loss), symptom 4 (insomnia), and symptom 9 (thinking about or attempting suicide), because these were relatively easy to identify in textual data.
Performance of the depression classifier: The baseline depression classifier had two variables: S (the number of sentences) and Ratio_D (the ratio of sentences related to depression to the total number of sentences). As shown in
Table 4, Ratio_D was selected in three of the five folds after variable selection. Although Ratio_D was not selected as a significant variable in the other two folds, it might well be selected consistently with a larger dataset.
On the other hand, the proposed depression classifier had three kinds of variables: S (the number of sentences), Y (the number of sentences acknowledged as depression-related), and Ratio_n (the number of sentences classified as the nth symptom divided by the total number of sentences). Variable selection was performed for each of the five folds. The results show that Ratio_1, Ratio_2, Ratio_3, and Ratio_6 were significant variables. The coefficients of Ratio_1 and Ratio_6 were positive, while those of Ratio_2 and Ratio_3 were negative. That is, a higher proportion of sentences on symptoms 1 and 6 increases the predicted probability of depression, whereas a higher proportion of sentences on symptoms 2 and 3 reduces it. The absolute value of Ratio_2's coefficient was considerably larger than those of the other variables, meaning the prediction is more sensitive to the ratio of sentences on symptom 2 than to the other symptom ratios.
The average accuracy of the proposed depression classifier was 68.3%, 15 percentage points higher than that of the baseline depression classifier (53.3%). As shown in
Figure 4, in every fold of the fivefold cross-validation the proposed depression classifier was more accurate than the baseline. Therefore, the user's depression could be classified more accurately by adding the label-specific ratios obtained through the 0–9 sentence classifier rather than using only the Y/N sentence classifier.
Discussion
This study aimed to determine whether a user's depression can be predicted from text written on social media. Our results indicate that this is possible using NLP and machine learning techniques. The study thus contributes to early depression identification, a significant step in the treatment of depression, and the methodology described here can be applied without the conscious participation of the user.
There are currently many mental health online applications (“apps”) that can automatically analyze users’ emotions and detect mental disorders, and the model proposed here can be included in many of them. In the case of mental illness, it is important to constantly scan for mental conditions and get professional help to prevent mental deterioration. Therefore, our model can be used for mental health care services and apps for people suffering from mental illness. Although it is necessary that some users provide their PHQ-9 scores and their social media text for the training purpose, after the training phase, the classifiers can be used to determine whether other users are depressed or not solely based on their social media text.
In addition, if there were systematic disease indicators for various diagnoses, more diverse mental disorders could be analyzed online in similar ways. For example, in this study, we used the PHQ-9 but we might also be able to create other models using BDI,
8 SDS,
25 CES-D,
9 and GDS.
11 In addition to depression, self-diagnoses of panic disorder, anxiety disorder, stress, bipolar disorder, etc. can be established from social media texts.
This study simply tries to classify whether a user has depression or not. However, in future research, we can extend our model by combining various technologies, which will be more helpful for the early detection of depression and preventing it from worsening. Another possible future avenue is “explainable artificial intelligence” (XAI), which is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms.
75 If the development of technology allows us to identify the causes of depression through XAI, it would contribute to the improvement of mental health through customized treatment and emotional management.
Conclusion
In this study, we created a model to determine whether or not social media users are depressed, by analyzing their past social media texts. The proposed model consists of three classifiers: the Y/N sentence classifier which determines whether or not a text sentence is related to depression, the 0–9 sentence classifier which classifies a text sentence according to the depression symptomology in the PHQ-9, and the Depression classifier, which ultimately establishes whether or not a social media user is potentially depressed. To improve the sentence classification accuracy, we tried various text classification algorithms; among them, BERT-based classifiers showed the best performance for both the Y/N and 0–9 sentence classifiers. In particular, the accuracy of the sentence classifier of Yazdavar et al.,
13 which is the basis of this paper, was 68%, whereas our sentence classifier achieved 83.29% accuracy, approximately 15 percentage points higher. Of course, since the two were not evaluated on the same dataset, the performances are difficult to compare directly. It would also be necessary to verify the proposed approach on other datasets; however, no open depression datasets are currently available for such verification. Lastly, the depression classifier, a logistic regression model, showed that sentence classification based on the PHQ-9 helps improve prediction accuracy.
The most significant limitation of this study was that social media text data from only 60 users were available for training the depression classifier. Fivefold cross-validation was performed to mitigate this, but with more data the model could have been trained more stably and evaluated without resorting to k-fold cross-validation. Another limitation is that the proposed model performs only a binary classification of whether or not a user is depressed. Finally, the methodological contribution of this paper is limited because its main purpose was to improve the performance of the depression classifier on users' social media text by applying state-of-the-art NLP techniques. Although the proposed approach substantially improved the depression classifier's performance, future studies can improve it further using emerging advanced techniques.