1. Introduction
Deep learning [
1,
2,
3,
4,
5,
6,
7] has been used to learn prognostic subtypes of glioblastoma using pan-cancer gene expression data from The Cancer Genome Atlas (TCGA) [
8], predict drug synergy based on cancer cell gene expression data [
9], and predict survival based on multi-omics integrated data in liver cancer [
10] etc. [
11,
12,
13]. It is important to explain the model in a meaningful way to understand the deep learning model and its limitations. A typical deep learning model involves millions of parameters, which makes it a difficult task to understand. We propose to use feature importance ranking within the deep learning model. While feature importance ranking is popular with machine learning [
14], its use within the deep learning model is rare, especially in cancer genomics where a model usually includes thousands of features. In this paper, we expand the permutation feature importance techniques to deep learning. Our goal is to study the inside of a trained deep learning model to discover prognostic gene features in glioblastoma.
Conventional machine learning approaches have been used to determine the gene expression that are prognostic to glioblastoma patient survival [
15,
16,
17]. Glioblastoma gene expression regression modeling using the least absolute shrinkage and selection operator (Lasso) flavor strategy performed better when cancer pathway genes were used as input variables compared to whole genome input [
16]. However, these types of penalized regression methods often require dropping a large number of genes in order to fit the survival outcome, hindering biological pathway interpretation and introducing random bias toward the selected gene factors. Deep learning offers the capacity to model a large number of differentially expressed genes, is less susceptible to multicollinearity problem, and generalizes better. Deep learning based on transcriptome data has only recently been used to determine the primary effects of gene features that are prognostic to survival of glioblastoma (GBM) or other cancer types [
8,
18]. However, few studies work on what was learned in these deep learning models.
Using differentially expressed genes from the GBM-specific TCGA database as inputs, we tested the hypothesis that deep learning can model the relationship between specific genes and the corresponding protein effect to predict patient survival prognosis. Like most of the deep learning models, our model learned a set of features at the last hidden layer, which in this case linearly modulates the survival risk of patients. We hypothesize these features contains the key factors that determine patients’ overall survival for those who have gone through standard of care therapy, from surgery to chemotherapy, without undergoing targeted therapy.
2. Results
2.1. Deep Learning Model
We trained and optimized the deep learning model to generate a network architecture consisting of an input layer feeding to two hidden layers (82 nodes then 27 nodes) and connected to one single output node to predict patient survival prognosis.
The validation concordance index is 0.69, the corresponding training concordance index is 0.73, and the validation concordance index of each sample partition is evaluated to be with a mean ± 1 SD concordance index of 0.70 ± 0.07. The out-of-sample testing concordance is 0.63 and is within the uncertainty of the validation concordance. The 95% confidence intervals of all testing, validation, and training concordance indexes do not include 0.5.
Important genes that contribute to the overall deep learned model were identified according to the input permutation method. A frequency analysis of genes occurring in the last hidden layer is listed in
Table 1. The top 10 ranked genes are either known to be associated with glioblastoma survival, glioblastoma cancer cell migration, or glioblastoma cancer stem cells, or are known to be related to other types of cancer with mechanistic significance. The top 10 genes are
TNR,
GAD1,
TMSB15B,
POSTN,
SCG3,
PLA2G2A,
NNMT,
CHI3L1, and
ELAVL4. The ranked gene importance at each network node is available in
Supplementary Table S1. The entire frequency analysis table is available in
Supplementary Table S2.
In the deep-learned model, no proportional hazard assumption is used. Since the last hidden layer network node outputs are combined linearly with a weight vector to predict patient survival prognosis, the weight vector determines the relative risk from each of the 27 network nodes. Putting these network nodes in a Cox proportional hazard model showed a model concordance of 0.71, with 14 out of 27 network nodes statistically significant using Wald test at
p < 0.05; only network node 10 is stratified for all network nodes to satisfy the proportional hazard assumption at
p < 0.05 (
Table 2).
2.2. Deep Learning Model Performance Comparison with Penalized Cox Regression Models
Using the training dataset, the k-fold cross validation training resulted in the ridge, adaptive Lasso, and elastic net Cox regression models. These models perform consistently well with concordance indexes at 0.70, 0.69, and 0.70, respectively and are comparable to the deep learning model in performance. In the testing dataset, however, ridge, adaptive Lasso, and elastic net Cox regression models resulted in concordance indexes of 0.58, 0.56, and 0.56, with the 95% confidence intervals of the model concordance index including 0.5. In another word, these models performed close to random and failed to predict survival in the testing dataset, which could be due to the challenge of prediction in a small dataset with 49 patients. This is in sharp contrast to the deep learning model which still performs well in the small testing dataset.
2.3. Network Node Parameters Improved the Baseline Cox Proportional Hazard Model
A baseline Cox proportional hazard survival model constructed with clinical covariates, including age, gender, Karnofsky Performance Status (KPS), tumor subtypes, and therapy options performed at a concordance value of 0.68. Age, gender, KPS, chemoradiation, and proneural subtype all reached statistical significance using Wald test at p < 0.05, with no parameters violating the proportional hazard assumption at p < 0.05. Since most patients with the Glioma CpG island methylator phenotype (G-CIMP) mutation also harbor the IDH1 heterozygous Arg132-to-His (R132H) point mutation, the multicollinearity between these two variables (correlation coefficient = 0.845) led to large standard error on the parameter estimation, affecting statistical significance.
After adding all deep-learned network nodes, the baseline Cox model (
Table 3) concordance index increased from 0.68 to 0.76 with age, KPS, chemoradiation, and proneural subtype statistically significant (Wald test,
p < 0.05); four risk-increasing network nodes (8, 17, 24, and 25) and four risk-decreasing nodes (4, 10, 16, and 23) were also statistically significant (Wald test,
p < 0.05), with all covariates satisfying the proportional hazard assumption.
2.4. Prognostic Significance Validation of Gene Set with External Data
The chance of a gene deemed important to a small number of network nodes is much higher than it is to a large number of nodes, as shown in
Figure 1. We used a probability threshold of
p < 0.01 and selected a set of genes that are important in at least 10 out of 27 network nodes. These important genes form a 39-gene signature and they are listed in
Table 4. Using this gene signature, statistical significance (log-rank test,
p < 1.5 × 10
−6) was achieved in seven glioblastoma studies and three low-grade glioma studies in separating the low-risk and high-risk groups with mean ± 1 SD concordance index of 0.79 ± 0.09. The Kaplan–Meier survival curves of the low-risk and high-risk groups in these studies are shown in
Figure 2. There are no notable pathway enrichment in the 39-gene signature.
In the discovery process, two particular genes stand out with high hazard ratio (HR) and concordance index (CI) in univariate Cox model, AQP1 (HR = 3.3, p < 0.05, CI = 0.711) and MEOX2 (HR=2.5, p < 0.05, CI = 0.675), both of which are lesser known to be associated with survival in gliomas. Both genes are statistically significant and independently prognostic to survival in the multivariate Cox model (MEOX2 HR = 2.2, AQP1 HR = 1.55, both p < 0.05, CI = 0.796) in the TCGA GBMLGG dataset.
3. Discussion
Prior reports of deep learning model in cancer research [
8,
18,
19] were derived from multiple cancer types to increase sample numbers, but fall short in identifying genes or primary mutations of mechanistic interest. A recent deep feedforward network studies used gene expression feature selection and outcome classification in TCGA breast cancer data as well as in TCGA kidney renal cell carcinoma [
20]. In that study, only importance features to the model output were discovered and a number of biologically relevant features were found.
In this paper, we expand the input permutation method for feature importance ranking in deep learning network. Input permutation method is a useful technique for feature importance ranking in machine learning [
14] and is broadly applicable to various models. It is usually used to rank feature importance at the model output. Expanding this method to rank features at any hidden layer within a deep learning model opened up many possibilities. It solidifies our understanding of the model and helps in explaining the deep learning model, which is usually considered a black box. In the proposed GBM model, we found that the last hidden layer contains important features that are far more biologically relevant than those obtained from the output layer.
Compared with gene-signature from previously identified GBM molecular subtypes (classical, proneural, and mesenchymal) [
17,
21,
22], our 39-gene signature has a small overlaps with the 840-gene signatures in classical (8 out of 210), proneural (6 out of 210), and mesenchymal (5 out of 210) subtypes, or 12 genes overlap among all subtypes. The other 27 prognostic genes are likely due to shared biological mechanism(s) among tumor subtypes that are crucial to patient survival. The improvement in model concordance from 0.68 to 0.76 with the addition of deep learned network parameters confirmed the prognostic value of these additional genes in additional to known tumor subtype. It is also quite remarkable that the 39-gene signature learned from one GBM dataset is able to stratify patients in several other low grade glioma datasets as well as other GBM datasets. In addition, two lesser known genes,
AQP1 and
MEOX2, are discovered to be prognostic to gliomas patients overall survival through the deep learning approach.
Our deep learning features revealed many genes of interest to glioblastoma stem cells mechanism. For example, both glutamate decarboxylase 1 (
GAD1) and Chitinase 3 Like 1 (
CHI3L1/YKL-40) have been recently identified as targets of Notch inhibitors (alpha secretase and gamma secretase inhibitors) in treating glioblastoma stem cells. Notch inhibitors work via Notch binding to YKL-40 and leukemia inhibitory factor (LIF) promotors and increased survival in a GBM stem cell orthotopic mouse model [
23]. Epigenetic upregulation of glutamate decarboxylase 1 (
GAD1) has been shown to program the aggressive features of cancer cell metabolism in brain metastatic microenvironment [
24]. Chitinase 3 Like 1 (
CHI3L1/YKL-40) was also reported previously to be prognostic to glioma patient survival [
25] and involved in the angiogenesis, radioresistance, and progression of glioblastoma in vivo [
26]. Periostin (
POSTN) has been shown to impact GBM stem cell tumorigenicity and GBM patient survival [
27].
POSTN is secreted by glioblastoma stem cells to recruit tumor-associated macrophages in order to promote malignant growth [
28] and regulate tumor resistance to anti-angiogenic therapy [
29]. Nicotinamide N-methyltransferase (
NNMT) was also reported to regulate mesenchymal glioblastoma stem cell maintenance by depletion of methionine and shift tumor towards a mesenchymal phenotype and accelerated tumor growth [
30].
NNMT was reported to be a prognostic marker for glioblastoma [
31], inhibiting tumor suppressor protein phosphatase 2 (PP2A) at the epigenome and proteome level and concomitantly activates prosurvival serine/threonine kinases. Receptor tyrosine-protein kinase ErbB-3 (
ERBB3) is known to mediate glioblastoma cancer stem-like cell resistance to EGFR inhibition [
32].
Brevican (
BCAN) which is known to bind to tenascin-R (
TNR) with high affinity [
33], is highly expressed in gliomas, initiating cells’ extracellular niche in human GBM tumors and is expressed by glioma initiating cells in vitro [
34]. Though
BCAN knockdown does not affect glioblastoma initiating cell viability in vitro [
34], it promotes glioma cell adhesion and migration in vitro [
35] and the knockdown of the gene inhibits both cell motility in vitro and tumorigenicity in vivo [
35]. Tenascin-C (
TNC) and tenascin-R (
TNR) are two of the three members of the tenascin family of extracellular matrix glycoproteins. TNC (which occurred in four networks) has been shown to promote glioblastoma invasion [
36] and is heavily involved in pro-angiogenic and anti-angiogenic signaling in glioblastoma [
37], as well as having an impact on survival [
38]. A strong
TNR expression is linked to non-invasive brain tumor (pilocytic astrocytomas) whereas a weak expression is detected in glioblastoma [
39]. The exact role of
TNR, particularly in glioblastoma stem cells extracellular niche, is a subject worth exploring.
On the other hand, ELAV-like RNA binding protein 4 (
ELAVL4) has been shown to modulate radiation sensitivity in vitro in non-small cell lung cancer [
40]. Secretogranin III (
SCG3) has been shown to be involved in anti-angiogenesis in diabetic retinopathy [
41]. Secreted phospholipase A2 group IIA (
PLA2G2A) was shown to induce phosphorylation of the
EGFR to induce proliferation through a PKC-dependent pathway in human astrocytoma in vitro [
42]. Interestingly, the thymosin beta 15B (
TMSB15B) is involved in epidermal growth factor-induced migration of prostate cancer cells [
43].
Deep learning models that are fully connected and have high dimensional inputs are notoriously difficult to train due to their large number of variables. In our case, the number of variables, about 300,000, is much larger than the number of cases (n = 492). Our model is able to generalize on out-of-sample patient cases and has comparable performance in validation concordance index across different data sample splits. Due to the limitation of a single gene chip platform used in this study, its out-of-sample performance on another chip platform may need evaluation.
There are other limitations on the performance of deep learning based on differentially expressed genes. For instance, with 2-fold change cutoff used in this study, we may be artificially removing genes that are prognostic to survival. In addition, the relatively small number of patient samples in this study limits the depth of the model.
Finally, deep learning without explicit biological knowledge or network architecture constraints is not expected to learn biological structure within the data, so care must be taken with biological interpretation. In our case, we identify the prognostic genes by identifying genes with disproportional impact to last hidden layer and the occurrence frequency. Lastly, using deep learning to extract prognostic differential expressed genes for survival prediction provides a flexible way to combine gene expression data with other clinical covariates such as age, KPS, therapy options, and tumor subtypes to enable better patient survival stratification.
4. Materials and Methods
4.1. Gene Expression Data Analysis
The Cancer Genome Atlas (TCGA) is a publicly repository with patient-derived clinical, imaging, and genomic data that has been deidentified and contains no linkage to patient identifiers, no institutional review board or Health Insurance Portability and Accountability Act approval was required for our study. Microarray data from untreated glioblastoma patients (
n = 492) were retrieved from TCGA. Gene expression level 1 data from the Affymetrix Human Genome U133A platform were used. The data were processed by software script using R (version 3.2,
https://www.R-project.org/, Vienna, Austria), affy [
44] and affycoretools [
45] packages, and quality control was conducted using affyQCReport package. The probe level data were normalized by gcrma [
46] package to control for batch variations, and the probe level expressions are compared to the normal brain tissue group (
n = 10). Statistically significant gene probe changes were selected with a threshold of
p < 0.01 with at least 2-fold biological change adjusted for multiple comparisons using the beta uniform mixture model [
47]. Genes with a Pearson correlation coefficient higher than 0.8 were represented by one gene to reduce the strong intrinsic correlation. The number of gene probes was reduced to 3581 and used as model inputs.
Clinical data such as age, gender,
MGMT methylation, G-CIMP,
IDH1/2 mutation, cancer subtype, and therapy information were retrieved from a TCGA publication [
22].
4.2. Deep Learning Model
The deep learning model was built using Tensorflow 1.3 and Python 3.6 platforms using the deep learning survival modeling framework [
48]. Gene expression dataset is randomly partitioned into ten equal partitions, with the first as the testing set, the second as the validation set, and the remaining as training sets. Stratified sampling was used to preserve the survival time distribution among each data partitions to fully capture the heterogeneity from multiple sites. The testing set contained samples from 11 sites whereas the validation set contained samples from 12 sites. The network structure consists of an input layer, one/two hidden layers with rectifier linear unit (Relu) functions, and an output layer with a single node corresponding to the survival prognosis of each patient. The partial likelihood function was used as a loss function and an
L2 penalty was applied on all network weights to prevent overfitting, [
49] retaining the advantage of interpretation like Cox’s proportional hazard. Batch-normalization [
50] was used in each layer to improve learning stability.
Network structures, including number of hidden layers and number of hidden nodes, are varied to arrive at different models with a maximum of two hidden layers. Hyperparameter tuning on hidden layer(s), nodes and learning rate used concordance index as performance criteria in both the training and validation datasets. A maximum of 1000 epochs were allowed for computation convergence. The optimum hidden layers and nodes were determined by the maximum concordance index in the validation dataset; parameters were stored as a model for testing.
The gene expression dataset was randomly split to training, validation, and testing datasets with ratios of 80%, 10%, and 10%, respectively, while preserving the distribution of survival time to maximize the available datasets for training. The performance variability of the deep learning model was tested by rotating each sample partition as a validation dataset while using the rest for training. After the validation statistics were evaluated, its performance was tested on an out-of-sample data set that was never used in the modeling process.
4.3. Deep Learning Model Performance Comparison with Penalized Cox Regression Models
A comparison was made between deep learning model and penalized Cox regression models, including ridge, adaptive Lasso [
51], and elastic net [
52] using the glmnet package [
53] (version 2.0.12).
Cox regression method assumes a semi-parametric hazard form of:
where
is the hazard for patient I at time t,
is the baseline hazard, and
is a fixed length vector of length p. In penalized Cox regression, models are fitted by maximizing the penalized partial log-likelihood function. The penalized partial log-likelihood function is given by:
where
is the penalty function with tuning parameters
λ and
α.
For ridge regression, the penalty function takes the form:
For adaptive Lasso regression, the penalty function takes the form:
where
.
the initial estimated of
, which in this case is estimated by ridge regression.
For elastic net, the penalty function takes the form:
where
.
Model performances were evaluated using concordance index and the same survival time stratified training/validation/testing datasets are used as in the deep learning model for a fair comparison. A 9-fold cross validation was used. A minimum lambda model was chosen as the lamda.1se models are not numerically stable. Feature importance of Lasso, adaptive Lasso, and elastic net methods was identified by the absolute amplitude of regression coefficients.
4.4. Impact of Deep Learning Network Features on Baseline Survival Model
The hidden nodes’ outputs are Relu functions that are positive or zero, functioning like an on/off switch that allows effects from a previous level of interacting genes to pass through. Whether these network signatures provide distinct or complementary prognostic value to the baseline survival model was evaluated using Cox proportional hazard model.
A baseline Cox model was constructed from clinical covariates, tumor subtypes, therapy options and genetic mutation/methylation status known to be associated with patient survival. Clinical and genetic mutation/methylation covariates include age, Karnofsky performance scale, gender, IDH1/2 mutation status, MGMT, G-CIMP methylation status, tumor subtypes (proneural, mesenchymal and classical) as well as chemotherapy/radiation/chemoradiation therapy options. To evaluate the additional prognostic value of deep learning network nodes, they were added to the baseline Cox proportional hazard model. The contribution of these deep learning features in improving survival prediction over the baseline model was evaluated using the concordance index.
4.5. Identifying Important Genes in Deep Learning Model
To identify the genes important to survival, we permuted the input genes one gene at a time to break the correlation between the input gene and the output risk [
54]. An important gene that contributes significantly to the overall model when permuted across the patient group will impose a significant change to the predicted patient survival, whereas an unimportant gene will not. The process was repeated five times and the average change in patient risk was used. The high impact genes were identified as those that affect predicted patient survival outside the 95% confidence interval of the average change due to single gene permutation.
4.6. Prognostic Significance Validation of Gene Set with External Data
To compare the prognostic significance of our gene set in predicting survival in glioblastoma patients, we evaluated it in seven glioblastoma and three low-grade glioma studies (
Table 5) using Cox proportional analysis with SurvExpress platform [
55]. The samples were split by the median of the prognostic index to designate low-risk and high-risk groups. The top ranked 39 genes, which correspond to
p < 0.01 or equivalently any gene occurring at least 10 times in the 27 network nodes, were chosen as gene biomarkers. The gene biomarkers were evaluated for survival difference between the low-risk and high-risk groups using log-rank test. The prognostic index is the linear component of the exponential function in the Cox model.