Volume 29, Issue 11 p. 2150-2163
Review

Substitution scoring matrices for proteins - An overview

Rakesh Trivedi

Corresponding Author

Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, India

Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, India

Correspondence

Rakesh Trivedi, Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics (CDFD), Uppal, Hyderabad, Telangana 500039, India.

Email: [email protected]

Contribution: Conceptualization, Writing - original draft, Writing - review & editing

Hampapathalu Adimurthy Nagarajaram

Laboratory of Computational Biology, Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India

Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, India

Contribution: Conceptualization, Writing - original draft, Writing - review & editing

First published: 21 September 2020

Abstract

Sequence analysis is the primary and simplest approach to discovering structural, functional and evolutionary details of related proteins. All alignment-based approaches to sequence analysis make use of amino acid substitution matrices, and the accuracy of the results depends largely on the type of scoring matrix used to perform the alignment. An amino acid substitution matrix is a 20 × 20 matrix in which the individual elements encapsulate the rates at which each of the 20 amino acid residues in proteins is substituted by other amino acid residues over time. In contrast to most globular/ordered proteins, whose amino acid composition is considered standard, there are several classes of proteins (e.g., transmembrane proteins) in which certain types of amino acids (e.g., hydrophobic residues) are enriched. These compositional differences among various classes of proteins are manifested in their underlying residue substitution frequencies. Therefore, each compositionally distinct class of proteins or protein segments should be studied using specific scoring matrices that reflect its distinct residue substitution pattern. In this review, we describe the development and application of various substitution scoring matrices specific to proteins with standard and biased compositions. Along with the most commonly used standard matrices (PAM, BLOSUM, MD and VTML), which act as default parameters in various homolog search and alignment tools, different substitution scoring matrices specific to compositionally distinct classes of proteins are discussed in detail.

Abbreviations

  • AA2AR: adenosine A2A receptor
  • bbTM: beta-barrel Transmembrane Matrices
  • BLOSUM: BLOcks SUbstitution Matrix
  • EDSSMat: Eukaryotic Disorder Substitution Scoring Matrix
  • GPCRs: G protein-coupled receptors
  • GPCRtm: G protein-coupled receptor transmembrane substitution matrix
  • Hubsm: Hubs protein-specific substitution matrix
  • JTT: Jones–Taylor–Thornton
  • LOD: log odd
  • MSAs: multiple sequence alignments
  • PAM: Point Accepted Mutation
  • PfSSM: Plasmodium falciparum Specific Substitution Matrix
  • PHAT: Predicted Hydrophobic And Transmembrane matrix
  • PPINs: protein–protein interaction networks
  • S1PR1: sphingosine-1-phosphate receptor 1
  • SLIM: Scorematrix Leading to Intra-Membrane domains
  • TM: transmembrane

    1 INTRODUCTION

    The theory that proteins with similar amino acid sequences have similar structures, and/or are related by descent, has been conclusively proved by earlier studies. As a result of these findings over the years, sequence comparison has evolved as a method of choice among biologists to get insights about homology and structure of proteins whose sequences are known. This sequence-structure generalization has received continuous support from the increased number of sequenced proteins with unsolved structures.

    Alignment-based approaches to protein sequence comparison consist of two major parts: an algorithm to produce alignments, and an amino acid substitution matrix to score aligned pairs of residues.1 Over the years, there have been several attempts to develop new alignment algorithms by modifying the classical dynamic programming algorithm.2 In contrast to the traditional dynamic programming algorithm, these modified algorithms score different regions of the protein sequence using distinct substitution matrices,3-5 and also penalize them differently.5, 6 Apart from the efforts to improve these alignment algorithms, there have also been continuous attempts to develop amino acid substitution matrices that search deeper into the evolutionary history of proteins.

    An amino acid substitution scoring matrix defines the rates at which the various amino acids in proteins are substituted by other residues over time. When compiling a scoring matrix, the selection of the protein dataset and of the residue substitution pairs used to calculate matrix scores is crucial. The dataset must be non-redundant, and the relatedness of proteins at a given level (class, family or superfamily) should be well defined.1 Moreover, functionally conserved amino acids of related proteins should be considered for the computation of substitution probabilities as they represent the real compositions in the proteins. However, the amino acid composition of proteins cannot be generalized at the level of the proteome.

    Based on the distinct residue composition, proteins or protein regions have been broadly classified as those with standard and non-standard/biased compositions. The amino acid composition of the large class of globular/ordered proteins or protein segments is generally considered as standard, whereas proteins or protein regions having enrichment of a specific type of residues belong to the category of non-standard/biased composition proteins. These compositional biases indicate that the underlying substitution frequencies of amino acids are also distinct. Hence, proteins or protein segments with standard and non-standard compositions should be treated differently during sequence analysis.

    The ability of tools to identify true homologs during database searches (pairwise alignment), and construct protein multiple sequence alignments (MSAs) largely depends on the quality of the substitution scoring matrices being used.7 Therefore, to achieve optimal alignment, composition-specific amino acid substitution scoring matrices should be used in combination with modified alignment algorithms. Amino acid substitution scoring matrices developed from proteins with standard composition, and which are widely used for the purpose of protein sequence alignment, are referred to as general purpose substitution matrices. However, scoring matrices developed from proteins or protein segments with biased/non-standard compositions, and which are used specifically for analysis of a particular class of proteins, are referred to as specialized substitution matrices. This review presents an overview of various substitution scoring matrices for proteins with a focus on different specialized scoring matrices (Table 1).

    TABLE 1. List of the most commonly used general purpose and various specialized substitution scoring matrices

    | Matrix class | Matrix | Description | References |
    |---|---|---|---|
    | General purpose matrices | PAM | Calculated from mutations observed across the entire length of closely related aligned proteins, which contain both highly mutable and highly conserved regions. PAM matrices are efficient in identifying close homologs. | (Dayhoff, 1978)15 |
    | | BLOSUM | Computed from amino acid substitutions observed in highly conserved ungapped alignment regions of evolutionarily divergent proteins. These matrices can be used for sequence analysis over a wide range of evolutionary scales. | (Henikoff and Henikoff, 1992)16 |
    | | MD | Modern PAM-based matrices derived from a large and diverse set of protein sequences. Efficient in identifying close homologs. | (Jones et al., 1992)17 |
    | | VTML | Developed by iteratively evaluating the substitution rates and evolutionary distances from a large, versatile dataset of randomly selected pre-aligned pairwise sequence alignments, using a maximum likelihood estimator and Dayhoff's model as the initial rate matrix. | (Muller and Vingron, 2000; Muller et al., 2002)18, 19 |
    | Specialized matrices | PfSSM | Matrix specific to the AT-enriched Plasmodium falciparum genome. Compared with the BLOSUM matrices, the PfSSM matrices are more sensitive during homolog searches and give rise to better quality alignments. | (Paila et al., 2008)34 |
    | | CBM & CCF | Matrices specific to the AT-enriched genomes of Plasmodium falciparum and Plasmodium yoelii. With respect to the BLOSUM series, the CBM and CCF matrices show improved homolog search performance within Plasmodia species. | (Brick and Pizzi, 2008)36 |
    | | JTT transmembrane | A generalized scoring matrix specific to integral membrane proteins, developed by considering the observed mutations in the transmembrane regions. In contrast to other general purpose matrices, the JTT transmembrane matrix aligns integral membrane proteins with higher accuracy. | (Jones et al., 1994)49 |
    | | PHAT | Matrix specific to α-helical membrane proteins. Predicted transmembrane and hydrophobic segments were used for computing target and background frequencies of residues, respectively. In homolog searches, PHAT performs significantly better than the JTT transmembrane and other generalized matrices for query sequences having transmembrane segments. | (Ng et al., 2000)60 |
    | | SLIM | Matrix specific to α-helical integral membrane proteins. Computation of the SLIM matrix is similar to that of PHAT, except that the background frequencies of VTML matrices are used as the background frequencies of database sequences. The SLIM matrix outperforms PHAT and BLOSUM matrices on both simulated and real datasets of G protein-coupled receptors. | (Muller et al., 2001)51 |
    | | bbTM | Matrix specific to β-barrel transmembrane proteins. The bbTM matrix is the average of the similarity scoring matrices of 7 non-homologous β-barrel membrane proteins and their homologs. Compared with the PHAT and SLIM matrices, bbTM matrices are more sensitive and specific in identifying remote homologs of β-barrel transmembrane proteins. | (Morales et al., 2008)52 |
    | | GPCRtm | Matrix specific to the rhodopsin family of G protein-coupled receptors. Curated alignments of transmembrane regions of rhodopsin family GPCR sequences were used to compute GPCRtm. The GPCRtm matrix is more sensitive than the JTT transmembrane, PHAT and BLOSUM matrices in identifying close homologs, and generates more accurate alignments in the transmembrane segments of GPCRs. | (Rios et al., 2015)57 |
    | | Hubsm | Matrix specific to hub proteins of protein-protein interaction networks. In hub protein-specific database searches and multiple sequence alignment construction, Hubsm matrices showed a high degree of sensitivity, specificity and accuracy compared with the BLOSUM and PAM series matrices. | (Renganayaki et al., 2017)66 |
    | | DUNMat | Matrix specific to intrinsically disordered proteins. Alignments of structurally characterized disordered regions of query proteins and their homologs were used to compile the scoring matrices. In comparison with BLOSUM and PAM matrices, the DUNMat matrix improved the specificity and sensitivity of detection of homologs with less than 50% sequence identity. | (Radivojac et al., 2002)73 |
    | | MidicMat | Matrix specific to intrinsically disordered proteins. The MidicMat matrix was computed by considering substitutions in aligned disordered-to-disordered regions of 1000 UNIREF protein sequences and their homologs. It assigns higher scores or smaller penalties to non-identical residue substitutions in disordered regions, where such changes are more likely to happen because of the higher evolutionary rate of these regions. | (Midic et al., 2009)74 |
    | | Disorder | An evolutionary model specific to disordered proteins. The utility of this matrix has not been tested for sequence analysis. Comparative analysis of amino acid frequencies suggests that the evolution of disordered proteins is most similar to that of the coils and turns of ordered proteins. | (Brown et al., 2010)79 |
    | | EDSSMat | Matrix specific to intrinsically disordered proteins/regions of eukaryotes. EDSSMat matrices were computed from disordered alignment blocks extracted from alignments of protein families by considering disorder and secondary structure prediction results. In contrast to routinely used standard search matrices and previously developed disordered protein-specific matrices, the EDSSMat matrices are able to identify both close and distant homologs of highly disordered proteins. | (Trivedi and Nagarajaram, 2019)80 |

    • Note: The general purpose matrices are specific to a large class of ordered/globular proteins or protein segments, whereas the specialized substitution scoring matrices represent the substitution patterns of residue pairs for distinct classes of proteins with biased compositions.

    2 GENERAL PURPOSE SUBSTITUTION MATRICES

    Scoring matrices that are used as default parameters in popular alignment tools (such as EMBOSS Needle, Clustal Omega, Muscle, T-Coffee, MAFFT),8-12 and homology search tools (like SSEARCH/FASTA, BLAST)13, 14 are collectively referred to as general purpose amino acid substitution scoring matrices. The PAM, BLOSUM, MD and VTML series of matrices are some of the most commonly used general purpose matrices.15-19 These matrices have been derived from pairwise substitution frequencies of amino acid residues as discerned from the alignments of proteins related at a given hierarchy level (class, family or superfamily), physico-chemical properties of residues, structure-based sequence alignments, and different structural environments of residues.

    2.1 Point accepted mutation matrix

    Margaret Dayhoff and colleagues, in the 1970s, were the first to develop amino acid substitution matrices from a set of closely related proteins (minimum 85% sequence identity). These matrices have since been referred to as point/percentage accepted mutation (PAM) matrices.15 By considering 1,572 mutations in phylogenetic trees of 71 protein families, Dayhoff computed the probability of a given amino acid being replaced by any other at a given evolutionary distance (in this case 1 accepted point mutation per 100 residues = PAM1). For protein comparison, these mutational probabilities between amino acids were normalized and represented on a logarithmic scale as log odd (LOD) scores. For a given amino acid pair, the LOD score in half-bit units is defined as two times the base-2 logarithm of the ratio of the observed to the expected frequency of occurrence of that pair. The general formula for computing the LOD score (Sij) for any pair of amino acids is as follows:
    $$S_{ij} = 2\log_2\left(\frac{q_{ij}}{p_i\,p_j}\right)$$

    Here $q_{ij}$ represents the observed frequency of occurrence of the amino acid pair i and j, and $p_i p_j$ is the expected frequency of occurrence of that pair.
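    As a worked illustration of the formula above, the following minimal Python sketch evaluates the half-bit LOD score for a single residue pair. The function name `lod_score` and all frequency values are hypothetical and chosen only so that the arithmetic is easy to follow; they are not taken from any published matrix.

```python
import math


def lod_score(q_ij, p_i, p_j):
    """Half-bit log-odds score: 2 * log2(q_ij / (p_i * p_j))."""
    return 2.0 * math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies, invented for illustration only.
p_leu, p_ile = 0.10, 0.05   # background frequencies of the two residues
q_observed = 0.01           # observed pair frequency; chance alone predicts 0.10 * 0.05 = 0.005

score = lod_score(q_observed, p_leu, p_ile)   # pair seen twice as often as expected
rounded = round(score)                        # published matrices usually store rounded integers

# A pair seen half as often as expected receives a negative score.
penalty = lod_score(0.0025, p_leu, p_ile)
```

    A pair observed more often than chance predicts receives a positive score, and a pair observed less often receives a negative one; a ratio of exactly one gives a score of zero.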

    In compiling the PAM matrices, every mutational event at a given site was considered to be independent of any previous changes at that site.15 According to this Markov model of protein evolution, repeated mutations over a longer period of evolutionary time would follow the same substitution pattern as those found in relatively short evolutionary distances. Hence, PAM matrices for more distantly related sequences were computed by extrapolating a matrix of closely related sequences. Since the direction of mutational events cannot be determined at a given site in a protein, the PAM matrices are symmetrical in nature.
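    The extrapolation described above can be sketched as repeated multiplication of a one-step mutation probability matrix by itself. The example below is only an illustration under stated assumptions: it uses an invented three-letter alphabet instead of the 20 amino acids, a row-stochastic convention (each row sums to 1), and hypothetical helper names `mat_mul` and `mat_pow`.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1-like" matrix: about 1% of positions change per step.
# The probabilities are invented for illustration.
pam1_like = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

# Under the Markov assumption, 250 steps are modelled as the 250th power.
pam250_like = mat_pow(pam1_like, 250)
```

    Each row of the extrapolated matrix still sums to 1, while the diagonal entries shrink towards the stationary composition, which reflects the growing chance that a site has changed over the longer distance.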

    2.2 PAM-based matrices

    Using large and diverse sets of protein sequences, several other matrices were developed using the PAM formalism.17, 20 The MD matrix developed by Thornton and colleagues is considered a modern version of the PAM matrices.17 The most recent siblings of the PAM matrices include the VTML matrices.18, 19 VTML matrices are developed by iteratively evaluating the substitution rates and evolutionary distances from a given set of pairwise sequence alignments using a maximum likelihood estimator and Dayhoff's model as the initial rate matrix. The pairwise alignments considered for matrix computation are obtained through the random selection of a pair of pre-aligned sequences from each protein family of the SYSTERS database.21 The use of a large, versatile dataset during computation of the VTML matrix series enabled these matrices to detect remote homologs with a higher degree of confidence, and also to construct high quality MSAs. The quality of an MSA is determined from the alignment accuracies achieved by the substitution matrices on different manually refined state-of-the-art protein alignment benchmark datasets (BALIBASE, PREFAB, SABmark, and OXBench).22-25

    2.3 Blocks substitution matrix

    BLOSUM, developed by Henikoff and Henikoff, was the next major development among the matrices.16 These matrices were developed from ungapped alignments (blocks) derived from evolutionarily divergent sequences of protein families. Alignment blocks are the conserved regions within related proteins that are assumed to be of higher functional relevance. Unlike PAM, where all the amino acid positions of proteins are scored, in BLOSUM only protein sites that are part of blocks are used to compute LOD scores (the ratio of the observed and expected frequencies of occurrence for a residue pair on a logarithmic scale), irrespective of the overall degree of similarity between the protein sequences. In alignments of evolutionarily divergent sequences, the BLOSUM matrices introduce fewer misaligned residues than the extrapolated PAM matrices, which are based on the assumption that the evolution of sequences over long evolutionary time scales can be well approximated by joining small changes that occur over a small evolutionary distance.16 The use of a large, evolutionarily diverse protein dataset to derive the BLOSUM series of scoring matrices has equipped these matrices with the ability to carry out sequence analysis over a wide range of evolutionary time scales.
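    The block-based counting described above can be sketched as follows. This is a simplified illustration, not the published procedure: it uses a single invented four-sequence block, omits the sequence clustering step that distinguishes the members of the BLOSUM series, and uses the hypothetical function name `block_log_odds`. Expected frequencies for unordered pairs of different residues are taken as 2·p_i·p_j, as in the Henikoff formulation.

```python
import math
from collections import Counter
from itertools import combinations


def block_log_odds(block):
    """Half-bit log-odds scores from the columns of one ungapped block."""
    pair_counts = Counter()
    for column in zip(*block):                      # walk the block column by column
        for a, b in combinations(column, 2):        # every pair of sequences in the column
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}   # observed pair frequencies

    p = Counter()                                   # background frequencies implied by the pairs
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores


# Invented toy block: four aligned sequences, three ungapped columns.
toy_block = ["AVL", "AVL", "AIL", "SVL"]
scores = block_log_odds(toy_block)
```

    With so little data the individual values carry no biological meaning; real matrices pool counts over many blocks, and pairs never observed in the data need separate handling that is left out here.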

    Like the PAM matrix series, the BLOSUM series of scoring matrices has also been used as the basis for the development of other matrices such as PBLOSUM, OPTIMA and so forth.26, 27 These newly developed substitution matrices perform better than BLOSUM in homolog searches and MSA construction. However, their compilation was based on the same small, outdated protein dataset that was used for the BLOSUM series.

    2.4 Other general purpose substitution matrices

    In addition to the most commonly used PAM and BLOSUM matrices, several other general purpose amino acid substitution matrices have been proposed over the years. Using the sequence alignments of proteins related at a given level (class, family or superfamily) as the basis of matrix score computation, scoring matrices were proposed by Benner and co-workers,28 Fan,29 and Steven and co-workers.30 Substitution matrices were also built from the physico-chemical properties of amino acids, such as those developed by Grantham,31 Miyata et al.,32 and MohanaRao.33 Apart from these matrices based on sequence alignments and residue properties, a number of substitution matrices were also developed from structure-based sequence alignments.34-39 Because different local structural environments (“side-chain accessibility,” “amino acid type,” “hydrogen bond formation,” and “secondary structure”) influence residue substitution patterns, several environment-specific scoring matrices were also constructed to improve alignment quality.4, 5, 40-42 Despite the development of these different substitution matrices, the PAM and BLOSUM series remain the most commonly used general purpose matrices. However, these general purpose scoring matrices fail to perform optimally for proteins with biased/non-standard amino acid compositions; therefore, different specialized scoring matrices have been developed to study specific classes of proteins.

    3 SPECIALIZED SUBSTITUTION MATRICES

    The sub-optimal performance of general scoring matrices for proteins with non-standard/biased amino acid compositions led researchers to compute specialized sets of amino acid substitution matrices. Biased compositions in proteins may result either from nucleotide compositional constraints such as AT/GC rich genomes, codon usage bias and so forth,43-46 or from functional constraints, as in hydrophobic or cysteine-rich proteins, transmembrane proteins, intrinsically disordered proteins and so forth. These specialized scoring matrices have been reported to be more sensitive than standard matrices at identifying biological associations among these proteins. Besides the development of specific scoring matrices for proteins with non-standard/biased amino acid compositions, there have also been a few attempts to transform general purpose matrices and use them under biased compositional conditions.47, 48 Various specialized scoring matrices for distinct classes of proteins with non-standard/biased amino acid compositions are summarized in Table 1.

    3.1 AT/GC enriched genomes specific substitution matrices

    The relative abundance of different nucleotides across genomes can vary substantially.49, 50 While some genomes are enriched in adenine and thymine (e.g., Plasmodium falciparum), others have an excess of guanine and cytosine nucleotides (e.g., Mycobacterium tuberculosis). Each of these nucleotide composition enrichments has been observed to be associated with specific adaptive advantages.51-53 If these nucleotide compositional constraints are manifested at non-synonymous sites of protein-coding genes, the amino acid compositions of such proteins are expected to change, over evolutionary time, in the direction anticipated by the underlying nucleotide bias.54 Furthermore, in response to this genome-level nucleotide compositional bias, many organisms manifest a preference for one codon over another for a given residue (codon usage bias). In general, individual codons with higher GC content are more frequent in GC-rich genomes, whereas AU-rich codons are highly common in AT-enriched genomes. Examples of this class of specialized amino acid substitution matrices include the matrices that were developed to analyze specific compositionally biased genomes such as Plasmodium (AT nucleotides rich).46, 48

    Paila et al. computed the PfSSM (P. falciparum Specific Substitution Matrix) series of symmetric and asymmetric matrices from the conserved alignment blocks derived from a unique dataset of 4,696 proteins containing P. falciparum sequences and their distant orthologs (BLAST hits with similar annotations). The PfSSM series of matrices was developed using Henikoff's formalism.16 Compared with the BLOSUM series of matrices, the PfSSM series was observed to be more sensitive during homolog searches, and gave rise to better quality alignments.

    Brick and Pizzi developed the CBM and CCF series of matrices by selecting alignment blocks from the BLOCKS database with a compositional bias similar to that of P. falciparum and P. yoelii proteins.55, 56 For both Plasmodium species, the sequence data were retrieved from the PlasmoDB database v5.0.57, 58 Only proteins that either have annotated non-Plasmodia orthologs or have been classified as hypothetical with no known orthologs in non-Plasmodia species were considered when studying residue compositional bias in Plasmodium species. With respect to the BLOSUM matrix series, the CBM and CCF matrices showed improved homolog search specificity and better filtering of false positives within Plasmodia species. These matrices even provide a better distinction between sub-families for members of P. falciparum multi-gene families.

    3.2 Trans-membrane/cysteine-rich proteins specific substitution matrices

    Different functional constraints have led to the enrichment of hydrophobic amino acids within the trans-membrane and cysteine-rich classes of proteins. The distinct composition of these proteins compared with that of proteins with standard compositions indicates that the substitution frequencies of the residues in trans-membrane and cysteine-rich proteins are also distinct. Therefore, in order to better understand the sequence, structure, function and evolution of trans-membrane and cysteine-rich proteins, substitution matrices specific to these classes of proteins have been developed. Examples of such substitution matrices include those developed for trans-membrane proteins, that is, the JTT transmembrane (JTTtm) matrix,59 the predicted hydrophobic and transmembrane (PHAT) matrix,60 the scorematrix leading to intra-membrane domains (SLIM) matrix,61 the beta-barrel transmembrane matrices (bbTM),62 the G protein-coupled receptor specific GPCRtm matrix and so forth.63

    3.2.1 JTT matrix

    Jones, Taylor and Thornton (JTT) developed a scoring matrix specific to integral membrane proteins by considering the observed mutations in the transmembrane segments.59 The dataset used to compile the JTT transmembrane matrix comprised 5,662 documented transmembrane segments from 1,765 Swissprot protein sequences and their close homologs (at least 85% sequence identity). The JTT transmembrane matrix was developed using a method very similar to that described by Dayhoff for PAM matrix compilation.15 Initially, single-spanning and multiple-spanning transmembrane segments were analyzed separately, and later combined to give the JTT transmembrane matrix. Alignment programs using a two-matrix scheme, that is, the JTT transmembrane matrix for membrane-spanning regions and another general matrix for polar flanking regions, showed improved accuracy in sequence alignments of integral membrane proteins. As the JTT transmembrane matrix was intended for producing good alignments of related transmembrane proteins, its utility for tasks like database searches was not tested.

    3.2.2 PHAT matrix

    In contrast to the generalized transmembrane matrix developed by Thornton and colleagues,59 the PHAT matrix is specific to α-helical membrane proteins.60 Two datasets of alignment blocks derived from the BLOCKS database were prepared for PHAT matrix compilation. The first dataset of 844 blocks with predicted transmembrane segments was used to compute target frequencies, and the second dataset of 514 blocks with predicted hydrophobic regions was used for calculating background frequencies of residues. The matrix series developed using these target and background frequencies was termed PHAT as it was constructed from predicted hydrophobic and transmembrane regions. These PHAT series matrices were observed to perform significantly better in homolog searches than the generalized matrices and the JTT transmembrane matrix for query sequences having transmembrane segments.

    3.2.3 SLIM matrix

    Similar to the PHAT matrix, the SLIM matrix is also specific to α-helical integral membrane proteins. However, the SLIM matrix is non-symmetrical in nature. This asymmetry is an outcome of the different background compositions of the transmembrane query proteins and unknown search space sequences.61 Background frequencies of PHDhtm (the same that were used for PHAT matrix creation) and VTML matrices were used as transmembrane query and database sequences background frequencies respectively. Target frequencies at a given evolutionary distance were derived from BLOSUM75/PHDhtm target frequencies and its associated stationary distributions from blocks alignments. Using these background and target frequencies the SLIM matrix was constructed, and its database search performance was assessed. SLIM matrix outperforms PHAT and BLOSUM series matrices both on simulated data and a real dataset of GPCRs from GPCRDB.64
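    One way to see how different query and database backgrounds lead to an asymmetric matrix is the small sketch below. It is an illustrative assumption rather than the published SLIM derivation: the score of a query residue i aligned to a database residue j is taken as a half-bit log-odds value with the query background for i and the database background for j, and all frequencies and the function name `directional_score` are invented.

```python
import math

# Invented background frequencies for two residues (L = leucine, K = lysine).
p_query_tm = {"L": 0.20, "K": 0.02}   # assumed transmembrane query background
p_database = {"L": 0.10, "K": 0.06}   # assumed general database background

# Assumed symmetric target (pair) frequency for an L/K alignment.
q_pair = {frozenset(("L", "K")): 0.004}


def directional_score(query_res, db_res):
    """Half-bit score using the query background for one residue and the
    database background for the other (illustrative assumption)."""
    q = q_pair[frozenset((query_res, db_res))]
    return 2 * math.log2(q / (p_query_tm[query_res] * p_database[db_res]))


s_l_to_k = directional_score("L", "K")   # query L aligned to database K
s_k_to_l = directional_score("K", "L")   # query K aligned to database L
```

    Even though the target frequency is the same in both directions, the two scores differ because the expected frequency in the denominator depends on which residue comes from the query and which from the database.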

    3.2.4 bbTM matrix

    Liang and colleagues developed scoring matrices specific to β-barrel membrane proteins. These include a large proportion of proteins in a typical genome.62, 65 A dataset of seven non-homologous β-barrel membrane proteins with known crystal structures and their homologs was used to compute the bbTM matrix. First, amino acid substitution rates were determined by implementing a Bayesian Markov chain Monte Carlo simulation approach.66 Then, similarity scoring matrices for each of these seven dataset proteins were calculated from substitution rates at different evolutionary distances and finally averaged into a bbTM matrix. In distant homolog searches for transmembrane β-barrel proteins, bbTM matrices are superior, with a better ability to detect true homologs (sensitivity) and to reject non-homologs (specificity), to the previously developed PHAT and SLIM matrices, which are specific to α-helical membrane proteins.

    3.2.5 GPCRtm matrix

    GPCRs are a major class of integral membrane proteins characterized by the presence of seven highly conserved α-helical transmembrane (TM) segments. Based on sequence similarities, the GPCRs have been further classified into six classes, with the rhodopsin family being the largest. GPCRtm is a GPCR rhodopsin family-specific matrix,63 and has been obtained from the curated alignments of transmembrane regions of 1019 rhodopsin family GPCR sequences using Henikoff's method.16 TM segments were identified based on the annotations of the UniProt and GPCRDB databases.67, 68 A comparison of the matrix scores of the GPCR rhodopsin family-specific GPCRtm matrix and the ordered/globular protein-specific BLOSUM62 matrix is shown in Figure 1a. The diagonal matrix scores allow the likelihood of each amino acid to mutate (its relative mutability) to be estimated. The low matrix scores (≤ 2) of the hydrophobic residues isoleucine (I), alanine (A), phenylalanine (F), leucine (L), and valine (V) indicate a higher relative mutability of these residues in the transmembrane regions of GPCRs. In contrast, the lowest relative mutability is indicated by the high matrix scores (≥ 7) of asparagine (N), aspartic acid (D), arginine (R), tryptophan (W), and proline (P). All these residues (N, D, R, W, and P) show a high degree of conservation in the transmembrane helices of GPCRs.69, 70 Also, the properties displayed by the GPCRtm matrix lie between those of the generalized transmembrane JTTtm matrix and the water-soluble globular protein-specific BLOSUM62 matrix, as shown in Figure 1b. The substitution frequencies of polar and charged residues (lysine [K], arginine [R], histidine [H], aspartic acid [D], glutamic acid [E], asparagine [N] and glutamine [Q]) in GPCRtm are higher than in BLOSUM62 (a greater number of positive difference matrix scores), but lower than in the JTTtm matrix (a greater number of negative difference matrix scores). Altogether, analysis of the GPCRtm matrix scores revealed that the GPCRs possess distinctive frequencies of polar and charged residues, very different from those of other membrane proteins and of proteins in general.

    FIGURE 1. Comparison of matrix scores of general purpose and specialized amino acid substitution scoring matrices. (a) The upper and lower halves of the matrix represent the values of the ordered/globular protein-specific general purpose BLOSUM62 matrix and of the rhodopsin family-specific GPCRtm matrix for G protein-coupled receptors (a major class of integral membrane proteins), respectively. In the lower half, low diagonal scores indicate higher relative mutability of hydrophobic amino acids (A, F, L, I, V, etc.), whereas high diagonal scores indicate lower mutability of other residues (N, D, R, W, P, etc.). (b) Difference matrix derived by subtracting the matrix scores of the generalized transmembrane JTTtm matrix (lower) and of BLOSUM62 (upper) from the GPCRtm values. The substitution frequencies of polar and charged residues (K, R, H, D, E, N, Q) in GPCRtm are higher than in BLOSUM62 but lower than in the JTTtm matrix

    In the identification of distant homologs, the GPCRtm matrix was observed to be more sensitive than the JTTtm, PHAT, and BLOSUM series of matrices. The GPCRtm matrix also performs better than other general purpose matrices in resolving hits at the subfamily level. To test alignment accuracy, pairwise alignments of the rhodopsin family proteins adenosine A2A receptor (AA2AR) and sphingosine-1-phosphate receptor 1 (S1PR1) were performed using the GPCRtm and other scoring matrices.63 A comparison of the alignments with available three-dimensional structural data revealed that the GPCRtm matrix generates more accurate alignments in the transmembrane segments of GPCRs than the BLOSUM, PHAT, and JTTtm matrices.63

    3.3 Matrices specific to hubs in protein–protein interaction networks

    The central proteins that can interact with a large number of other proteins in protein–protein interaction networks (PPINs) are known as hubs. The hub proteins of PPINs are mostly composed of low complexity disordered regions.71 Therefore, the frequencies of residue substitution in these proteins are expected to differ from those in other proteins. Recently, a substitution matrix specific to PPIN hub proteins, Hubsm, was constructed using protein–protein interaction data from the InterPro database for the six model organisms Escherichia coli, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Arabidopsis thaliana, and Homo sapiens.72 The connectivity fold change, defined as the ratio of the species-specific connectivity threshold value to the average connectivity (total number of interactions/total number of proteins), was used in each model organism to classify a protein as a hub protein.73, 74 The alignments of domains of hub proteins enriched in disordered and low complexity regions were then used to compute the scoring matrix using an implementation of Dayhoff's method.15 In hub protein-specific database searches, the performance of the scoring matrices was evaluated using a coverage measure (sensitivity) at a given errors per query (specificity) threshold. While the coverage value reflects the total number of true homologs identified in the database, the errors per query value limits the number of false positives. At 0.001 error per query value, Hubsm matrices achieved a coverage value of ~0.14 whereas BLOSUM and PAM matrices achieved significantly lower coverage of ~0.8 and ~0.3 respectively. In contrast to the BLOSUM and PAM matrices, the Hubsm matrix retrieved a total of 1,536 unique homologs in homolog searches of 400 hub proteins against the ASTRAL databases.75, 76
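
    The coverage versus errors per query evaluation described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the evaluation code of the Hubsm study: hits from all queries are pooled and ranked by score, coverage is taken as the fraction of known true homolog relationships recovered, and errors per query is the number of false positives divided by the number of queries. The function name, its parameters, and the toy hit list are all hypothetical.

```python
def coverage_at_epq(hits, n_queries, n_true_pairs, max_epq):
    """Coverage reached before errors per query (EPQ) exceeds max_epq.

    hits: iterable of (score, is_true_homolog) pooled over all queries,
          where a higher score means a more confident hit (assumption).
    n_queries: number of query sequences searched.
    n_true_pairs: total number of true homolog relationships in the database.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    true_found = 0
    false_found = 0
    coverage = 0.0
    for _score, is_true in ranked:
        if is_true:
            true_found += 1
        else:
            false_found += 1
        if false_found / n_queries > max_epq:
            break  # too many false positives per query; stop here
        coverage = true_found / n_true_pairs
    return coverage


# Toy pooled hit list: (score, is_true_homolog); values are made up.
toy_hits = [(90, True), (80, True), (70, False), (60, True), (50, False)]
strict = coverage_at_epq(toy_hits, n_queries=2, n_true_pairs=4, max_epq=0.0)
relaxed = coverage_at_epq(toy_hits, n_queries=2, n_true_pairs=4, max_epq=0.5)
```

    Comparing matrices at a fixed errors per query value, as in the text, then amounts to calling such a function once per matrix with the same threshold and comparing the returned coverage values.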

    3.4 Intrinsically disordered proteins specific substitution matrices

    Intrinsically disordered proteins or protein segments are enriched in polar and charged amino acids and have a distinct evolutionary rate. Therefore, the substitution frequencies of residues in these disordered proteins/regions are expected to differ from those in ordered regions. Over the years, several studies have examined the sequence complexity, amino acid composition, and average evolutionary rates of intrinsically disordered proteins/regions.77-80 There have also been several attempts to develop amino acid substitution matrices specific to intrinsically disordered proteins or regions. Henceforth, the disordered protein/region-specific scoring matrices developed by Dunker and coworkers, Obradovic and colleagues, and Brown and coworkers will be referred to as the DUNMat, MidicMat, and Disorder series matrices, respectively.77-79 More recently, the eukaryotic disorder substitution scoring matrix (EDSSMat) series, specific to eukaryotic disordered proteins/regions, has also been developed.81

    3.4.1 DUNMat matrix

    The first attempt to analyze sequence alignments of intrinsically disordered proteins was made by Dunker and coworkers.77 In this study, a disordered region-specific matrix (referred to here as DUNMat) was developed using an iterative algorithm for realigning and recalculating matrix values from a dataset of 55 protein families with structurally characterized disordered regions of ≥40 consecutive residues. Homologs of the given protein families were identified using the BLAST algorithm. The initial alignment of the dataset proteins and their homologs was carried out using the BLOSUM matrix. In cases where a dataset protein possessed both ordered and disordered regions, the regions aligning to the structurally characterized disordered regions were also assumed to be disordered. The DUNMat matrix improved the specificity and sensitivity of detection of homologs with less than 50% sequence identity. However, the circumstances in which the DUNMat matrix should be preferred over the general purpose BLOSUM and PAM matrices were not discussed.

    3.4.2 MidicMat matrix

    Obradovic and coworkers developed a 40 × 40 matrix (referred to here as MidicMat) composed of four 20 × 20 sub-matrices describing substitutions between aligned: (a) ordered and ordered regions, (b) ordered and disordered regions, (c) disordered and ordered regions, and (d) disordered and disordered regions.78 In this study, 1,000 protein sequences from different families (referred to as “anchors”) were randomly selected from the UNIREF database, and their homologs were identified using BLAST. For each of these anchors and their homologs, disorder was predicted using the VSL2B predictor,82 and initial alignments were generated using the BLOSUM62 matrix and the ClustalW alignment tool.83 As with the DUNMat matrix, the development of MidicMat involved an iterative procedure of realignment and recalculation of matrix values until the changes between iterations became negligible. In contrast to the BLOSUM62 scoring matrix, which tends to penalize substitutions involving non-identical residues, the MidicMat matrix assigns higher scores or smaller penalties to non-identical residue substitutions in disordered regions, where such changes are more likely to occur due to the higher evolutionary rate.
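
    The block structure of a 40 × 40 matrix built from four 20 × 20 sub-matrices can be made concrete with a short Python sketch. The layout used here (ordered symbols in rows and columns 0 to 19, disordered symbols in 20 to 39), the residue ordering, the helper names, and the placeholder score are all assumptions made for illustration; the actual MidicMat layout and values may differ.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed residue order
STATES = ("O", "D")                   # "O" = ordered, "D" = disordered


def symbol_index(residue, state):
    """Row/column index of a (residue, state) symbol in the 40 x 40 matrix."""
    return STATES.index(state) * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)


def sub_matrix(matrix40, row_state, col_state):
    """Extract one of the four 20 x 20 blocks, e.g. disordered vs disordered."""
    n = len(AMINO_ACIDS)
    r = STATES.index(row_state) * n
    c = STATES.index(col_state) * n
    return [row[c:c + n] for row in matrix40[r:r + n]]


# Empty 40 x 40 matrix with one placeholder score (not a MidicMat value)
# for serine aligned to threonine when both lie in disordered regions.
matrix40 = [[0] * 40 for _ in range(40)]
matrix40[symbol_index("S", "D")][symbol_index("T", "D")] = 1

dd_block = sub_matrix(matrix40, "D", "D")
```

    With such an indexing scheme, an aligner only needs each residue's order/disorder annotation to look up the score in the appropriate block.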

    3.4.3 Disorder series matrix

    Brown and coworkers developed a disordered protein-specific evolutionary model using putative homologs of 287 experimentally characterized disordered proteins from the DisProt database.79, 84 The substitution matrices (referred to here as the Disorder series matrices) were developed at three different percentage sequence identity ranges, that is, >85% (Disorder85), 85–60% (Disorder60), and 60–40% (Disorder40), using an iterative protocol of realignment and recalculation of matrix values. Initial pairwise alignments were generated using the European Molecular Biology Open Software Suite (EMBOSS) needle alignment program [Matrix = BLOSUM62, gap opening penalty = 10, and gap extension penalty = 0.5].8, 85 In these Disorder series matrices, a much higher likelihood of non-conservative substitutions was observed than in the ordered protein-specific matrices developed in the same study. A comparative analysis of amino acid frequencies also suggests that the evolution of disordered proteins is most similar to that of the coils and turns of ordered proteins. However, the authors of that study clearly state that these matrices are intended only for comparative studies of disordered and ordered protein evolution rather than for improving sequence alignments.79
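
    Assigning aligned sequence pairs to the three identity ranges named above can be sketched as follows. This is a minimal illustration, not the protocol of the original study: percent identity is computed here over columns in which neither sequence has a gap, and the handling of the boundary values (exactly 85%, 60%, or 40%) is an assumption, since the text does not specify it. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def disorder_bin(identity):
    """Map a percent identity to a Disorder series bin (boundaries assumed)."""
    if identity > 85:
        return "Disorder85"
    if identity > 60:
        return "Disorder60"
    if identity > 40:
        return "Disorder40"
    return None  # pairs at or below 40% identity fall outside all three bins


# Toy aligned pair: 3 identical residues out of 4 ungapped columns.
identity = percent_identity("ACDE-", "ACDQG")
bin_name = disorder_bin(identity)
```

    Substitution counts would then be accumulated separately for the pairs in each bin before the matrix values are calculated.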

    3.4.4 EDSSMat series matrix

    The EDSSMat matrices were developed by considering exclusively the amino acid substitution frequencies in disordered regions of a curated dataset of about 4,000 eukaryotic protein families, using Henikoff's formalism of matrix development.16, 81 For matrix compilation, only those residues in the protein family alignments that were predicted as disordered by IUPred and as part of a coil region by SSPro were considered.86, 87 The alignments were generated with the PRANK+F tool (gap opening rate = 0.005, gap extension probability = 0.5, number of iterations = 5), which is known to place insertions in accordance with phylogeny and to prevent overestimation of deletion events.88 Compared with routinely used general purpose matrices (such as BLOSUM, PAM, MD, and VTML) and previously developed disordered region-specific matrices (such as the DUNMat, MidicMat, and Disorder series), the EDSSMat matrices have been shown to identify both close and distant homologs with better sensitivity for proteins enriched in disordered regions.
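
    The core of a Henikoff-style (BLOSUM-type) calculation, in which observed pair frequencies from alignment columns are compared with the frequencies expected from residue composition, can be sketched in Python as below. The sketch makes several simplifying assumptions that are not taken from the EDSSMat study: sequences are not clustered or weighted, pairs that are never observed receive no score, gaps are ignored, and scores are expressed in half-bit units rounded to integers. The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(columns):
    """Count unordered residue pairs within each alignment column."""
    counts = Counter()
    for column in columns:
        residues = [r for r in column if r != "-"]  # ignore gaps
        for a, b in combinations(residues, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """Half-bit log-odds scores s_ij = 2 * log2(q_ij / e_ij), rounded."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}  # observed pair frequencies

    # Residue frequency p_i = q_ii + sum over j != i of q_ij / 2
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency: p_i^2 on the diagonal, 2 * p_i * p_j off it
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Toy alignment columns over a two-letter alphabet (illustrative only).
toy_columns = ["AAA", "AAS", "SSS", "AAA"]
toy_scores = log_odds_scores(pair_counts(toy_columns))
```

    Restricting the input columns to positions that pass a filter, for example residues predicted as both disordered and coil, changes only which columns are passed to `pair_counts`; the log-odds step itself stays the same.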

    4 CONCLUSION

    Advancing technologies have scaled up proteome-wide research, and an ever-increasing number of gene sequences is being determined. Such advances have also contributed to the growth of bioinformatics, and a broad range of tools for protein sequence analysis has been created. There have been continuous efforts to improve the performance of these tools by developing novel substitution scoring matrices and alignment algorithms. While modified versions of the dynamic programming alignment algorithm have made it possible to score and penalize distinct segments of a sequence differently using different gap parameters and scoring matrices, the development of optimized substitution scoring matrices is still ongoing.

    As discussed in this review, the initial developments of substitution scoring matrices did not consider compositional differences among proteins. They were mostly based on sequence or structure-based sequence alignments of related proteins, physico-chemical properties, and the different structural environments of residues. These early scoring matrices achieved varying degrees of success in homolog detection (pairwise alignments) and in the construction of MSAs. After extensive usage, the PAM/PAM-based and BLOSUM series matrices emerged as the most commonly used general purpose matrices; they are specific to a large class of ordered/globular proteins or protein segments. However, these general purpose scoring matrices fail to perform optimally in various tasks involving proteins with biased compositions. To overcome these limitations, there have been sustained efforts to compute matrices specific to proteins/segments enriched in a particular type of residue because of either nucleotide compositional or functional constraints. Specialized scoring matrices for AT/GC rich genomes, α-helical and β-barrel integral membrane proteins, hub proteins of PPINs, the GPCR rhodopsin family, and intrinsically disordered proteins are the outcomes of such endeavors. Interestingly, although these specialized scoring matrices have been validated for various practical applications, they have rarely become part of database search and alignment tools. In the future, we can expect sequence analysis tools that classify proteins or protein segments into specific categories based on their residue compositions and apply the appropriate substitution scoring matrix, gap parameters, and alignment algorithm to generate optimal alignments. Protein sequence analysis based on such refined alignments may enable more conclusive studies of protein sequence, structure, function, and evolution.

    ACKNOWLEDGMENTS

    The authors are thankful to the Centre for DNA Fingerprinting and Diagnostics (CDFD) and to the Department of Biotechnology, Government of India, sponsored Bioinformatics Infrastructure Facility (BIF) at the School of Life Sciences, University of Hyderabad, for providing the computational facilities and literature that helped in completing this review.

      AUTHOR CONTRIBUTIONS

      Rakesh Trivedi: Conceptualization; writing-original draft; writing-review and editing. Hampapathalu Adimurthy Nagarajaram: Conceptualization; writing-original draft; writing-review and editing.

      CONFLICT OF INTEREST

      The authors declare no conflicts of interest with the contents of this article.
