Alignment accuracy
The standard method for measuring the accuracy of multiple alignment algorithms is to use benchmark test sets of reference alignments, generated with reference to three‐dimensional structures. Here, we present results from a range of packages tested on three benchmarks: BAliBASE (
Thompson et al, 2005), Prefab (
Edgar, 2004) and an extended version of HomFam (
Blackshields et al, 2010). For these tests, we report results using the default settings for all programs, with two exceptions that were needed to allow MUSCLE (
Edgar, 2004) and MAFFT to align the biggest test cases in HomFam. First, for test cases with >3000 sequences, we ran MUSCLE with the –maxiters parameter set to 2, in order to finish the alignments in reasonable times. Second, we ran several different programs from the MAFFT package. MAFFT (
Katoh et al, 2002) consists of a series of programs that can be run separately or called automatically from a script with the ‐‐auto flag set. This flag selects a slow, consistency‐based program (L‐INS‐i) when the number and lengths of the sequences are small. When these exceed inbuilt thresholds, a conventional progressive aligner (FFT‐NS‐2) is used instead. The latter is also the program that is run by default if MAFFT is called with no flags set. For very large data sets, the
‐‐parttree flag must be set on the command line and a very fast guide tree calculation is then used.
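The protocol above can be sketched as a pair of small command-building helpers. The flag spellings (MUSCLE's -maxiters, MAFFT's --auto and --parttree) and the thresholds (>3000 sequences for MUSCLE, the 20 000-sequence limit of default-mode MAFFT) follow the text; the function names and file-handling conventions are illustrative assumptions, not the authors' actual benchmarking scripts.

```python
def muscle_command(n_seqs, in_file, out_file):
    """Build a MUSCLE command line; illustrative sketch only."""
    cmd = ["muscle", "-in", in_file, "-out", out_file]
    if n_seqs > 3000:
        # Limit iterations so the largest HomFam cases finish in reasonable time.
        cmd += ["-maxiters", "2"]
    return cmd

def mafft_command(n_seqs, in_file):
    """Build a MAFFT command line; MAFFT writes the alignment to stdout."""
    if n_seqs > 20000:
        # Default-mode MAFFT is limited to 20 000 sequences; very large sets
        # need the fast --parttree guide-tree calculation.
        return ["mafft", "--parttree", in_file]
    # --auto picks L-INS-i for small inputs and FFT-NS-2 for larger ones.
    return ["mafft", "--auto", in_file]
```

Such a dispatch makes the two exceptions to the "default settings" rule explicit and reproducible.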
The results for the BAliBASE benchmark tests are shown in
Table I. BAliBASE is divided into six ‘references.’ Average scores are given for each reference, along with total run times and average total column (TC) scores, which give the proportion of the total alignment columns that is recovered. A score of 1.0 indicates perfect agreement with the benchmark. There are two rows for the MAFFT package: MAFFT (auto) and MAFFT (default). In most (203 out of 218) BAliBASE test cases, the number of sequences is small and the script runs L‐INS‐i, which is the slow, accurate program that uses the consistency heuristic (
Notredame et al, 2000) that is also used by MSAprobs (
Liu et al, 2010), Probalign, Probcons (
Do et al, 2005) and T‐Coffee. These programs are all restricted to small numbers of sequences but tend to give accurate alignments. This is clearly reflected in the times and average scores in
Table I. The times range from 25 min up to 22 h for these packages and the accuracies range from 55 to 61% of columns correct. Clustal Omega only takes 9 min for the same runs but has an accuracy level that is similar to that of Probcons and T‐Coffee.
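The TC score used throughout these tables can be computed by treating each alignment column as a tuple of per-sequence residue indices and asking how many reference columns reappear intact in the test alignment. The sketch below is a minimal illustration of that idea, not the scoring code of the benchmark suites themselves.

```python
def alignment_columns(aln):
    """Map each column of an alignment (list of equal-length gapped strings)
    to a tuple of residue indices, with None marking a gap."""
    counters = [0] * len(aln)
    cols = []
    for j in range(len(aln[0])):
        col = []
        for i, seq in enumerate(aln):
            if seq[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols

def tc_score(ref, test):
    """Fraction of reference columns recovered exactly in the test alignment."""
    ref_cols = alignment_columns(ref)
    test_cols = set(alignment_columns(test))
    return sum(1 for c in ref_cols if c in test_cols) / len(ref_cols)
```

For example, a test alignment that reproduces two of three reference columns scores 2/3, and perfect agreement scores 1.0, matching the convention used in the tables.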
The rest of the table is mainly taken up by programs that use progressive alignment. Some of these are very fast, but this speed comes at the cost of a considerable drop in accuracy compared with the consistency‐based programs and Clustal Omega. The weakest program here is Clustal W (
Larkin et al, 2007) followed by PRANK (
Löytynoja and Goldman, 2008). PRANK is not designed for aligning distantly related sequences but for producing good alignments for phylogenetic work, with special attention to gaps. These gap positions are not included in these tests, as they tend not to be structurally conserved. Dialign (
Morgenstern et al, 1998) does not use consistency or progressive alignment but is based on finding best local multiple alignments. FSA (
Bradley et al, 2009) uses sampling of pairwise alignments and ‘sequence annealing’ and has been shown to deliver good nucleotide sequence alignments in the past.
The Prefab benchmark test results are shown in
Table II. Here, the results are divided into five groups according to the percent identity of the sequences. The overall scores range from 53 to 73% of columns correct. The consistency‐based programs MSAprobs, MAFFT L‐INS‐i, Probalign, Probcons and T‐Coffee are again the most accurate but with long run times. Clustal Omega is close to the consistency programs in accuracy but is much faster. There is then a gap to the faster progressive aligners MUSCLE, MAFFT, Kalign (
Lassmann and Sonnhammer, 2005) and Clustal W.
Results from testing large alignments with up to 50 000 sequences are given in
Table III using HomFam. Here, each alignment is made up of a core of a Homstrad (
Mizuguchi et al, 1998) structure‐based alignment of at least five sequences. These sequences are then inserted into a test set of sequences from the corresponding homologous Pfam domain. This gives very large sets of sequences to be aligned, but the testing is only carried out on the sequences with known structures. Only some programs are able to deliver alignments at all with data sets of this size. We restricted the comparisons to Clustal Omega, MAFFT, MUSCLE and Kalign. MAFFT with default settings has a limit of 20 000 sequences, and we only use MAFFT with ‐‐parttree for the last section of
Table III. MUSCLE becomes increasingly slow beyond 3000 sequences. Therefore, for >3000 sequences we used MUSCLE with the faster but less accurate setting of –maxiters 2, which restricts the number of iterations to two.
Overall, Clustal Omega is easily the most accurate program in
Table III. The run times show MAFFT default and Kalign to be exceptionally fast on the smaller test cases and MAFFT ‐‐parttree to be very fast on the biggest families. Clustal Omega does scale well, however, with increasing numbers of sequences. This scaling is described in more detail in the
Supplementary Information. We do have two further test cases with >50 000 sequences, but it was not possible to get results for these from MUSCLE or Kalign. These are described in the
Supplementary Information as well.
Table III gives overall run times for the four programs evaluated with HomFam.
Figure 1 resolves these run times case by case. Kalign is very fast for small families but does not scale as well. MAFFT is faster than the other programs across all test case sizes, but Clustal Omega scales similarly. Points in
Figure 1 represent different families with different average sequence lengths and pairwise identities. Therefore, the scalability trend is fuzzy, with larger dots occurring generally above smaller dots.
Supplementary Figure S3 shows scalability data, where subsets of increasing size are sampled from one large family only. This reduces variability in pairwise identity and sequence length.
External profile alignment
Clustal Omega can read extra information from a profile HMM derived from preexisting alignments. For example, if a user wishes to align a set of globin sequences and has an existing globin alignment, this alignment can be converted to a profile HMM and used alongside the sequence input file. This HMM is here referred to as an ‘external profile’ and its use in this way as ‘external profile alignment’ (EPA). During EPA, each sequence in the input set is aligned to the external profile. Pseudocount information from the external profile is then transferred, position by position, to the input sequence. Ideally, this would be used with large curated alignments of particular proteins or domains of interest, such as are used in metagenomics projects. Rather than aligning the input sequences from scratch every time new sequences are found, the alignment should be carefully maintained and used as an external profile for EPA. Clustal Omega can also align sequences to existing alignments using conventional alignment methods. Users can add sequences to an alignment one by one, or align a set of aligned sequences to the alignment.
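The position-by-position transfer described above can be sketched roughly as follows. The blending weight, the uniform handling of unmatched positions and all names here are assumptions for illustration only, not Clustal Omega's actual pseudocount scheme.

```python
def transfer_pseudocounts(seq_profile, hmm_emissions, match_map, weight=0.5):
    """Blend each input-sequence position with the emission distribution of
    the external-profile match state it aligned to (illustrative sketch).

    seq_profile   : per-position amino-acid distributions for one sequence
    hmm_emissions : per-match-state distributions from the external HMM
    match_map     : match_map[i] = HMM state aligned to position i, or None
    weight        : assumed blending weight, chosen arbitrarily here
    """
    out = []
    for i, dist in enumerate(seq_profile):
        state = match_map[i]
        if state is None:
            # Positions with no matched state are left unchanged.
            out.append(list(dist))
        else:
            hmm_dist = hmm_emissions[state]
            out.append([(1 - weight) * p + weight * q
                        for p, q in zip(dist, hmm_dist)])
    return out
```

In practice the user simply supplies the HMM alongside the sequence input file on the command line (e.g. via Clustal Omega's ‐‐hmm‐in option), and the program performs this transfer internally.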
In this paper, we demonstrate the EPA approach with two examples. First, we take the 94 HomFam test cases from the previous section and use the corresponding Pfam HMM for EPA. Before EPA, the average accuracy for the test cases was 0.627 of correctly aligned Homstrad positions; after EPA, it rises to 0.653. This is plotted, test case by test case, in
Figure 2A. Each dot is one test case with the TC score for Clustal Omega plotted against the score using EPA. The second example is illustrated in
Figure 2B. Here, we take all the BAliBASE reference sets and align them as normal using Clustal Omega and obtain the benchmark result of 0.554 of columns correctly aligned, as already reported in
Table I. For EPA, we use the benchmark reference alignments themselves as external profiles. The results now jump to 0.857 of columns correct. This is a jump of over 30 percentage points, and while it is not a valid measure of Clustal Omega accuracy for comparison with other programs, it does illustrate the potential power of EPA to exploit information in external alignments.