Using OrthoMCL to Assign Proteins to OrthoMCL-DB Groups or to Cluster Proteomes Into New Ortholog Groups
Abstract
OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity. OrthoMCL-DB is a public database that allows users to browse and view ortholog groups that were pre-computed using the OrthoMCL algorithm. Version 4 of this database contained 116,536 ortholog groups clustered from 1,270,853 proteins obtained from 88 eukaryotic genomes, 16 archaean genomes, and 34 bacterial genomes. Future versions of OrthoMCL-DB will include more proteomes as more genomes are sequenced. Here, we describe how you can group your proteins of interest into ortholog clusters using two different means provided by the OrthoMCL system. The OrthoMCL-DB Web site has a tool for uploading and grouping a set of protein sequences, typically representing a proteome. This method maps the uploaded proteins to existing groups in OrthoMCL-DB. Alternatively, if you have proteins from a set of genomes that need to be grouped, you can download, install, and run the stand-alone OrthoMCL software. Curr. Protoc. Bioinform. 35:6.12.1-6.12.19. © 2011 by John Wiley & Sons, Inc.
Introduction
An OrthoMCL group is a set of proteins across one or more species that represent putative orthologs and in-paralogs. Groups are formed by running the OrthoMCL (Li et al., 2003) software on the sequences from a set of proteomes (a proteome is the set of proteins that belong to a species; typically, its sequences are derived through the annotation of a complete genome). The input would be anywhere from a few to hundreds of proteomes.
OrthoMCL-DB (Chen et al., 2006) contains ortholog groups for most completely sequenced and annotated eukaryotes and for a number of completely sequenced and annotated prokaryotes. It provides a wealth of functionality, including domain architecture for each group, phyletic patterns for each group, and advanced querying, including phylogenetic pattern searches. Yet, it is in a sense limited by the genomes that are included.
Overview of algorithm
OrthoMCL groups proteins into “ortholog groups.” That name is a little misleading because the groups contain proteins related by:
orthology (recent descent)
in-paralogy (recent duplication)
co-orthology (recent descent and duplication).
The computation is done on protein sequence because it is more sensitive than genomic sequence, but by definition the evolutionary relationships the algorithm attempts to discover exist in genome space. A complication of using proteins is that proteomes may contain alternative proteins (from alternative transcription). As discussed in Phase 1 below, this can be partially addressed by filtering.
There are a number of approaches to identifying groups of proteins with a common evolutionary history (Chen et al., 2007). These include methods based on phylogeny, evolutionary distance metrics, and sequence similarity (BLAST; Altschul et al., 1990). OrthoMCL uses the reciprocal best BLAST hit approach and makes an adjustment for species distance (normalization) to distinguish orthologs from in-paralogs. Groups are generated from the normalized BLAST scores between proteins using Markov clustering (Enright et al., 2002). Figure 1 provides an overview of the approach. Please refer to the OrthoMCL Algorithm Document (https://docs.google.com/View?id=dd996jxg_1gsqsp6) for details about how OrthoMCL-DB is created.
Overview of the protocols
The protocols described here will help you extend OrthoMCL-DB groups to include your own proteins (
Using
Choose
Strategic Planning
Alternative splice forms
Higher eukaryotes commonly have genes with alternative splice forms, leading to multiple proteins per gene. These “alternative proteins” can be either included or excluded in the input data (
There are theoretical and practical considerations when making the decision about including or excluding alternative proteins. OrthoMCL-DB Version 4 included only the longest splice form.
The theoretical reasons for excluding them are:
To avoid “pseudo-in-paralogs,” which are alternative proteins within a group that therefore look like in-paralogs. (If needed, these could be distinguished using the gene-->protein mapping files.)
Group statistics (counts, coherence) are in terms of genes, and are not skewed by the presence of highly similar pseudo-in-paralogs. This helps avoid inaccuracy in terms of gene duplication events or paralogs.
The theoretical reasons against excluding them are:
Doing so may exclude important functional components of genes that are omitted in the longest variant.
Doing so may exclude distinct proteins that are omitted in the longest variant, for example, if an alternative splice form introduces a frame shift that leads to a very different protein.
The practical reasons for excluding them are:
A smaller input dataset, which requires fewer resources.
The practical reasons against excluding them are:
Requires a gene-->protein mapping file.
Many providers of protein sequence do not provide such a mapping. If you find one that does, it may not have the latest version of the protein sequences, so you may be forced into a tradeoff between getting the preferred protein sequences and getting a gene-protein mapping file.
Resources
Choosing proteomes
For
Basic Protocol 1: Assign a Proteome to OrthoMCL-DB Groups
Use this tool to assign the proteins in a genome to OrthoMCL-DB Groups and to find groups of in-paralogs. You will upload your genome's proteins as a FASTA file to a service on the OrthoMCL-DB Web site that maps them to OrthoMCL-DB groups. The result is a set of files, as described below (your proteins are not incorporated into the OrthoMCL-DB site). Because your proteins are not actually included in an OrthoMCL-DB build process, which is a multi-week effort, the results will be an approximation to the result of including them in the OrthoMCL-DB build, but not identical. While the standard use of this service is to upload the complete set of proteins of a single proteome, you can also upload a set of proteins that are from a partial proteome or many proteomes. In both cases, you will get valid assignments of your proteins to OrthoMCL-DB groups. However, in the former case, you will also get a valid set of in-paralog groupings, while in the latter you will not.
How the tool processes your proteins
This protocol maps your proteins to groups in OrthoMCL-DB. To understand how OrthoMCL-DB is created, please see the OrthoMCL-DB Methods page.
Phase 1
In this phase, the tool maps your proteins to groups in OrthoMCL-DB. It performs a BLASTP against all the proteins in OrthoMCL-DB, using an e-value cutoff of 1e-5 and a percent match cutoff of 50%. Each of your proteins is assigned to the group containing its best hit. If the best-matching OrthoMCL-DB protein does not have a group, your protein is assigned to NO_GROUP.
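The Phase 1 rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the code the Web site runs: the Hit record, the function name, the group IDs, and the choice of "lowest e-value" as the definition of best hit are all assumptions made for the example.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str            # ID of one of your uploaded proteins
    subject: str          # ID of the OrthoMCL-DB protein that was hit
    evalue: float         # BLASTP e-value of the hit
    percent_match: float  # percent match of the hit

def assign_groups(hits, subject_group, max_evalue=1e-5, min_percent_match=50.0):
    """Map each query to the group of its best above-threshold hit.

    Queries with no above-threshold hit are left out of the result
    (in the protocol, those proteins go on to Phase 2).
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.percent_match < min_percent_match:
            continue  # below threshold: ignore this hit
        current = best.get(hit.query)
        # Assumption: the "best" hit is the one with the lowest e-value.
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {q: subject_group.get(h.subject, "NO_GROUP") for q, h in best.items()}

# Hypothetical hits and group IDs, invented for illustration.
hits = [
    Hit("myP1", "hsap|A", 1e-50, 90.0),
    Hit("myP1", "mmus|B", 1e-20, 80.0),
    Hit("myP2", "scer|C", 1e-30, 70.0),  # best hit has no group
    Hit("myP3", "hsap|A", 1e-3, 95.0),   # e-value above the cutoff
]
subject_group = {"hsap|A": "OG_1", "mmus|B": "OG_2"}
assignments = assign_groups(hits, subject_group)
print(assignments)
```

In this toy input, myP3 receives no assignment because its only hit fails the e-value cutoff.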
Phase 2
In this phase, the tool processes all of your proteins that did not have an above-threshold match in phase 1. It uses the OrthoMCL-DB in-paralog algorithm described in the OrthoMCL Algorithm Document (https://docs.google.com/View?id=dd996jxg_1gsqsp6) to create pairs of potential in-paralogs. It then submits those pairs to the MCL program for clustering. The result is groups of your proteins that provisionally represent in-paralogs.
Necessary Resources
Hardware
A computer with an Internet connection
Software
An Internet browser, e.g., Internet Explorer (http://www.microsoft.com/ie), Firefox (http://www.mozilla.org/firefox), or Safari (http://www.apple.com/safari).
A program to unpack .zip files (these are standard on Macintoshes and Windows PCs)
Files
A FASTA file with a set of protein amino acid sequences. Typically the proteins will be from a single proteome, though this is not a requirement. Each definition line should have a unique identifier immediately following the > character, followed by either a new-line or at least one space character.
1. From the OrthoMCL-DB home page (http://www.orthomcl.org; see Fig. 2) click on the Tools link at the top right, and choose “Assign your proteins to groups.”
2. Provide the name of your FASTA file, or Browse to it.
3. Provide your e-mail address so you can get your results.
4. Optionally provide a job name. This is for you to distinguish different jobs you do.
5. Submit the job.
6. You will receive an e-mail within 5 min informing you that the job is on the queue.
7. You will receive an e-mail when the job is done, possibly many hours later, but less than 24 hr. Please retrieve your results within 48 hr of getting this e-mail.
8. Download your results. They will be packaged in a .zip file. Use a standard .zip unpacker.
Basic Protocol 2: Create Ortholog Groups from Your Proteomes Using the OrthoMCL Software
This protocol describes how to download, install, and run the OrthoMCL software. Use this software if you have at least two proteomes, and up to hundreds, and want to find ortholog groups. For details on the OrthoMCL algorithm, please read the OrthoMCL Algorithm Document (https://docs.google.com/View?id=dd996jxg_1gsqsp6).
The input to OrthoMCL is a set of proteomes. The output is a set of files:
pairs/orthologs.txt
pairs/coorthologs.txt
pairs/inparalogs.txt
groups.txt
The files in the pairs/ directory contain pairwise relationships between proteins and their scores. They are categorized into potential orthologs, co-orthologs, and in-paralogs as described in the OrthoMCL Algorithm Document (https://docs.google.com/View?id=dd996jxg_1gsqsp6). The groups.txt file contains the groups created by clustering the pairs with the MCL program.
How the software processes your proteins
In overview, there are four phases of processing by the software. The first phase is preparing your FASTA files (one per proteome). The second phase is running all-versus-all BLASTP (this document does not provide instructions for this; please see the NCBI BLAST documentation). The third phase is loading the BLASTP results into the relational database and running the OrthoMCL software to find significant pairs of proteins. The final phase is using the MCL software to cluster the pairs into groups.
The benchmark dataset
In this protocol we refer to a benchmark dataset. We have tested this set extensively. It had:
100 proteomes (across the tree of life, mostly eukaryotes)
1 million proteins
500 million significant similarities (BLAST hits).
The benchmark dataset took:
3 days to run all-versus-all BLAST on a 200 CPU compute cluster.
16 hr on a Linux server for the orthomclPairs processing to find pairs (using MySQL as the relational database)
2 hr on a Linux server for MCL to find the groups.
We base hardware requirements and time estimates on this benchmark dataset. The most significant predictor of resource/time requirements is the number of significant similarities. As this number changes, resource requirements will change nonlinearly.
Necessary Resources
Hardware
Computer resources to run an all-versus-all BLASTP on the proteins from your proteomes.
A file system with space sufficient to house the results of the all-versus-all-BLAST.
A database server to host an Oracle or MySQL database that will do significant relational database processing. The server's requirements vary dramatically with the size of your dataset. For the Benchmark Dataset, we recommend:
memory: at least 4 GB
disk: 100 GB free space. You can estimate your disk space needs more accurately when you have completed step 7 of this protocol. You will need at least 5 times the size of the BLAST results file produced in that step. 90% of the disk space required is to load that file into the database and index it in the database.
A Linux server to run the OrthoMCL software
Software
NCBI BLASTP. Note that the OrthoMCL software requires NCBI BLASTP output. Other BLASTP software versions do not have compatible output. The NCBI BLAST software is available at ftp://ftp.ncbi.nih.gov/blast/.
The Linux operating system. The orthomclPairs program has only been tested on Linux. The MCL program is Unix-compatible only.
The MCL software, available at http://www.micans.org/mcl/
Oracle or MySQL. The orthomclPairs program runs in a relational database. If you don't already have one available, install MySQL. You can do it for free and without significant systems administration support (OrthoMCL uses a relational database as its core technology for speed, robustness, and scalability that would have been very hard to achieve otherwise).
Perl 5.8, including DBI libraries
Files
A set of FASTA files, one file per proteome. These must conform to the format (described below) expected by the OrthoMCL software.
1. Follow the
2. Install and configure the relational database. If you are using Oracle, see the included oracleConfigurationGuide.txt. If you are using MySQL, see the included mysqlConfigurationGuide.txt. If you do not have either, see mysqlInstallationGuide.txt to install your own MySQL (see
3. Download the latest MCL software from http://www.micans.org/mcl/src/mcl-latest.tar.gz. Follow the install instructions. Also see http://www.micans.org/mcl/sec_description1.html.
4. Run the orthomclAdjustFasta on your proteome FASTA files to create a new set of FASTA files that are OrthoMCL compliant, and ready to use in step 5, below.
Input:
FASTA files as acquired from the genome resource.
Output:
the my_orthomcl_dir/compliantFasta/ directory of OrthoMCL-compliant FASTA files.
The FASTA-format specification requires that each sequence in a .fasta file be separated by a line beginning with a > character. This line is typically called the definition line and contains information about the sequence. Step 5 below expects your FASTA files to conform to the following specific requirements: (1) they must be in a compliantFasta/ directory which contains all of (and only) your proteome .fasta files, one file per proteome; (2) each .fasta file must have a name in the form xxxx.fasta, where xxxx is a three- or four-letter unique taxon code (e.g., hsa.fasta or eco.fasta).
Each protein in those files must have a definition line in the following format:
>xxxx|yyyyyyyy
where xxxx is the three- or four-letter taxon code and yyyyyyyy is a sequence identifier unique within that taxon.
Use orthomclAdjustFasta to convert your FASTA files to FASTA files that conform to those requirements. The FASTA files that are input to it must have definition lines (1) with one or more fields that are separated by white space or the | character (optionally surrounded by white space); and (2) that have the protein's unique ID in the same field for every protein.
To use orthomclAdjustFasta:
For any organism that has multiple protein FASTA files, combine them all into one single proteome FASTA file.
Create an empty my_orthomcl_dir/compliantFasta/ directory, and change to that directory.
Run orthomclAdjustFasta with no arguments to get its help. This will tell you about the arguments you need to provide.
Run orthomclAdjustFasta (with appropriate arguments) once for each input proteome FASTA file. It will produce a compliant file in the new directory.
Check each file to ensure that the proteins all have proper IDs.
Benchmark time: ≤ 1 min per genome.
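To make the target format concrete, the following Python sketch rewrites definition lines into the >xxxx|yyyyyyyy form described above. It is not the orthomclAdjustFasta program and does not reproduce its options; the function names, the 1-based id_field parameter, and the sample sequences are assumptions made for this illustration.

```python
import re

def adjust_definition_line(defline, taxon_code, id_field):
    """Rewrite one FASTA definition line as '>taxon|id'.

    Fields are separated by white space or the | character;
    id_field is the 1-based position of the protein's unique ID.
    """
    fields = [f for f in re.split(r"[\s|]+", defline.lstrip(">").strip()) if f]
    return f">{taxon_code}|{fields[id_field - 1]}"

def adjust_fasta(lines, taxon_code, id_field):
    """Yield FASTA lines with every definition line made compliant."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield adjust_definition_line(line, taxon_code, id_field)
        else:
            yield line  # sequence lines pass through unchanged

# Invented example records; the IDs are not real accessions.
example = [
    ">gi|111|ref|PROT_0001| hypothetical protein",
    "MKRISTTITT",
    ">gi|112|ref|PROT_0002| putative kinase",
    "MRVLKFGGTS",
]
adjusted = list(adjust_fasta(example, "eco", 4))
print("\n".join(adjusted))
```

Whatever tool you use, inspect the output as advised above: a wrong field number silently produces IDs such as eco|gi for every protein.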
5. Run the orthomclFilterFasta program to remove low-quality protein sequences from your FASTA files, and to combine your individual compliant proteome FASTA files into a single goodProteins.fasta file.
Input:
my_orthomcl_dir/compliantFasta/ (See Step 4)
optionally a gene->protein mapping file
Output:
my_orthomcl_dir/goodProteins.fasta
my_orthomcl_dir/poorProteins.fasta
a report of suspicious proteomes (> 10% poor proteins) on standard output.
This step produces a single goodProteins.fasta file to run BLAST on. It filters away poor-quality sequences (placing them in poorProteins.fasta). The filter is based on sequence length and percent stop codons. You can adjust these values.
To run orthomclFilterFasta:
Change to my_orthomcl_dir/.
Run orthomclFilterFasta to get help, which will show you the options.
Run orthomclFilterFasta with appropriate arguments. Unless you are a power user, use the suggested values.
The program will print the name of any file containing more than 10% rejected proteins, along with the rejected percentage. You should consider removing these from your set of proteomes.
Benchmark time: 5 min.
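The filtering idea in step 5 can be sketched as follows. This is a simplified illustration, not the orthomclFilterFasta implementation: the threshold values, the use of * as the stop symbol, and the function names are assumptions chosen for the example. Use the values suggested by the program's own help when you run the real tool.

```python
def is_good_protein(sequence, min_length=10, max_percent_stops=20.0):
    """Return True if the sequence passes the length and stop-codon filters.

    min_length and max_percent_stops are illustrative values only.
    """
    if not sequence or len(sequence) < min_length:
        return False
    percent_stops = 100.0 * sequence.count("*") / len(sequence)
    return percent_stops <= max_percent_stops

def filter_proteome(proteins, **thresholds):
    """Split {id: sequence} into good and poor proteins.

    Also returns the percentage of poor proteins, which can be compared
    with the 10% level that marks a proteome as suspicious.
    """
    good = {pid: seq for pid, seq in proteins.items()
            if is_good_protein(seq, **thresholds)}
    poor = {pid: seq for pid, seq in proteins.items() if pid not in good}
    percent_poor = 100.0 * len(poor) / len(proteins) if proteins else 0.0
    return good, poor, percent_poor

# Invented toy proteome.
proteins = {
    "eco|p1": "MKRISTTITTTITITTGNGAG",  # passes both filters
    "eco|p2": "MKR",                    # too short
    "eco|p3": "M*K*R*L*S*T*",           # 50% stop symbols
}
good, poor, percent_poor = filter_proteome(proteins)
print(sorted(good), sorted(poor), f"{percent_poor:.1f}% poor")
```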
6. Run all-versus-all BLASTP with goodProteins.fasta as the BLAST database and subject sequences.
Input:
goodProteins.fasta
Output:
your_blast_results_in_tab_format
This document does not provide assistance on the details of running BLASTP. For large datasets you should consider gaining access to a compute cluster. When you do so, you will need to: (1) use NCBI BLAST; (2) run with the -m 8 option to provide the tab-delimited output required by step 7.
Use these options:
-F 'm S' -v 10000 -b 10000 -z db_size -e 1e-5 -m 8
where: -F 'm S' signifies “mask with Seg”; -v 10000 is a “don't care” value; -b 10000 is a “don't care” value; -z db_size is the number of proteins in the set (see “Incrementally add a genome” below); and -e 1e-5 is the recommended e-value.
If you are a power user, you can deviate from this, so long as you can ultimately provide output in exactly the format provided by NCBI BLAST using the -m 8 option and expected by step 7.
If you are a super-power user, you can deviate from that, and also skip step 7. But you must be able to provide the exact format file created by that step as expected by step 8. The tricky part is computing percent match.
Benchmark time: 3 days
Incrementally add a genome: Once a large all-versus-all BLAST has been completed, you may need to “incrementally” add a new proteome, without re-running the large all-versus-all BLAST. To do so: (1) prepare the new proteome's FASTA file as you did for the previous ones (steps 4 and 5); (2) make a new BLAST database that includes all the previous proteins plus the new FASTA file; (3) use the -z argument of BLAST to simulate the size of the original database, so that the statistics and scoring are compatible with the original all-versus-all BLAST. Use the same -z value as was used in the original BLAST.
7. Use the orthomclBlastParser program to convert the tab-delimited file returned by BLASTP (using the -m 8 option) into a format ready for loading into the OrthoMCL schema in your relational database.
Input:
my_blast_results
my_orthomcl_dir/compliantFasta/
Output:
my_orthomcl_dir/similarSequences.txt
In addition to formatting, this program computes the percent match and percent identity of each hit:
Percent identity: Taken directly from the BLAST result file, from the best HSP per hit.
Percent match: (1) Select whichever is shorter, the query or subject sequence. Call that sequence S. (2) Count all amino acids in S that participate in any HSP. (3) Divide that count by the length of S and multiply by 100.
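The percent match calculation can be sketched in Python as follows. This is an illustration of the three steps just listed, not the orthomclBlastParser code: the function name and the HSP tuple layout are assumptions, and the choice of the query when both sequences have equal length is arbitrary. Overlapping HSPs are merged so that no amino acid is counted twice.

```python
def percent_match(query_len, subject_len, hsps):
    """Percent of the shorter sequence covered by at least one HSP.

    hsps is a list of (q_start, q_end, s_start, s_end) tuples with
    1-based, inclusive coordinates (an assumed layout for this sketch).
    """
    use_query = query_len <= subject_len  # tie: use the query (arbitrary)
    shorter_len = query_len if use_query else subject_len
    intervals = sorted(
        (h[0], h[1]) if use_query else (h[2], h[3]) for h in hsps
    )
    covered = 0
    last_end = 0  # last position already counted
    for start, end in intervals:
        start = max(start, last_end + 1)  # skip positions counted before
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / shorter_len

# Two overlapping HSPs on a 100-residue query (the shorter sequence):
# positions 1-40 and 31-60 together cover 60 residues.
print(percent_match(100, 200, [(1, 40, 10, 49), (31, 60, 100, 129)]))
```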
Use the orthomclBlastParser program as shown below to do the conversion:
% orthomclBlastParser my_blast_results my_orthomcl_dir/compliantFasta >> my_orthomcl_dir/similarSequences.txt
IMPORTANT NOTE: The size of this file determines the disk space required by the relational database. You will need five times the size of this file. Please see the oracleConfigurationGuide or mysqlConfigurationGuide now that you know the size of this file.
Benchmark time: 10 min.
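The disk space rule of thumb in the note above is simple arithmetic. The short Python sketch below applies it; the function name is an assumption, and the 2 GiB file size is an invented example, not a measurement.

```python
def estimate_database_disk_bytes(similar_sequences_bytes, factor=5):
    """Apply the five-times rule of thumb to the size of similarSequences.txt."""
    return factor * similar_sequences_bytes

# Invented example: a 2 GiB similarSequences.txt file.
example_bytes = estimate_database_disk_bytes(2 * 1024**3)
print(f"{example_bytes / 1024**3:.0f} GiB")
```

For a real run, you could pass os.path.getsize("my_orthomcl_dir/similarSequences.txt") instead of the example value.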
8. Use the orthomclLoadBlast program to load the similarSequences.txt file (made in step 7) into the relational database.
Input:
my_orthomcl_dir/similarSequences.txt
my_orthomcl_dir/orthomcl.config
Output:
SimilarSequences table in the database
Use the orthomclLoadBlast program for this.
Benchmark time: 4 hr.
9. Run the orthomclPairs program to find pairs of proteins that are potentially orthologs, in-paralogs or co-orthologs.
Input:
SimilarSequences table in the database
my_orthomcl_dir/orthomcl.config
Output:
Orthologs table
InParalogs table
CoOrthologs table
This is a computationally intensive step that finds protein pairs. It analyzes the BLAST results to find different types of reciprocal best hits (orthologs, co-orthologs, and in-paralogs). To do so, it executes the algorithm described in the OrthoMCL Algorithm Document (https://docs.google.com/View?id=dd996jxg_1gsqsp6) using a relational database. The program proceeds through a series of about 20 internal steps, each creating an intermediate database table or index. Finally, it populates the three output tables.
In total, the intermediary tables are expected to be about 50 percent of the size of the SimilarSequences table.
Run the orthomclPairs program with no arguments to see its help.
There are two options to orthomclPairs:
cleanup= allows you to control the cleaning up of the intermediary tables. Its values are:
yes: drops the intermediary tables once they are no longer needed.
no: keeps the intermediary tables in the database.
only: runs the program but does no work except cleaning up the intermediary tables.
all: the same as only, but also removes the three final output tables; use it only after step 10 (below) has dumped them to files.
startAfter= allows you to pick up where the program left off, if it stops for any reason. Look in the orthomclPairs log to find the last completed step, and use its tag as the value for startAfter=.
Because this program will run for many hours, we recommend you run it using the Unix screen program, so that it does not abort in the middle. If it does, use startAfter=.
Benchmark time: 16 hr.
10. Use the orthomclDumpPairsFiles program to create a set of result files from the results in the database made by orthomclPairs.
Input:
OrthoMCL database with populated pairs tables
my_orthomcl_dir/orthomcl.config
Output:
my_orthomcl_dir/pairs/
my_orthomcl_dir/mclInput
Run orthomclDumpPairsFiles with no arguments to get its help. Then, change directory to my_orthomcl_dir and run:
% orthomclDumpPairsFiles orthomcl.config
Output files:
File | Contents |
---|---|
pairs/orthologs.txt | ortholog relationships |
pairs/inparalogs.txt | in-paralog relationships |
pairs/coorthologs.txt | co-ortholog relationships |
mclInput | file required by the mcl program |
The pairs/ directory contains three files: orthologs.txt, coorthologs.txt, and inparalogs.txt. Each of these files describes pair relationships between proteins. They have three columns: (1) protein 1; (2) protein 2; (3) a normalized similarity score (See the OrthoMCL Algorithm Document for the normalization function).
These are candidate relationships (edges) that will subsequently be grouped (clustered) by the mcl program to form the OrthoMCL ortholog groups. These files contain more sensitive and less selective relationships than the final OrthoMCL groups. The OrthoMCL Algorithm Document (https://docs.google.com/View?id=dd996jxg_1gsqsp6) provides a formal definition of the relationships in these files. Here is a summary:
In-paralogs: all pairs of proteins within a species that have mutual hits that are better than or equal to all of those proteins' hits to proteins in other species.
Orthologs: all pairs of proteins across two species that have hits as good as or better than any other hits between these proteins and other proteins in those species.
Co-orthologs: all pairs of proteins across two species that can be connected by following ortholog and in-paralog pairs. For any ortholog pair, the in-paralogs of each will form co-ortholog pairs with each other.
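The ortholog definition in this summary can be illustrated with a small Python sketch that finds reciprocal best hits between species. It is a simplification for intuition only: orthomclPairs works in a relational database, applies the cutoffs and normalization described in the OrthoMCL Algorithm Document, and also finds in-paralogs and co-orthologs, none of which is reproduced here. The score dictionary, the "higher is better" convention, and the function names are assumptions of the example.

```python
from collections import defaultdict

def taxon(protein_id):
    """Taxon code of a compliant ID such as 'hsa|h1'."""
    return protein_id.split("|", 1)[0]

def ortholog_pairs(scores):
    """Find reciprocal best hits across species.

    scores maps (query, subject) to a similarity score (higher = better).
    A pair qualifies when each protein's hit to the other is at least as
    good as its best hit to any protein of the other's species.
    """
    best = defaultdict(float)  # (query, subject taxon) -> best score seen
    for (q, s), score in scores.items():
        if taxon(q) != taxon(s):
            key = (q, taxon(s))
            best[key] = max(best[key], score)
    pairs = set()
    for (q, s), score in scores.items():
        if taxon(q) == taxon(s) or (s, q) not in scores:
            continue  # same species, or hit is not mutual
        if score >= best[(q, taxon(s))] and scores[(s, q)] >= best[(s, taxon(q))]:
            pairs.add(tuple(sorted((q, s))))
    return pairs

# Invented toy scores between two taxa.
scores = {
    ("hsa|h1", "mmu|m1"): 200, ("mmu|m1", "hsa|h1"): 200,
    ("hsa|h1", "mmu|m2"): 80,  ("mmu|m2", "hsa|h1"): 80,
    ("hsa|h2", "mmu|m2"): 150, ("mmu|m2", "hsa|h2"): 150,
}
pairs = ortholog_pairs(scores)
print(sorted(pairs))
```

Here hsa|h1 and mmu|m2 hit each other, but neither is the other's best hit in that species, so they do not form an ortholog pair.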
The mclInput file contains the same information as the three pairs files, but merged into a single file and in the format expected by the mcl program.
Benchmark time: 5 min.
11. Use the mcl program to cluster the pairs found in step 10 into the final OrthoMCL ortholog groups.
Input:
my_orthomcl_dir/mclInput
Output:
my_orthomcl_dir/mclOutput
Use this command to run the mcl program:
% mcl my_orthomcl_dir/mclInput --abc -I 1.5 -o my_orthomcl_dir/mclOutput
Benchmark time: 3 hr.
12. Use the orthomclMclToGroups program to convert the file output by the mcl program into the final OrthoMCL-style groups file.
Input:
my_orthomcl_dir/mclOutput
Output:
my_orthomcl_dir/groups.txt
Change to my_orthomcl_dir/ and run:
orthomclMclToGroups my_prefix 1000 < mclOutput > groups.txt
where:
my_prefix is a string to use as a prefix for your group IDs.
1000 is an arbitrary starting point for your group IDs.
Benchmark time: 1 min.
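The role of the two arguments can be seen in the following Python sketch of the conversion. It assumes that each line of mclOutput lists the members of one cluster separated by white space, and that a group line has the form "ID: member member ...". Both formats are assumptions of this illustration, so check them against your own files; the function name is also invented.

```python
def mcl_to_groups(mcl_lines, prefix, start_id):
    """Turn one-cluster-per-line MCL output into labeled group lines."""
    groups = []
    group_id = start_id
    for line in mcl_lines:
        members = line.split()
        if not members:
            continue  # skip blank lines
        groups.append(f"{prefix}{group_id}: {' '.join(members)}")
        group_id += 1  # IDs are consecutive, starting at start_id
    return groups

# Invented clusters for illustration.
mcl_lines = ["hsa|h1\tmmu|m1\tdme|d1\n", "hsa|h2\tmmu|m2\n"]
groups = mcl_to_groups(mcl_lines, "my_prefix", 1000)
print("\n".join(groups))
```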
13. Use the orthomclSingletons program to find proteins that are singletons, i.e., are not included in any group.
Input:
my_orthomcl_dir/groups.txt
my_orthomcl_dir/goodProteins.fasta
Output:
my_orthomcl_dir/singletons.txt
Change directory to my_orthomcl_dir/ and run this command:
% orthomclSingletons goodProteins.fasta groups.txt >> singletons.txt
Benchmark time: 1 min.
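Conceptually, the singletons are the proteins in goodProteins.fasta whose IDs do not appear in any line of groups.txt. The Python sketch below shows that set difference; it is not the orthomclSingletons program, and it assumes group lines of the form "ID: member member ..." as well as an invented function name.

```python
def find_singletons(fasta_lines, group_lines):
    """Return IDs from the FASTA definition lines that are in no group."""
    grouped = set()
    for line in group_lines:
        # Everything after the first ':' is the list of group members.
        _, _, members = line.partition(":")
        grouped.update(members.split())
    singletons = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields and fields[0] not in grouped:
                singletons.append(fields[0])
    return singletons

# Invented toy inputs.
fasta_lines = [">hsa|h1", "MKV", ">hsa|h2", "MRT", ">mmu|m1", "MKL", ">dme|d9", "MAA"]
group_lines = ["my_prefix1000: hsa|h1 mmu|m1"]
singletons = find_singletons(fasta_lines, group_lines)
print(singletons)
```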
Support Protocol: Downloading, Installing, and Configuring the OrthoMCL Programs
Use this support protocol to download, install and configure the OrthoMCL software.
1. Download the OrthoMCL software from http://orthomcl.org/common/downloads/software/v2.0/, and follow these instructions to install it.
Input:
orthomclSoftware.tar
Output:
a directory of executable programs
a home directory for your run of OrthoMCL
the orthomcl.config file.
2. Unpack the software using the following command:
tar -xf orthomclSoftware.tar
The result will look like this:
orthomclSoftware/
bin/
...
doc/
mysqlConfigurationGuide
mysqlInstallGuide
oracleConfigurationGuide
UserGuide.txt
orthomcl.config.template
3. Set the PATH as follows. The orthomclSoftware/bin/ directory has a set of programs. To run the programs you will need to either:
include the orthomclSoftware/bin directory in your PATH. If you don't know how to do this, ask your system administrator.
call the programs using their full directory path.
4. The orthomcl.config file provides the OrthoMCL software with its configuration information. To edit the orthomcl.config file to set various values, start by creating a new directory to hold the configuration file (as well as the data and results for your run of OrthoMCL). In this document, we will call that directory my_orthomcl_dir.
5. In the directory orthomclSoftware/doc/Main/OrthoMCLEngine there is a file called orthomcl.config.template. Copy that file to my_orthomcl_dir/orthomcl.config.
6. If you are using a MySQL database, in the property examples below it is assumed that the MySQL server has a database called orthomcl. To accomplish this, either:
create such a database by going into the server and running create database orthomcl.
use an existing database, and in the orthomcl.config file change the dbConnectString= property (described below) accordingly.
7. Now you can edit the file, setting property values. Listed below are the properties you need to set (and suggested values when appropriate):
dbVendor=
either oracle or mysql
used by orthomclInstallSchema, orthomclLoadBlast, orthomclPairs
dbConnectString=
the string required by Perl DBI to find the database. Examples are:
dbi:Oracle:orthomcl, for an oracle database with service name orthomcl
dbi:mysql:orthomcl, for a centrally installed MySQL server with a database called orthomcl
dbi:mysql:orthomcl:localhost:3307, for a user installed MySQL server on port 3307 with a database called orthomcl
used by orthomclInstallSchema, orthomclLoadBlast, orthomclPairs, orthomclDumpPairsFiles
dbLogin=
your database login name
used by orthomclInstallSchema, orthomclLoadBlast, orthomclPairs, orthomclDumpPairsFiles
dbPassword=
your database password
used by orthomclInstallSchema, orthomclLoadBlast, orthomclPairs, orthomclDumpPairsFiles
similarSequencesTable=SimilarSequences
the name to give the table that will be loaded with BLAST results by orthomclLoadBlast. This is configurable for your flexibility. It doesn't matter what you call it.
used by orthomclInstallSchema, orthomclLoadBlast, orthomclPairs
orthologTable=Ortholog
the name of the table that will hold potential ortholog pairs. This is configurable so that you can run orthomclPairs multiple times, and compare results.
used by orthomclInstallSchema, orthomclPairs, orthomclDumpPairsFiles
inParalogTable=InParalog
the name of the table that will hold potential in-paralog pairs. This is configurable so that you can run orthomclPairs multiple times, and compare results.
used by orthomclInstallSchema, orthomclPairs, orthomclDumpPairsFiles
coOrthologTable=CoOrtholog
the name of the table that will hold potential co-ortholog pairs. This is configurable so that you can run orthomclPairs multiple times, and compare results.
used by orthomclInstallSchema, orthomclPairs, orthomclDumpPairsFiles
interTaxonMatchView=InterTaxonMatch
percentMatchCutoff=50
BLAST similarities with percent match less than this value are ignored.
used by orthomclPairs
See Commentary, Basic Protocol 2, for more details.
evalueExponentCutoff=-5
BLAST similarities with e-value exponents greater than this value are ignored.
used by orthomclPairs
See Commentary, Basic Protocol 2, for more details.
oracleIndexTblSpc=
optional table space to house all oracle indexes, if required by your oracle server. Default is blank.
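Before running the programs, it can help to confirm that every property listed above has a value. The Python sketch below parses key=value lines and reports the ones that are missing. It is a convenience check written for this illustration, not part of the OrthoMCL software: the function names, the treatment of # as a comment marker, and the placeholder login and password are assumptions.

```python
REQUIRED = (
    "dbVendor", "dbConnectString", "dbLogin", "dbPassword",
    "similarSequencesTable", "orthologTable", "inParalogTable",
    "coOrthologTable", "interTaxonMatchView",
    "percentMatchCutoff", "evalueExponentCutoff",
)

def parse_config(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split at the first '=' only
        props[key.strip()] = value.strip()
    return props

def missing_properties(props):
    """List required properties that are absent or empty."""
    return [key for key in REQUIRED if not props.get(key)]

# Example file content; the login and password are placeholders.
sample = """\
dbVendor=oracle
dbConnectString=dbi:Oracle:orthomcl
dbLogin=my_login
dbPassword=my_password
similarSequencesTable=SimilarSequences
orthologTable=Ortholog
inParalogTable=InParalog
coOrthologTable=CoOrtholog
interTaxonMatchView=InterTaxonMatch
percentMatchCutoff=50
evalueExponentCutoff=-5
oracleIndexTblSpc=
"""
props = parse_config(sample)
print(missing_properties(props))
```

The optional oracleIndexTblSpc property is deliberately left out of the required list because its default is blank.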
8. If you are using Oracle, see the included oracleConfigurationGuide.txt. If you are using MySQL, see the included mysqlConfigurationGuide.txt. If you do not have either, see the mysqlInstallationGuide.txt to install your own MySQL (see step 2).
9. Get the latest MCL software from http://www.micans.org/mcl/src/mcl-latest.tar.gz. Follow the install instructions. Also see http://www.micans.org/mcl/sec_description1.html.
10. Run the orthomclInstallSchema program to install the OrthoMCL schema into your relational database.
Input:
a relational database
my_orthomcl_dir/orthomcl.config
Output:
database with schema installed
Run the orthomclInstallSchema program to install the schema. Run the program with no arguments to get help. This is true of all following OrthoMCL programs.
Benchmark time: < 1 min.
Guidelines for Understanding Results
Basic Protocol 1: Assign your proteome to OrthoMCL-DB groups
Your result will contain three files:
orthologGroups. This file is a mapping between your proteins and OrthoMCL-DB groups. It is tab-delimited text. (See Fig. 3 for a sample.) The columns are:
Your protein ID
The ID of the OrthoMCL-DB group it is provisionally mapped to
The ID of the protein sequence in OrthoMCL-DB that is its best hit.
The e-value mantissa of that hit
The e-value exponent of that hit
The percent identity of that hit
The percent match of that hit
paralogPairs. This file contains reciprocal best hits (from an all-versus-all BLASTP of your proteins) among those proteins in your genome that were not mapped to OrthoMCL groups. The scores are normalized (see the OrthoMCL Algorithm Document).
paralogGroups. This file is a grouping of your proteins into provisional in-paralog groups. Only proteins that appear in the paralogPairs file are considered. Those pairs are passed to the MCL program, and this file is the result. Each row represents a provisional in-paralog group.
The orthologGroups and paralogGroups files represent approximate assignments to OrthoMCL-DB groups because the full OrthoMCL algorithm has not been run. Group assignments are made based only on the best BLAST hit. In most cases, particularly if the hit characteristics are good, this assignment is identical to what would be found if the full OrthoMCL algorithm were run. However, for weaker hits, the assignment is less reliable. A small percentage (<10%) of assignments might differ if the full algorithm were run.
Basic Protocol 2: Create ortholog groups from your proteomes using the OrthoMCL software
The primary results file for
Clustering will be affected by separation in the (similarity) distances between proteins. For example, if you include only proteomes from evolutionarily distant organisms, distantly related proteins might end up in the same ortholog group. As you add more proteomes, more distantly related proteins would be expected to be split up among multiple groups. To a certain degree, this phenomenon is driven by the presence of multiple domains in proteins. Ortholog groups created based on sharing one protein domain may be split into groups sharing additional domains as more closely related proteomes are added.
In-paralogs in proteomes that are not closely related may be put into separate groups if there is a closely related proteome that also has that set of paralogs. For example, if the proteome set used is heavily biased phylogenetically, e.g., containing many bacteria and few eukaryotes, and if the chosen eukaryotes are distantly related, then many of the eukaryotic genes in the proteome set may remain ungrouped (singletons) or be grouped as in-paralogs. This is well illustrated in OrthoMCL-DB Version 4. It includes only one ciliate—Tetrahymena thermophila. When another ciliate proteome, Paramecium tetraurelia, is grouped (using
Commentary
Background Information
Basic Protocols 1 and 2 use the same fundamental approach, which is the OrthoMCL algorithm for finding protein pairs. In
Critical Parameters and Troubleshooting
Basic Protocol 1
A troubleshooting guide for
Problem | Possible cause | Solution |
---|---|---|
Receiving mail that says your job failed | You submitted an invalid FASTA file | Read the FASTA format description at http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml. Submit a valid FASTA file. |
Receiving mail that says your job failed | You submitted a FASTA file with non-unique IDs per sequence | Give each sequence a unique ID, and resubmit. |
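Non-unique IDs can be caught before uploading. The sketch below is an illustration, not a tool provided by OrthoMCL-DB. It assumes the ID is the first whitespace-delimited token after the ">" on each definition line; the function names are hypothetical.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the ID (first token after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # An empty definition line yields an empty ID.
            yield header.split()[0] if header else ""


def duplicate_ids(lines):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(fasta_ids(lines))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Made-up sequences, for illustration only.
example = """>prot1 hypothetical protein
MKTAYIAK
>prot2
MSLLTEVE
>prot1 another copy
MKTAYIAK
""".splitlines()

print(duplicate_ids(example))  # prints ['prot1']
```

To check a real file, pass an open file handle instead of the example list, for instance `duplicate_ids(open("proteome.fasta"))`, where the file name is a placeholder.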
Basic Protocol 2
Two orthomclPairs parameters (see Steps 1 and 9) affect how distantly related the proteins included in a group can be. The first is evalueExponentCutoff. The recommended value is −5, which means that all pairs found by the program will have a BLAST e-value of 1e-5 or less (in both directions). This may seem inclusive, but there are other stringent requirements for pairs (see the OrthoMCL Algorithm Document). Also, the MCL program recognizes the difference between weak and strong pairs, and avoids combining groups unnecessarily. The main effect of this permissive threshold is to avoid early elimination of distant relationships. The second parameter is percentMatchCutoff, with a recommended setting of 50. This is not the same as percent identity. For two proteins to form a pair, the number of amino acids from the shorter protein that are included in any of the pair's HSPs must be at least 50% of the length of that protein. This ensures that the majority of the shorter protein is involved in the match, increasing the likelihood that the two proteins share domains. You might consider altering either or both of these parameters to compare the resulting groups, but our experience is that these settings work well.
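The two cutoffs can be illustrated with a short sketch. This is an assumption-laden illustration of the description above, not the orthomclPairs implementation: HSPs are given as 1-based inclusive (start, end) coordinates on the shorter protein, overlapping HSPs are counted once, and all function names are hypothetical.

```python
def covered_length(intervals):
    """Count residues covered by the union of inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the running interval: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the running interval if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def percent_match(hsps_on_shorter, shorter_length):
    """Percentage of the shorter protein that lies inside any HSP."""
    return 100.0 * covered_length(hsps_on_shorter) / shorter_length


def passes_cutoffs(exp_a_vs_b, exp_b_vs_a, hsps_on_shorter, shorter_length,
                   evalue_exponent_cutoff=-5, percent_match_cutoff=50):
    """Apply both cutoffs; the e-value test must hold in both directions."""
    evalue_ok = max(exp_a_vs_b, exp_b_vs_a) <= evalue_exponent_cutoff
    match_ok = percent_match(hsps_on_shorter, shorter_length) >= percent_match_cutoff
    return evalue_ok and match_ok


# A 200-residue protein with two overlapping HSPs covers 80 residues (40%).
print(percent_match([(1, 50), (40, 80)], 200))
print(passes_cutoffs(-30, -28, [(1, 50), (40, 80)], 200))
# Adding a third HSP raises coverage above 50%.
print(passes_cutoffs(-30, -28, [(1, 50), (40, 80), (120, 160)], 200))
```

Merging overlapping HSPs before counting keeps a region that is hit twice from inflating the percent match.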
Advanced Parameters
Basic Protocol 2
In step 11, you specify a -I argument to the mcl program. This is the inflation parameter for the Markov clustering algorithm, and it ranges from 1 to 4. Lower inflation values result in the inclusion of more sequences in fewer groups. The impact of inflation choice is discussed in the original OrthoMCL paper (Li et al., 2003). Tighter clustering (higher inflation) tends to prevent sequences with different functions from being clustered together, but may also separate true orthologs. OrthoMCL uses an inflation of ∼1.5 to balance sensitivity and selectivity, based on the grouping of enzymes and their E.C. numbers.
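To build intuition for the inflation value, it helps to look at the inflation step in isolation. In Markov clustering, each column of the transition matrix is a probability distribution over neighbors; inflation raises every entry to the power r and renormalizes the column. The sketch below shows only that single step on a made-up column. It is not the mcl program, which alternates inflation with matrix expansion until the matrix converges.

```python
def inflate(column, r):
    """Raise each probability to the power r, then renormalize to sum to 1."""
    powered = [p ** r for p in column]
    total = sum(powered)
    return [p / total for p in powered]


# Made-up transition probabilities from one protein to three neighbors.
column = [0.6, 0.3, 0.1]
for r in (1.0, 1.5, 4.0):
    print(r, [round(p, 3) for p in inflate(column, r)])
```

With r = 1 the column is unchanged. As r grows, the strongest link takes a larger share and weak links fade, which is why higher inflation values yield more, smaller groups.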
Suggestions for Further Analysis
Basic Protocol 1
Your proteins are now mapped to OrthoMCL-DB groups. Each group has its own page at the OrthoMCL-DB Web site. To find out more about a particular protein, go to the OrthoMCL-DB site and review the page for that protein's group.
Ortholog grouping can provide information regarding the evolutionary origin and functional conservation of proteins. Using the phyletic pattern searches (under the Search menu) in OrthoMCL-DB, it is possible to examine the phyletic profile of the ortholog groups that have been mapped to the proteins or proteomes of interest. The easiest way to do this is to upload the text file containing the list of ortholog groups (obtained using Basic Protocols 1 and 2) with the upload tool in the history page for group searches. Once the uploaded ortholog groups are retrieved as a query result in the history page, they can be combined with results from various phyletic pattern searches. This approach can be useful in identifying genes acquired by horizontal gene transfer in species of interest. For example, when a protein from a species of interest has orthologs in distantly related organisms but not in more closely related species, this suggests that the gene was acquired through horizontal gene transfer rather than by vertical inheritance as a result of speciation. This sort of information could be useful for identifying druggable targets in parasitic organisms.
Functional annotation based on ortholog grouping is enhanced when information regarding Pfam (UNIT Unavailable; Bateman et al., 2004) domains and Enzyme Commission (E.C.) numbers (Webb and International Union of Biochemistry and Molecular Biology, 1992) is mapped to individual groups. In OrthoMCL-DB Version 4, a text file providing the mapping of all Pfam domains for all ortholog groups is available on the download site (accessed from the Data menu). This file also gives the frequency of occurrence of any given domain (a value of 1 indicates that all proteins in that group have that domain), which can be used to guide functional annotation of new proteomes. Although there are no data mapping E.C. numbers to groups in OrthoMCL-DB Version 4, a mapping strategy similar to that used for Pfam domains can be implemented.
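The same frequency measure can be computed for your own groups, for Pfam domains or for E.C. numbers. The sketch below is an illustration under stated assumptions: it does not parse the OrthoMCL-DB download file, whose layout is not described here, and it takes a group's member list and a per-protein annotation mapping as plain Python structures. The function name and example data are hypothetical.

```python
from collections import Counter


def domain_frequencies(group_members, protein_domains):
    """Fraction of proteins in a group that carry each domain.

    A value of 1.0 means every protein in the group has that domain.
    """
    n = len(group_members)
    if n == 0:
        return {}
    counts = Counter()
    for protein in group_members:
        # Use a set so a domain repeated within one protein is counted once.
        counts.update(set(protein_domains.get(protein, ())))
    return {domain: count / n for domain, count in counts.items()}


# Made-up group and domain assignments, for illustration only.
members = ["p1", "p2", "p3", "p4"]
domains = {
    "p1": ["PF00069"],
    "p2": ["PF00069", "PF07714"],
    "p3": ["PF00069", "PF00069"],
    "p4": ["PF00069", "PF07714"],
}
freqs = domain_frequencies(members, domains)
print(freqs)
```

Domains with a frequency near 1 are the strongest candidates for describing a new protein that maps to the group.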
Basic Protocol 2
For a particular group of interest, create a FASTA file with that group's protein sequences. Submit the FASTA file to a multiple sequence alignment program or to Pfam analysis (see Internet Resources, below, and UNIT Unavailable). To visualize the connectedness of the group, download and install the Biolayout program (see Internet Resources, below). To get a sense of what these tools will do, go to the OrthoMCL-DB Web site and select a group of interest. These tools will be available on that group's page.
You can also use OrthoMCL groups to assign functional annotation to un-annotated proteins. If some of your proteomes have good functional assignments, you can use those assignments to assess the consistency of groups. When a group has consistent annotations, you can assign those functions to other members of the group with confidence. E.C. number (Li et al., 2003; Chen et al., 2006) and Gene Ontology (GO; see UNIT Unavailable; also see The Gene Ontology Consortium, 2000) function assignments are robust for this sort of analysis since these ontologies are directed acyclic graphs with less specific assignments higher up in the graph.
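The consistency check described above can be sketched as follows. This is a hypothetical illustration, not a feature of the OrthoMCL software: it compares annotations by exact match, and the thresholds (`min_fraction`, `min_annotated`) and function names are assumptions chosen for the example. A fuller version could fall back to a shared, less specific ancestor term when exact labels disagree.

```python
from collections import Counter


def consensus_annotation(members, annotations, min_fraction=0.9, min_annotated=2):
    """Return the dominant annotation of a group, or None if it is inconsistent."""
    labels = [annotations[m] for m in members if m in annotations]
    if len(labels) < min_annotated:
        return None  # too little evidence to judge consistency
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_fraction else None


def transfer(members, annotations, **thresholds):
    """Propose the consensus annotation for un-annotated members only."""
    label = consensus_annotation(members, annotations, **thresholds)
    if label is None:
        return {}
    return {m: label for m in members if m not in annotations}


# Made-up groups and E.C. numbers, for illustration only.
known = {"a": "2.7.11.1", "b": "2.7.11.1", "c": "2.7.11.1",
         "x": "2.7.11.1", "y": "3.1.3.16"}
consistent = transfer(["a", "b", "c", "d"], known)
mixed = transfer(["x", "y", "z"], known)
print(consistent)  # prints {'d': '2.7.11.1'}
print(mixed)       # prints {}
```

Existing annotations are never overwritten; only members without an annotation receive a proposal.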
Acknowledgments
This work was supported by the National Institute of Allergy and Infectious Diseases at the National Institutes of Health [Award NO1-AI90038C Contract No. HHSN27220090038C to D.S.R., C.J.S. and J.C.K.]. The authors thank Deborah F. Pinney for her contributions in populating the OrthoMCL-DB database used in