Reference proteomes - Primary proteome sets for the Quest For Orthologs

RELEASE 2023_03 based on UniProt Release 2023_03, Ensembl release 109 and Ensembl Genomes release 56

Introduction

The Reference Proteomes group provides complete, non-redundant proteome sets for species chosen by the “Quest for Orthologs” group. This release comprises 79 species. The sets are publicly available and are generated using UniProtKB, Ensembl and Ensembl Genomes.

The one-gene-one-protein proteome sets are compiled from complete genomes submitted to INSDC, with gene model annotations from:

  1. genome submitters
  2. Ensembl or Ensembl Genomes

Download

The gene2acc, fasta and idmapping files for individual species are available for download here:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO

or as a tarball of all species:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/QfO_release_2023_03.tar.gz
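
A minimal sketch of fetching the all-species tarball with the Python standard library. The helper names are hypothetical, and the assumption that other releases follow the same QfO_release_<release>.tar.gz naming is not confirmed by this README.

```python
import shutil
import urllib.request

BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO"


def release_tarball_url(release: str) -> str:
    """Build the tarball URL for a release label such as '2023_03'.

    Assumption: every release follows the naming of the 2023_03 tarball.
    """
    return f"{BASE_URL}/QfO_release_{release}.tar.gz"


def download(url: str, destination: str) -> None:
    """Stream a remote file to disk without loading it all into memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, download(release_tarball_url("2023_03"), "QfO_release_2023_03.tar.gz") would save the tarball in the current directory.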

SeqXML versions are documented by our partners and they are available here:
https://www.seqxml.org/xml/Reference_proteomes.html

Predicted Orthologs

InParanoid

OMA

Orthoinspector

Current Composition of Primary Protein Sets

The following table describes the status of the species:

Proteome ID  Taxonomy ID  Mnemonic  Species  Number of Genes/Proteins
UP000007062 7165 ANOGA Anopheles gambiae 13016
UP000000798 224324 AQUAE Aquifex aeolicus 1553
UP000006548 3702 ARATH Arabidopsis thaliana 27481
UP000001570 224308 BACSU Bacillus subtilis 4260
UP000001414 226186 BACTN Bacteroides thetaiotaomicron 4782
UP000007241 684364 BATDJ Batrachochytrium dendrobatidis 8610
UP000009136 9913 BOVIN Bos taurus 23841
UP000002526 224911 BRADU Bradyrhizobium diazoefficiens 8253
UP000001554 7739 BRAFL Branchiostoma floridae 26627
UP000001940 6239 CAEEL Caenorhabditis elegans 19827
UP000000559 237561 CANAL Candida albicans 6035
UP000805418 9615 CANLF Canis lupus 20972
UP000000431 272561 CHLTR Chlamydia trachomatis 895
UP000006906 3055 CHLRE Chlamydomonas reinhardtii 17614
UP000002008 324602 CHLAA Chloroflexus aurantiacus 3850
UP000008144 7719 CIOIN Ciona intestinalis 16680
UP000002149 214684 CRYNJ Cryptococcus neoformans 6604
UP000000437 7955 DANRE Danio rerio 26249
UP000002524 243230 DEIRA Deinococcus radiodurans 3084
UP000007719 515635 DICTD Dictyoglomus turgidum 1743
UP000002195 44689 DICDI Dictyostelium discoideum 12726
UP000000803 7227 DROME Drosophila melanogaster 13821
UP000000625 83333 ECOLI Escherichia coli 4403
UP000002521 190304 FUSNN Fusobacterium nucleatum 2046
UP000000539 9031 CHICK Gallus gallus 18369
UP000000577 243231 GEOSL Geobacter sulfurreducens 3402
UP000001548 184922 GIAIC Giardia intestinalis 4900
UP000000557 251221 GLOVI Gloeobacter violaceus 4406
UP000001519 9595 GORGO Gorilla gorilla 21783
UP000000554 64091 HALSA Halobacterium salinarum 2423
UP000000429 85962 HELPY Helicobacter pylori 1554
UP000015101 6412 HELRO Helobdella robusta 23328
UP000005640 9606 HUMAN Homo sapiens 20586
UP000001555 6945 IXOSC Ixodes scapularis 20496
UP000001686 374847 KORCO Korarchaeum cryptofilum 1602
UP000000542 5664 LEIMA Leishmania major 8038
UP000018468 7918 LEPOC Lepisosteus oculatus 18321
UP000001408 189518 LEPIN Leptospira interrogans 3676
UP000000805 243232 METJA Methanocaldococcus jannaschii 1787
UP000002487 188937 METAC Methanosarcina acetivorans 4468
UP000002280 13616 MONDO Monodelphis domestica 21223
UP000001357 81824 MONBE Monosiga brevicollis 9188
UP000000589 10090 MOUSE Mus musculus 21957
UP000001584 83332 MYCTU Mycobacterium tuberculosis 3995
UP000000807 243273 MYCGE Mycoplasma genitalium 483
UP000000425 122586 NEIMB Neisseria meningitidis 2001
UP000001593 45351 NEMVE Nematostella vectensis 24427
UP000002530 330879 ASPFU Neosartorya fumigata 9647
UP000001805 367110 NEUCR Neurospora crassa 9759
UP000000792 436308 NITMS Nitrosopumilus maritimus 1795
UP000059680 39947 ORYSJ Oryza sativa 43672
UP000001038 8090 ORYLA Oryzias latipes 23617
UP000002277 9598 PANTR Pan troglodytes 23051
UP000000600 5888 PARTE Paramecium tetraurelia 39461
UP000001055 321614 PHANO Phaeosphaeria nodorum 15998
UP000006727 3218 PHYPA Physcomitrella patens 31359
UP000005238 164328 PHYRM Phytophthora ramorum 15349
UP000001450 36329 PLAF7 Plasmodium falciparum 5372
UP000002438 208964 PSEAE Pseudomonas aeruginosa 5564
UP000008783 418459 PUCGT Puccinia graminis 15688
UP000002494 10116 RAT Rattus norvegicus 22870
UP000001025 243090 RHOBA Rhodopirellula baltica 7271
UP000002311 559292 YEAST Saccharomyces cerevisiae 6060
UP000002485 284812 SCHPO Schizosaccharomyces pombe 5122
UP000001312 665079 SCLS1 Sclerotinia sclerotiorum 14445
UP000001973 100226 STRCO Streptomyces coelicolor 8035
UP000001974 273057 SACS2 Sulfolobus solfataricus 2937
UP000001425 1111708 SYNY3 Synechocystis sp. 3507
UP000001449 35128 THAPS Thalassiosira pseudonana 11717
UP000000536 69014 THEKO Thermococcus kodakarensis 2301
UP000000718 289376 THEYD Thermodesulfovibrio yellowstonii 1982
UP000008183 243274 THEMA Thermotoga maritima 1852
UP000007266 7070 TRICA Tribolium castaneum 16568
UP000001542 412133 TRIV3 Trichomonas vaginalis 50190
UP000000561 237631 USTMA Ustilago maydis 6788
UP000186698 8355 XENLA Xenopus laevis 35860
UP000008143 8364 XENTR Xenopus tropicalis 22229
UP000001300 284591 YARLI Yarrowia lipolytica 6449
UP000007305 4577 MAIZE Zea mays 39225
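
The rows above can be read programmatically. The following is a sketch only: the field names are hypothetical labels for the five values in each row, and the file stem (proteome identifier, underscore, taxonomy identifier) is inferred from the example file names in this document, such as UP000005640_9606_DNA.fasta.

```python
from typing import NamedTuple


class ProteomeRow(NamedTuple):
    proteome_id: str   # UniProt proteome identifier, e.g. UP000005640
    taxonomy_id: int   # NCBI taxonomy identifier, e.g. 9606
    mnemonic: str      # UniProtKB species code, e.g. HUMAN
    species: str       # scientific name; may contain spaces
    count: int         # number of genes/proteins


def parse_row(line: str) -> ProteomeRow:
    """Split one table row; the species name is everything between
    the third field and the final count."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"unexpected row: {line!r}")
    return ProteomeRow(fields[0], int(fields[1]), fields[2],
                       " ".join(fields[3:-1]), int(fields[-1]))


def file_stem(row: ProteomeRow) -> str:
    """Stem assumed to be shared by the per-species files."""
    return f"{row.proteome_id}_{row.taxonomy_id}"
```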

Gene mapping files (*.gene2acc)

Column 1 is a unique gene symbol that is chosen with the following order of preference from the annotation found in:

  1. Model Organism Database (MOD)
  2. Ensembl or Ensembl Genomes database
  3. UniProt Ordered Locus Name (OLN)
  4. UniProt Open Reading Frame (ORF)
  5. UniProt Gene Name

A dash symbol (-) is used when the gene encoding a protein is unknown.

Column 2 is the UniProtKB accession or isoform identifier for the given gene symbol. This column may have redundancy when two or more genes have identical translations.

Column 3 is the gene symbol of the canonical accession used to represent the respective gene group. The first row of each gene group is the canonical one.
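
As an illustration of the three columns, the sketch below groups the rows of a gene2acc file by the canonical gene symbol in column 3. It assumes the columns are tab-separated, which this README does not state explicitly, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_gene2acc(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each canonical gene symbol (column 3) to the list of
    (gene symbol, accession) pairs from columns 1 and 2.

    Assumption: columns are separated by tabs.
    """
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene, accession, canonical_gene = fields[0], fields[1], fields[2]
        groups[canonical_gene].append((gene, accession))
    return dict(groups)
```

Input order is preserved within each group, so the canonical row stays first if it comes first in the file.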

Protein FASTA files (*.fasta and *_additional.fasta)

These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format.

For further information on the standard UniProtKB format, please see:

https://www.uniprot.org/help/fasta-headers
https://www.uniprot.org/help/retrieve_sets

E.g. Canonical set:

    >sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4
    MEPRTGGAANPKGSRGSRGPSPLAGPSARQLLARLDARPLAARAAVDVAALVRRAGATLR
    LRRKEAVSVLDSADIEVTDSRLPHATIVDHRPQHRWLETCNAPPQLIQGKAHSAPKPSQA
    SGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHINGE
    STQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGGP
    EVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGSP
    GPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH

 

E.g. Additional sets:

    >sp|Q9H6Y5-2|MAGIX-2_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4
    MPLLWITGPRYHLILLSEASCLRANYVHLCPLFQHRWLETCNAPPQLIQGKAHSAPKPSQ
    ASGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHING
    ESTQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGG
    PEVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGS
    PGPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH

    >tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing protein MAGIX (Fragment) OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=2
    MSPNSPLHCFYLPAVSVLDSADIEVTDSRLPHATIVDHRPQVGDLVLHINGESTQGLTHA
    QAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVDRSPDPGGPEVTGSRSSSTSLV
    QHPPSRTTLKKTRGSPEPSPEAA
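
The headers shown above can be split into their parts with a regular expression. This is a sketch based only on the three example headers in this section. The pattern and field names are hypothetical, and headers that deviate from the examples (for instance, ones without an OX field) will be rejected.

```python
import re
from typing import Dict, Optional

HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxonomy_id>\d+)"
    r"(?:\s+GN=(?P<gene>.*?))?(?:\s+PE=(?P<pe>\d))?(?:\s+SV=(?P<sv>\d+))?\s*$"
)
# Prefix used in the additional sets to point at the canonical sequence.
ISOFORM_RE = re.compile(r"^Isoform of (?P<canonical>[^,\s]+),\s*")


def parse_header(header: str) -> Dict[str, Optional[str]]:
    """Return the header fields plus 'isoform_of', which is the canonical
    accession for additional sequences and None for canonical ones."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised FASTA header: {header!r}")
    fields: Dict[str, Optional[str]] = match.groupdict()
    description = fields["description"] or ""
    isoform = ISOFORM_RE.match(description)
    fields["isoform_of"] = isoform.group("canonical") if isoform else None
    if isoform:
        # Strip the "Isoform of X, " prefix from the description.
        fields["description"] = description[isoform.end():]
    return fields
```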

 

Coding DNA Sequence FASTA files (*_DNA.fasta)

These files contain the coding DNA sequences (CDS) for those protein sequences that could be mapped to a CDS. The format is as in the following example (UP000005640_9606_DNA.fasta):

    >sp|A0A183|ENSP00000411070
    ATGTCACAGCAGAAGCAGCAATCTTGGAAGCCTCCAAATGTTCCCAAATGCTCCCCTCCC
    CAAAGATCAAACCCCTGCCTAGCTCCCTACTCGACTCCTTGTGGTGCTCCCCATTCAGAA
    GGTTGTCATTCCAGTTCCCAAAGGCCTGAGGTTCAGAAGCCTAGGAGGGCTCGTCAAAAG
    CTGCGCTGCCTAAGTAGGGGCACAACCTACCACTGCAAAGAGGAAGAGTGTGAAGGCGAC
    TGA

 

The 3 fields of the FASTA header are:

  1. sp (Swiss-Prot reviewed) or tr (TrEMBL)
  2. UniProtKB Accession
  3. EMBL Protein ID or Ensembl/Ensembl Genome ID
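
A sketch of reading these records and splitting the header into the three fields listed above. The function and key names are hypothetical. The completeness check is only a heuristic that assumes the standard genetic code, so it may not suit species that use a different translation table.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def read_cds_fasta(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (header fields, sequence) for each record."""
    header: Optional[Dict[str, str]] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Field 1: sp or tr; field 2: accession; field 3: DNA source ID.
            db, accession, dna_id = line[1:].split("|", 2)
            header = {"db": db, "accession": accession, "dna_id": dna_id}
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def looks_like_complete_cds(sequence: str) -> bool:
    """Heuristic: length is a multiple of three and the last codon is a
    stop codon of the standard genetic code."""
    return len(sequence) % 3 == 0 and sequence[-3:].upper() in STOP_CODONS
```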

Unsuccessful Coding DNA Sequence mapping files (*_DNA.miss)

For the species that did not have a perfect mapping for all protein sequences to a CDS, these files contain the entries that could not be mapped. The format is as in the following example (UP000005640_9606_DNA.miss):

    sp A6NF01 CAUTION: Could be the product of a pseudogene
    sp A4QN01 NOT_ANNOTATED_CDS

 

The 3 fields are:

  1. "sp" (Swiss-Prot reviewed) or "tr" (TrEMBL)
  2. UniProtKB accession
  3. Reason why the protein could not be mapped to a CDS
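
A small sketch of parsing one line of a *_DNA.miss file into the three fields above. It splits on whitespace at most twice, so the free-text reason keeps its internal spaces. Whether the real files use spaces or tabs between fields is not stated here, and the names are hypothetical.

```python
from typing import NamedTuple


class MissedEntry(NamedTuple):
    db: str         # "sp" or "tr"
    accession: str  # UniProtKB accession
    reason: str     # why no CDS could be mapped


def parse_miss_line(line: str) -> MissedEntry:
    """Split a line into database, accession and reason."""
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected line: {line!r}")
    return MissedEntry(parts[0], parts[1], parts[2])
```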

Database mapping files (*.idmapping)

These files contain mappings from UniProtKB to other databases for each reference proteome. The format consists of three tab-separated columns:

  1. UniProtKB accession
  2. ID_type: name of the cross-referenced database
  3. ID: identifier in the cross-referenced database
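
The three-column layout lends itself to a nested lookup table. The sketch below is one possible way to load it; the function name is hypothetical, and no particular set of ID_type values is assumed.

```python
from typing import Dict, Iterable, List


def read_idmapping(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Build {accession: {ID_type: [ID, ...]}} from tab-separated lines."""
    mapping: Dict[str, Dict[str, List[str]]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        accession, id_type, identifier = line.split("\t", 2)
        # One accession can have several identifiers of the same type.
        mapping.setdefault(accession, {}).setdefault(id_type, []).append(identifier)
    return mapping
```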

SeqXML files (*.xml)

The XML files contain all the information from the fasta (canonical and additional), idmapping and CDS files in SeqXML format (see https://seqxml.org).
E.g. (from UP000005640_9606.xml, header and one entry):

<?xml version="1.0" encoding="utf-8"?> 
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" speciesName="Homo sapiens" xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd" seqXMLversion="0.4" sourceVersion="2016_04"
source="QfO http://www.ebi.ac.uk/reference_proteomes/" ncbiTaxID="9606"> 
<entry id="A0A075B6H9" source="UniProtKB">
<description>Immunoglobulin lambda variable 4-69</description>
<AAseq>MAWTPLLFLTLLLHCTGSLSQLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRYLMKLNSDGSHSKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYCQTWGTGI</AAseq>
<DBRef id="LV469_HUMAN" source="UniProtKB-ID"></DBRef>
<DBRef id="IGLV4-69" source="Gene_Name"></DBRef>
<DBRef id="1133968632" source="GI"></DBRef>
<DBRef id="UniRef100_A0A075B6H9" source="UniRef100"></DBRef>
<DBRef id="UniRef90_A0A075B6H9" source="UniRef90"></DBRef>
<DBRef id="UniRef50_A0A0B4J1Y8" source="UniRef50"></DBRef>
<DBRef id="UPI0000F30329" source="UniParc"></DBRef>
<DBRef id="AC245452" source="EMBL"></DBRef>
<DBRef id="-" source="EMBL-CDS"></DBRef>
<DBRef id="9606" source="NCBI_TaxID"></DBRef>
<DBRef id="ENSG00000211637" source="Ensembl"></DBRef>
<DBRef id="ENST00000390282" source="Ensembl_TRS"></DBRef>
<DBRef id="ENSP00000374817" source="Ensembl_PRO"></DBRef>
<DBRef id="uc062cba.1" source="UCSC"></DBRef>
<DBRef id="HostDB:ENSG00000211637.2" source="EuPathDB"></DBRef>
<DBRef id="IGLV4-69" source="GeneCards"></DBRef>
<DBRef id="HGNC:5921" source="HGNC"></DBRef>
<DBRef id="NX_A0A075B6H9" source="neXtProt"></DBRef>
<DBRef id="ENSGT00900000140867" source="GeneTree"></DBRef>
<DBRef id="GPRYLMK" source="OMA"></DBRef>
<DBRef id="6C449213D2CD44D7" source="CRC64"></DBRef>
<property name="GN" value="IGLV4-69"></property>
<property name="SV" value="1"></property>
<property name="DNAsource" value="ENSP00000374817"></property>
<property name="ensemblVersion" value="91,38"></property>
<property name="UPID" value="UP000005640"></property>
<property name="DNAseq" value="ATGGCTTGGACCCCACTCCTCTTCCTCACCCTCCTCCTCCACTGCACAGGGTCTCTCTCCCAGCTTGTGCTGACTCAATCGCCCTCTGCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGCAGCTACGCCATCGCAT
GGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGATGGCAGCCACAGCAAGGGGGACGGGATCCCTGATCGCTTCTCAGGCTCCAGCTCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCA"></property>
<property name="PE" value="1"></property>
</entry>
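
A sketch of extracting entries from a SeqXML document with the Python standard library. It relies only on the element and attribute names visible in the excerpt above (entry, description, AAseq, DBRef, property), and the function and key names are hypothetical. The excerpt above is not a complete document, since it omits the closing seqXML tag, so it would need that tag before it could be parsed this way.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List, Union


def read_entries(data: Union[str, bytes]) -> List[Dict[str, Any]]:
    """Parse a complete SeqXML document and return one dict per entry."""
    root = ET.fromstring(data)
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "id": entry.get("id"),
            "description": entry.findtext("description"),
            "protein": entry.findtext("AAseq"),
            # (source database, identifier) pairs from the DBRef elements
            "xrefs": [(ref.get("source"), ref.get("id"))
                      for ref in entry.iter("DBRef")],
            # name/value pairs such as GN, SV, UPID and DNAseq
            "properties": {prop.get("name"): prop.get("value")
                           for prop in entry.iter("property")},
        })
    return entries
```

For the full per-species files, which can be large, a streaming parser such as xml.etree.ElementTree.iterparse may be a better fit than loading the whole document at once.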

References

Joining forces in the quest for orthologs

Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella, Erik LL Sonnhammer and Suzanna Lewis

Genome Biology 2009, 10:403

Published: 29 September 2009

Toward community standards in the quest for orthologs

Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer, Javier Herrero and the Quest for Orthologs consortium

Bioinformatics 2012, 28:900

Published: 12 February 2012

Big data and other challenges in the quest for orthologs

Erik LL Sonnhammer, Toni Gabaldón, Alan W Sousa da Silva, Maria Martin, Marc Robinson-Rechavi, Brigitte Boeckmann, Paul D Thomas, Christophe Dessimoz and the Quest for Orthologs consortium

Bioinformatics 2014, 30:2993

Published: 26 July 2014

Standardized benchmarking in the quest for orthologs

Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer and Christophe Dessimoz

Nature Methods 2016

Published online: 4 April 2016