Manfred Koegl, Peter Uetz, Improving yeast two-hybrid screening systems, Briefings in Functional Genomics, Volume 6, Issue 4, December 2007, Pages 302–312, https://doi.org/10.1093/bfgp/elm035
Abstract
Yeast two-hybrid (Y2H) screening methods are an effective means for the detection of protein–protein interactions. Optimisation and automation have increased the throughput of the method to an extent that allows the systematic mapping of protein–protein interactions on a proteome-wide scale. Since two-hybrid screens fail to detect a great number of interactions, parallel high-throughput approaches are needed for proteome-wide interaction screens. In this review, we discuss and compare different approaches for adaptation of Y2H screening to high-throughput, the limits of the method and possible alternative approaches to complement the mapping of organism-wide protein–protein interactions.
INTRODUCTION
To understand the function of a protein, it is useful to know to which other proteins it can bind. For decades, this simple idea has been motivating researchers to look for binding partners of their favourite proteins. Since the biochemical isolation of protein complexes is a tedious and demanding process, alternative methods to find potential binding partners are welcome. Yeast two-hybrid (Y2H) screening [1] has emerged as the most successful of these methods, and has been quickly and widely accepted by the research community. The method has been automated and used in several large-scale projects, including the first drafts of protein interaction maps for humans and several model organisms. In the following, we compare different approaches for adaptation of the method to high-throughput processing, discuss the limits of the method, ways to select reliable interactions from the mass of the screening data and alternative approaches to complement the mapping of organism-wide protein–protein interactions.
HOW THE Y2H SYSTEM WORKS
The basic idea of all two-hybrid methods is to split a protein into two parts that do not work independently, but will work if they are brought together again. When the two fragments are expressed as fusion proteins (‘hybrids’) with two other proteins that have sufficient affinity for each other, the two parts of the split protein are combined again and its function is restored.
In the most common application of this idea, a transcription factor is split into the separate domains that harbour (i) the DNA-binding activity and (ii) the transcriptional activation function (Figure 1). The reconstitution of the transcription factor is detected by the transcriptional activation of a reporter gene. Commonly used reporters generate a colorimetric or fluorescent readout, or allow growth on selective media. For example, in a yeast strain that lacks a functional HIS3 gene, using the wild-type HIS3 gene as a reporter allows the selection of interaction-positive colonies on histidine-free medium.
FISHING IN POOLS OF CDNAS
Reporters that result in selective cell growth allow the enrichment of positive colonies against a background of negative cells. Using this method, complex libraries can be screened for ‘prey’ proteins that interact with a ‘bait’ protein of interest. In the early applications of Y2H screening, cDNA pools were based on oligo(dT)-primed or random-primed cDNAs prepared from the mRNA of diverse tissues, and cloned into a plasmid suitable for Y2H screening [2, 3]. In the case of yeast or prokaryotes, fragmented genomic DNA can be used instead of cDNA [4, 5]. For screening, library clones are pooled, and yeast cells harbouring interacting bait and prey proteins are enriched by use of reporters such as HIS3.
Screening of pooled libraries has been the typical use of the Y2H system in academic labs aiming at the isolation of binding partners for a protein of interest. As suggested by Figure 2, this method is still widely and successfully applied. A disadvantage of libraries created by the cloning of pools of DNA fragments is the uncontrolled fashion in which the coding sequences of the inserts are attached to the coding sequence of the split transcription factor. In many cases, the hybrid protein will be expressed in the wrong reading frame or from the 5′ or 3′ untranslated regions of the mRNA. The resulting non-natural proteins provide a rich source of non-specific interactions that often litter the results of Y2H screens and add to the number of false positives. To minimize false positives, the molecular details of the method, such as the reporter gene constructs and expression vectors for hybrid proteins, have been fine-tuned in many aspects (reviewed in [6–9]), significantly reducing the noise of non-biological interactions. For an initial filtering of the raw interaction data, several technical parameters from Y2H screens are useful. These include the number of different reporters activated by an interaction event, and the level of reporter gene activation. Interactions that do not pass these criteria are usually not reported in publications, although some authors have argued that all raw data (including possible false positives) should be released so that they can be used for further improvement of filtering strategies [10].
ARRAYS OF PREYS
Automation of the rate-limiting steps of the method, such as plating of cells for the selection of positives, picking of positive clones and determination of the interaction signal, allows Y2H screening to be taken to a larger scale, including systematic analyses of protein–protein interactions of whole organisms. As shown in Figure 2, such systematic screens make up a sizeable proportion of currently reported protein–protein interactions identified by Y2H methods. A prerequisite of large, systematic Y2H screening is the availability of cDNA clones encompassing the coding regions of the bait proteins in a suitable vector. Collections of individually cloned cDNAs comprising the full-length open reading frame (ORF) of the mRNA are currently being generated for several species (reviewed in [11–13]), in part in dedicated efforts to provide resources for Y2H screening [14–16]. The use of recombinational cloning systems facilitates the shuttling of the coding sequences between vectors, such that the ORFs can be readily transferred to plasmids appropriate for the expression of fusions with the DNA-binding domain or activation domain in yeast. Such ‘ORFeome’ collections allow a novel strategy for Y2H screening: instead of enriching interacting clones from a mixed pool, the individual clones are tested one by one for an interaction with the bait protein. Typically, the cDNA collection is presented in an arrayed form, and each position in the array is tested pair-wise for interaction signals with a bait protein (see also Figure 3).
This approach has several advantages over the screening of pooled preys (see also Table 1):
The identity of the arrayed proteins is known, such that it is not necessary to isolate and sequence the library insert.
The absence of fusions that are in the wrong reading frame or correspond to non-coding DNA avoids interaction signals from non-natural peptides.
The library is normalised with respect to the representation of each protein. This is in stark contrast to classical cDNA libraries, in which cDNAs from highly expressed mRNAs are overrepresented and cDNAs from weakly expressed mRNAs are rare, which makes it difficult to find interactions involving proteins encoded by weakly expressed mRNAs.
Pair-wise tests for interactions are more sensitive than screens of large cDNA pools, probably because weak signals can be distinguished from background more easily. For example, the number of interactions found by Uetz et al. [17] was much larger than the number found by the pooled library used by Ito et al. [18] when the same bait was used. However, random libraries regularly find more interactions than array screens because they include fragments that may interact while full-length proteins may not (e.g. Fromont-Racine et al. [19] versus Uetz et al. [17] or Rain et al. [20] versus Parrish et al. [21]).
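The first advantage listed above, that the position on the array encodes the identity of the clone, can be illustrated with a minimal sketch. It assumes a single 96-well plate layout (rows A to H, columns 1 to 12) and hypothetical ORF identifiers; neither the plate format nor the names `build_array_index` and `ORF001` come from the studies cited here.

```python
import string

# Assumed layout: one 96-well plate, rows A-H and columns 1-12.
ROWS = string.ascii_uppercase[:8]
COLS = range(1, 13)


def build_array_index(orf_ids):
    """Map well positions (e.g. 'B7') to ORF identifiers, filling row by row."""
    wells = [f"{row}{col}" for row in ROWS for col in COLS]
    if len(orf_ids) > len(wells):
        raise ValueError("more clones than wells on the plate")
    return dict(zip(wells, orf_ids))


# Hypothetical prey collection of 96 clones.
index = build_array_index([f"ORF{i:03d}" for i in range(1, 97)])

# Wells that scored positive with a given bait (illustrative values only).
positive_wells = ["A1", "B7", "H12"]
hits = [index[well] for well in positive_wells]
print(hits)
```

Because the lookup replaces insert isolation and sequencing, identifying a positive reduces to reading a plate coordinate.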
Feature | Pools | Arrays |
---|---|---|
Detection of interactions | Selective growth of positive clones → enrichment from large pools | Pair-wise tests |
Clone identification | Sequencing of the library insert | Position on the array encodes the identity of the insert |
Library complexity (typically) | Several million | Few to thousands |
Libraries screened (typically) | Randomly cloned cDNA fragments | Individually cloned full-length ORFs |
Number of tests in systematic screens | Number of screens required is directly proportional to the number of baits (but more clones need to be analysed per screen) | Number of tests required increases with the square of the number of proteins to be analysed |
Promiscuous preys | Recognised upon repeated screening of the library; cannot be removed from the pool | Recognised upon repeated screening of the library, and removed |
Saturation | Hard to approach (e.g. [78]: saturation is reached in >500 screens) | Saturation can be approached in a few screens |
The number of pair-wise tests in such a matrix screen increases with the square of the number of proteins in the matrix. This is the reason why in practice, most large-scale projects have initially screened mini-pools of clones, rather than protein pairs and then further analysed them by sequencing [17–19, 22, 23] or selective pair-wise tests [17]. More recently, several studies used smart pooling strategies [79], pools of baits [24] or preys [25] which were de-convoluted to obtain individual protein pairs after mating and selecting these pools.
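The quadratic growth described above can be made concrete with a short calculation. The sketch below compares the number of individual tests in a full matrix screen with the number of first-round tests when preys are grouped into mini-pools. The pool size of 96 and the function names are illustrative assumptions, and the count for mini-pools excludes the follow-up work (sequencing or selective pair-wise tests) needed to resolve positive pools.

```python
import math


def matrix_tests(n_baits: int, n_preys: int) -> int:
    """Individual bait-prey tests in a full matrix screen."""
    return n_baits * n_preys


def minipool_tests(n_baits: int, n_preys: int, pool_size: int) -> int:
    """First-round tests when preys are grouped into mini-pools.

    Positive pools still have to be resolved afterwards; that extra
    work is not counted here.
    """
    return n_baits * math.ceil(n_preys / pool_size)


# All-against-all screens of proteomes of increasing size (assumed sizes).
for n in (100, 1000, 6000):
    print(n, matrix_tests(n, n), minipool_tests(n, n, pool_size=96))
```

Doubling the number of proteins quadruples the number of pair-wise tests, whereas pooling reduces the first round by roughly the pool size.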
LARGE-SCALE PROTEIN INTERACTION SCREENING
Both approaches, screening of pooled libraries as well as matrix-type screening of arrayed cDNA libraries, have been automated and used for large-scale interaction maps (Figure 2). An early project dealing with the intra-viral protein interactions of the bacteriophage T7 showed that large protein interaction mapping projects are feasible [26]. This first step was soon followed by large-scale protein–protein interaction mapping projects for bacteria (Helicobacter pylori [20], Campylobacter jejuni [21]), yeast [17–19], plants [27], human viruses [28], Plasmodium falciparum [29] and higher eukaryotes (Caenorhabditis elegans [30], Drosophila melanogaster [22, 31, 32]). Several protein interaction networks for human proteins have been generated for specific areas of interest, such as signal transduction and biochemical pathways [33–35], protein families [36–39], subcellular structures or virus–host interactions [40]. Two groups have recently reported unbiased large interaction screens with the goal of outlining the first draft of the human interactome [23, 24]. All these data have proven to be rich sources of biologically relevant information.
ASSESSMENT OF Y2H DATA
In classical projects, Y2H-based data were only published once the interactions had been tested and confirmed in independent experiments. This has simply not been possible for large-scale Y2H experiments, since the acceleration in data production by Y2H analysis has not yet been matched by improvements in ‘confirming’ methods, such as co-immunoprecipitations. Thus, the bad news is that the new data sources are afflicted with uncertainties that need to be taken into consideration for their use. The good news is that the sheer mass of data allows the selection of reliable data by quantitative, partially statistical criteria. Such criteria mainly include the reproducibility of the interaction and the definition and exclusion of promiscuous interactors, as outlined in the following two sections.
TECHNICAL VERSUS BIOLOGICAL ARTEFACTS
For the discussion of artefacts and their elimination, it is helpful to distinguish technical artefacts, in which an interaction signal is generated by events other than a protein–protein interaction, from biological artefacts, where proteins truly interact, but only when artificially co-expressed [41]. For example, proteins may interact in a Y2H assay without ever being naturally expressed in the same cell. In contrast to technical artefacts, biological artefacts are genuine interactions of bait and prey, and cannot be eliminated by technical controls. In fact, when tested in alternative protein interaction assays, biological false positives will mostly be confirmed. Also, it is hard to define false positives with certainty, since it is impossible to give experimental proof that two proteins never bind to each other under any circumstances.
CRITERIA FOR SELECTING RELIABLE INTERACTIONS
We will discuss five categories of selection methods for reliable interactions, which are based on (i) the reproducibility of interactions, (ii) the promiscuity of interaction partners, (iii) network topology, (iv) comparisons with external data and (v) evolutionary conservation of interaction partners.
Reproducibility: Most technical artefacts are either reproducible, or rare. Rare artefacts can arise e.g. from mutations that artificially generate interaction signals. The likelihood that a rare event occurs twice independently in cells harbouring cDNAs from the same protein is extremely low. Thus, the removal of interactions that are not reproduced within the data set can be used to weed out such rare technical artefacts [22, 30, 39, 42–44].
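A reproducibility filter of this kind can be sketched in a few lines. The example assumes that each entry in `observations` is one independent detection of a bait–prey pair; the default threshold of two observations and all protein names are illustrative rather than taken from the cited studies.

```python
from collections import Counter


def reproducible_interactions(observations, min_count=2):
    """Keep bait-prey pairs detected at least min_count times.

    observations: iterable of (bait, prey) tuples, one per independent hit.
    """
    counts = Counter((bait, prey) for bait, prey in observations)
    return {pair for pair, n in counts.items() if n >= min_count}


# Hypothetical raw hits from repeated screens.
observations = [
    ("A", "B"), ("A", "B"),              # seen twice: kept
    ("A", "C"),                          # seen once: dropped as a possible rare artefact
    ("D", "E"), ("D", "E"), ("D", "E"),  # seen three times: kept
]
print(sorted(reproducible_interactions(observations)))
```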
Promiscuity: Reproducible artefacts are e.g. interaction signals that arise from non-specific binding of the prey to the bait protein chimera. Such artificial activators of the reporter genes become apparent as ‘promiscuous’ preys when a library or an array is repeatedly screened, since they appear to bind to a great number of unrelated baits (Figure 3). These artefacts can be eliminated from the data set by removing all preys that display promiscuities above a threshold level. The cut-off for promiscuity, i.e. the cut-off line for how many interaction partners are allowed before a protein has to be considered promiscuous, is an arbitrary number. Low cut-off values for exclusion from the data set will increase the number of reliable interactions in the remaining data, but at the expense of increasing the rate of false negative interactions [22, 39].
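The promiscuity criterion amounts to counting the distinct baits each prey is found with and discarding preys above a cut-off. The sketch below is one possible implementation under that reading; the function name, the example cut-off of two baits and the protein names are assumptions chosen for illustration.

```python
from collections import defaultdict


def remove_promiscuous_preys(interactions, max_baits):
    """Drop all interactions of preys found with more than max_baits distinct baits.

    Returns the filtered interaction list and the set of excluded preys.
    """
    baits_per_prey = defaultdict(set)
    for bait, prey in interactions:
        baits_per_prey[prey].add(bait)
    promiscuous = {
        prey for prey, baits in baits_per_prey.items() if len(baits) > max_baits
    }
    kept = [(bait, prey) for bait, prey in interactions if prey not in promiscuous]
    return kept, promiscuous


# Hypothetical data: preyX appears with three unrelated baits.
interactions = [
    ("bait1", "preyX"), ("bait2", "preyX"), ("bait3", "preyX"),
    ("bait1", "preyY"), ("bait2", "preyZ"),
]
kept, flagged = remove_promiscuous_preys(interactions, max_baits=2)
print(kept, flagged)
```

Lowering `max_baits` removes more suspect preys but, as noted above, also discards genuine interactions of proteins with many natural partners.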
Topology: The definition of a cut-off promiscuity value as a criterion to exclude interactions from the data set is problematic. Many proteins have a large number of genuine natural binding partners, and will be erroneously excluded from the network based on their apparent promiscuity. When the interaction network is large enough, or can be integrated with external data sets into an existing larger network, topology metrics can be used to correct for that. These metrics test whether the binding partners of a protein are connected to each other. For example, the number of common binding partners of an interaction pair is a positive indicator of interaction reliability (see Figure 4 for an example) [22]. More complex algorithms that calculate weighted alternative path lengths for protein pairs to derive confidence measures [45, 46], or that score local topologies [47] or clusters [24, 44], have been shown to be useful in the selection of relevant interactions.
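The simplest of these topology metrics, the number of binding partners shared by an interacting pair, can be computed directly from an undirected network. The following is a minimal sketch of that single metric only, not of the weighted path-length or clustering algorithms cited above; the function names and example network are hypothetical.

```python
def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours


def shared_partners(neighbours, a, b):
    """Proteins that bind both a and b (excluding a and b themselves)."""
    common = neighbours.get(a, set()) & neighbours.get(b, set())
    return common - {a, b}


# Hypothetical network: A and B share partners C and D; E-F is isolated.
network = build_network([
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("A", "D"), ("B", "D"), ("E", "F"),
])
print(sorted(shared_partners(network, "A", "B")))
print(sorted(shared_partners(network, "E", "F")))
```

Under this metric the A–B interaction is supported by two common partners, whereas E–F has none and would receive no topological support.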
Indirect support: Comparisons with external data sets have shown that proteins that bind to each other have a higher than average likelihood to be involved in related cellular functions, are more likely to be expressed at the same time, and to interact genetically with each other [23, 48–51]. These criteria are most useful to assess the overall quality of a data set, and to test the usefulness of selection criteria [52].
Conservation: Lastly, interactions have been shown to be more likely if they are conserved in evolution, as evidenced by paralogous or homologous interacting proteins [24, 39, 48].
LIMITATIONS OF THE Y2H SYSTEM
Many natural protein–protein interactions cannot be detected using the Y2H method. Some proteins do not interact in the environment of the yeast nucleus, such as proteins of the secretory compartments that require oxidative conditions or glycosylation for proper folding. Integral membrane proteins are unlikely to work in the context of a reconstituted transcription factor. Many interactions are triggered by post-translational modifications not available in yeast. Other proteins, such as active tyrosine kinases, are toxic to yeast when expressed at high levels, and cannot be used as baits. For these reasons, the rate of interactions not detectable by the Y2H system is substantial (e.g. [18, 53]). Rajagopala et al. [54] estimated that their array-based Y2H screens found only 23% of previously known interactions involving motility proteins of Treponema pallidum. When data from another screen in Campylobacter jejuni were added, this fraction rose to 33%. However, many additional interactions were found.
But at least within the limits of the method, it would be desirable that screens be exhaustive, i.e. that they identify all interactions that can be identified by use of the Y2H method. Screens of pooled libraries can only asymptotically approximate saturation. Given that those libraries have complexities of several millions, and weakly expressed proteins are underrepresented, most screens are subsaturating. In contrast, arrayed libraries can theoretically be screened comprehensively. However, comparisons of the presently available data sets for yeast (see Figure 3 for an example) [17, 18, 53], fly [22, 31, 32] and man [55] show that in all cases, the overlap of interaction data is minimal, mainly due to the fact that most of the screens are far from exhaustive. Moreover, variations in the details of the Y2H protocol, such as the vectors used, the nature of the re-constituted transcription factor and the libraries screened, have a great impact on the interactions that can be retrieved. Evidently, due to the relatively low detection rate of the Y2H system, other methods will be needed to approach the complete mapping of the human proteome, or that of model organisms. Apart from biochemical fractionation of protein complexes followed by mass spectrometry to analyse their components, several other methods may be apt for the task.
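The overlap between two interaction data sets can be quantified once bait and prey orientation is ignored, so that A–B and B–A count as the same interaction. The sketch below reports the shared interactions and their Jaccard index (shared pairs divided by all distinct pairs). The Jaccard index is chosen here for illustration and is not necessarily the measure used in the comparisons cited above; the data are hypothetical.

```python
def as_unordered(pairs):
    """Represent each interaction as an unordered pair."""
    return {frozenset(pair) for pair in pairs}


def overlap(screen_a, screen_b):
    """Return the shared interactions and the Jaccard index of two screens."""
    a, b = as_unordered(screen_a), as_unordered(screen_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Hypothetical results of two independent screens.
screen_1 = [("A", "B"), ("C", "D"), ("E", "F")]
screen_2 = [("B", "A"), ("G", "H")]

shared, jaccard = overlap(screen_1, screen_2)
print(len(shared), jaccard)
```

A low index between two screens of the same proteome is consistent with both being far from saturation, although differences in vectors and libraries would lower it as well.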
ADDING EDGES: ALTERNATIVE PROTEIN INTERACTION ASSAYS
At the time of inception of the Y2H method, arrayed libraries were not available for screening in pair-wise interaction tests. Interacting protein pairs had to be isolated from complex mixtures of proteins or from complex libraries, and one of the great advantages of Y2H screening compared to other interaction tests was its ability to enrich for clones of interacting proteins from a large pool. The availability of ORFeome collections and the development of methods that allow thousands of pair-wise interaction tests in parallel make this advantage somewhat obsolete. Additional methods now become applicable to matrix-type interaction screens, although their advantages or disadvantages will only become clearer when more data is available. Three of them are discussed subsequently.
PROTEIN AND PEPTIDE MICROARRAYS
Microarrays have led to a tremendous parallelisation in the analysis of nucleic acids. For proteins, microarray technology (reviewed in [56]) is still in an earlier stage of development. Problems with the expression, purification, storage and stability of large sets of native proteins still severely hamper progress in the field. To date, proteome-wide arrays useful for protein interaction studies have been generated only for yeast proteins. In a pioneering project, these arrays were used to identify novel calmodulin-binding proteins [57]. Apart from yeast proteins, protein interaction studies using protein microarrays have been centred on particular protein families or domains, such as the SH2 domain [58] or the PDZ domain [59]. Possibly, the use of nucleic acid programmable protein arrays (NAPPA) may provide a route for the cost-effective generation of protein chips useful for the study of protein–protein interactions [60]. For NAPPA chips, DNA molecules are spotted that guide the in situ production of recombinant proteins by a coupled in vitro transcription and translation reaction. The expressed tagged proteins are captured by specific antibodies spotted onto the same spot as the DNA. The immobilised array can be probed for binding with an alternatively tagged soluble protein. In a proof-of-concept experiment, this method has been used to detect protein–protein interactions among 29 human replication initiation proteins [60].
PROTEIN COMPLEMENTATION AND ALTERNATIVE TWO-HYBRID ASSAYS
Using a principle similar to that of Y2H tests, protein complementation assays (PCAs) use two proteins tagged with two fragments of a reporter protein (reviewed in [61]). Upon interaction of the proteins, the two fragments can reconstitute the active reporter protein, providing a readout for the interaction. A great variety of proteins lend themselves to use in PCAs [61]. Some of them have direct read-outs, such as luciferase enzymes [62, 63]; others have indirect read-outs, such as the split ubiquitin ([64] reviewed in [65]) or the split tobacco etch virus (TEV) system [66]. The split ubiquitin system is well matured, and has been applied in a successful effort to map hundreds of protein–protein interactions involving yeast membrane proteins [67]. In the split TEV system, the split enzyme is a protease from TEV. The amino acid motif recognised by this protease is absent in mammalian proteins, such that it can be expressed in a mammalian cell without inflicting damage on the cellular proteins. Activation of TEV causes the liberation of a transcription factor from an inactive complex, which can be read out directly by reporter proteins (Figure 5A). Another alternative two-hybrid assay that has been shown to be amenable to cDNA library screening is the MAPPIT system, which is based on the interaction-dependent activation of STAT transcriptional regulators by a chimeric receptor coupled with transcriptional reporters [68].
QUANTITATIVE CO-PRECIPITATION USING LUCIFERASE-TAGGED PROTEINS
The most straightforward concept to test for a protein–protein interaction is to purify one of the proteins, and test for the presence of the other. In un-biased assays using mass spectrometric analyses of the co-precipitated material, such protocols have been the basis of the discovery of many protein–protein interactions, and used in large-scale projects for yeast [75, 76] and mammalian [77] protein complex mapping. For more straightforward detection of the binding partner, proteins can be fused to more easily detectable proteins, such as luciferase. In this case, pair-wise interactions are tested in dedicated assays. Barrios-Rodiles et al. [69] miniaturized this assay and applied it to the analysis of protein–protein interactions in TGF-β signal transduction. This method is quick and cost-effective enough to allow for proteome-wide interaction screens. As in the MAPPIT and the split TEV system, interactions are isolated from a physiological environment, which is beneficial when interactions need to be tested after a regulatory event, e.g. cytokine stimulations (Figure 5B).
OUTLOOK
Despite a large number of interactions deposited in specialised databases there is still no complete interactome available for any organism. Driven by current efforts in large-scale Y2H screening, the availability of ORFeome collections and novel methods for the detection of protein–protein interactions, we expect such interaction maps to become available for a few model organisms in the near future. Overlapping protein interaction data sets gathered by independent methods will increase the confidence in those interactions that are detected by more than one method. Large-scale affinity purification projects coupled with mass spectrometry analyses will also complement the map of protein interactions and elucidate the composition of complexes which are stable enough to survive the purification processes.
Prior to the introduction of Y2H screening, identifying a potential interaction was the rate-limiting step in many projects, and the minimal merit of the method is that the rate-limiting step has been shifted to confirming an interaction's significance. With the availability of confirmed protein interaction data in public databases, this obstacle will be removed as well, and the rate-limiting step will be shifted towards understanding the biological function of the interactions.
Large-scale Y2H screening projects are currently used to build the first proteome-wide binary protein–protein interaction maps.
Availability of proteome-wide repositories of expression clones facilitates protein interaction screens by Y2H and other methods such as automated co-purification assays.
Statistical filtering of large protein interaction data sets allows the definition of high-confidence protein interaction data.
Novel methods are waiting in the wings and will increasingly contribute to the comprehensive mapping of protein–protein interactions.
Acknowledgements
We are grateful to Frank Schwarz, Christian Maercker, Gerald Nyakatura and Dirk Kuck for critical reading of the manuscript and to Bernd Korn, Ralf Tolle and Joachim Uhrig for helpful discussions. This project has been supported by the Nationale Genomforschungsnetz (PSR-S19T039).