Introduction

Dense single nucleotide polymorphism (SNP) data sets provide an immense opportunity for studying population genetics, allowing the development of catalogs for signatures of selection, structural variants, and haplotype assortment. In particular, the HapMap project, launched in 2002, provides a public database of common SNPs across the genome in hundreds of individuals from geographically diverse populations. There have been multiple genome-wide scans attempting to identify regions of recent and ongoing selection using HapMap data.1, 2, 3, 4, 5, 6 However, such studies tend to report only the most extreme outliers, and it is therefore conceivable that some regions are omitted from these published reports. Furthermore, it has been suggested that regions identified in multiple analyses are more likely to be true positives.6, 7

Regions of the genome that are part of ongoing selective sweeps may include loci involved in phenotypic outcomes, including both deleterious and protective variants. It is therefore of interest to compare haplotypes across regions with evidence of recent selection against association signals from genetic mapping.4 Incorporating population genetic concepts into genetic research may thus be important: understanding patterns of genetic variation and evolutionary selection in the human genome may help to shed light on genetic epidemiology studies.

In general, five types of established tests are used to identify regions of selection, based on the varied signatures produced by natural selection:8 the function-altering mutations test, the heterozygosity test, the derived allele test,9, 10 the population differential test,5, 11 and the long haplotype test.4, 5, 12, 13 In that order, each type of test detects selection on a progressively shorter time scale. We are most interested in genome-wide scan methods for detecting the last three types of selection signatures, since they could be used as an aid for disease gene mapping.

The composite likelihood ratio (CLR)9 test is a type of derived allele test. The CLR test searches for the signature of a recent ‘complete’ selective sweep using pooled data from various populations. This test applies a genomic window approach, comparing spatial patterns of allele frequencies against the whole genome and a coalescent simulation model.9, 10 The long haplotype tests have several variations, including the long-range haplotype (LRH) test,12 haplotype similarity, and other haplotype-sharing methods. The extended haplotype homozygosity (EHH) test is a type of LRH test that is based on extended haplotype length and high linkage disequilibrium. Two extensions have been made to EHH: the integrated Haplotype Score (iHS) test4 and the cross population extended haplotype homozygosity (XP-EHH) test,5 both of which have been used extensively. The iHS test detects alleles that have risen to intermediate or high frequency, and XP-EHH identifies selected alleles that have risen to near fixation in one but not all populations. XP-EHH is also considered a population differential test. Lastly, the composite of multiple signals (CMS)13 test, which incorporates iHS, XP-EHH, Wright's fixation index (FST)11 and two additional tests, can be used to localize regions of selection to a more precise position.

Among these kinds of tests, the long haplotype tests are the most similar to our shared genomic segment (SGS) approach. The EHH methods detect regions of identity-by-descent by measuring LD at a distance from a core region through the calculation of extended haplotype homozygosity. Our method, in contrast, works directly on unphased data to identify regions of reduced genetic diversity, in which population dynamics gave rise to the reduction and produced regions of long haplotypes.

In this study, we have implemented our SGS method to identify regions of complete, or almost complete, selective sweeps across the human genome. The original SGS algorithm was introduced by Thomas et al.14 Here we incorporate an error model to provide additional robustness to possible genotyping error in the data. We apply this new algorithm to investigate recent adaptive evolution in populations from three continents, using the second-generation HapMap data sets. Our SGS approach concentrates on identifying the longest runs of heterozygous sharing within populations, considering outlier regions as areas of possible selection. We compare our putative regions of selection with those identified by other methods, noting which are novel and which replicate earlier findings, provide an ontological analysis of the genes in these regions, and finally list those genes that have been related to disease phenotypes.

Materials and methods

Data sets

We retrieved HapMap phase II October 2008 build release 24 genotype data sets,15 with alleles expressed in the forward strand of the reference human genome (NCBI build 36/UCSC hg18). We analyzed three populations: 90 unrelated individuals from two Asian panels from Tokyo, Japan and Beijing, China (ASN); 60 unrelated parents from 30 Utah trios with ancestry from Northern and Western Europe (CEU); and 60 unrelated parents from 30 Yoruba trios from Ibadan, Nigeria (YRI). The SNP genotyping data sets included over 3.7 million markers in each population: 3 879 488 SNPs (ASN); 3 849 034 SNPs (CEU); and 3 762 311 SNPs (YRI).

Shared genomic segment analysis

Consider s SNPs (with alleles coded as 1 and 2) genotyped on n individuals. Define ni11 and ni22 as the number of individuals that are homozygous for alleles 1 and 2 at locus i, respectively. Define the variable Si as Si = n − min(ni11, ni22), which indicates the number of individuals from the total n that share at least one allele identical-by-state (IBS) at locus i. Alleles are considered IBS if they are the same type, irrespective of whether they are inherited from a recent ancestor. Missing genotypes are treated as heterozygous. Define Ri(n) as the length of the longest tract of consecutive loci containing the ith SNP at which Si=n, that is, where all of the n individuals share at least one allele IBS.
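The definitions of Si and Ri(n) above can be illustrated with a minimal Python sketch. The genotype coding (0, 1, 2 for the count of allele 2, with None for missing) and the function names `sharing_counts` and `run_lengths` are assumptions made for illustration and are not taken from the original SGS implementation; loci at which Si<n are assigned a run length of 0 here.

```python
from typing import List, Optional

# Assumed coding: 0 = homozygous 1/1, 1 = heterozygous, 2 = homozygous 2/2,
# None = missing (counted as heterozygous, following the text).
Genotype = Optional[int]


def sharing_counts(genotypes: List[List[Genotype]]) -> List[int]:
    """Return S_i for every locus; genotypes[k][i] is individual k at locus i."""
    n = len(genotypes)
    s = len(genotypes[0]) if n else 0
    counts = []
    for i in range(s):
        n11 = sum(1 for person in genotypes if person[i] == 0)
        n22 = sum(1 for person in genotypes if person[i] == 2)
        # Individuals sharing at least one allele IBS at locus i.
        counts.append(n - min(n11, n22))
    return counts


def run_lengths(counts: List[int], n: int) -> List[int]:
    """Return R_i(n): length of the maximal tract with S_i == n containing locus i."""
    lengths = [0] * len(counts)
    start = 0
    while start < len(counts):
        if counts[start] != n:
            start += 1
            continue
        end = start
        while end < len(counts) and counts[end] == n:
            end += 1
        for i in range(start, end):
            lengths[i] = end - start
        start = end
    return lengths
```

For three individuals, a locus with genotypes 1/1, 1/2, and 2/2 has Si = 3 − min(1, 1) = 2, so it terminates any tract of full sharing.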

Equalizing the population sample sizes

Because of varying sample sizes in each population, we used a re-sampling method to generate a consistent sample size of 45 individuals for each population. Specifically, we randomly sampled 45 individuals with replacement from a population and performed an SGS analysis to obtain Ri(45). In each population, we repeated the sampling 10 times and averaged the Ri(45) values for all i across the genome.
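The re-sampling step can be sketched as follows. This is only an illustration under stated assumptions: `averaged_run_lengths` and its parameters are hypothetical names, the SGS analysis itself is passed in as a caller-supplied function, and the fixed seed is there purely to make the sketch reproducible.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Individual = TypeVar("Individual")


def averaged_run_lengths(
    population: Sequence[Individual],
    run_length_fn: Callable[[List[Individual]], List[float]],
    sample_size: int = 45,
    replicates: int = 10,
    seed: int = 0,
) -> List[float]:
    """Average per-locus run lengths over repeated samples drawn with replacement."""
    rng = random.Random(seed)
    totals: List[float] = []
    for _ in range(replicates):
        # Sampling with replacement: one individual may be drawn more than once.
        sample = [rng.choice(population) for _ in range(sample_size)]
        lengths = run_length_fn(sample)
        if not totals:
            totals = [0.0] * len(lengths)
        for i, value in enumerate(lengths):
            totals[i] += value
    return [total / replicates for total in totals]
```

Here `run_length_fn` stands for whatever routine maps a sample of individuals to the per-locus values Ri(45).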

SGS error model

For genome-wide SNP genotyping assays, the estimated average per genotype accuracy is greater than 99%. However, across many samples and millions of densely genotyped SNPs, the presence of errors becomes non-negligible. For SGS analysis, a single genotyping error creating a homozygous genotype could erroneously terminate a tract of shared SNPs. Assuming a random, uniform, and independent error rate of ɛ per SNP, the number of falsely genotyped SNPs in an interval of length δ SNPs is a Poisson random variable with parameter λ=2δɛ for each individual's genome.16 In this study we use ɛ=0.01, allowing an average of one potentially erroneous non-sharing SNP per 100 SNPs in an SGS run involving n individuals. In other words, we incorporate SNPs that are not shared by all n individuals (that is, Si<n) into the shared SGS run if this results in a non-sharing error rate of at most 1 SNP per 100. This is equivalent to a maximum genotyping error rate of 0.1% based on our sample size of 45 individuals. We also require no more than two consecutive erroneous SNPs. That is, if the non-sharing stretch between two adjacent SGS runs comprises no more than two SNPs, and if considering those SNPs as errors leads to an error rate less than ɛ=0.01, then the two segments are merged. We post-processed the original SGS run lengths, Ri(45), to create a new run length, Oi(45), that incorporates the error modeling, and then averaged it across the 10 re-sampled data sets per population to give Ōi (45).
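One way to read the merging rule is sketched below in Python. Several details are assumptions rather than statements of the authors' implementation: runs are merged greedily from left to right, the error rate is taken as the number of absorbed non-sharing SNPs divided by the merged run length in SNPs, and the boundary case of exactly 1 error per 100 SNPs is accepted. The function name `merge_runs` is hypothetical.

```python
from typing import List, Tuple


def merge_runs(
    shared: List[bool], eps: float = 0.01, max_gap: int = 2
) -> List[Tuple[int, int, int]]:
    """Merge adjacent shared runs separated by short non-sharing gaps.

    shared[i] is True when S_i == n. Returns (start, end, errors) triples,
    with end exclusive and errors the number of absorbed non-sharing SNPs.
    """
    # Step 1: collect the maximal runs of fully shared SNPs.
    runs: List[Tuple[int, int]] = []
    i = 0
    while i < len(shared):
        if not shared[i]:
            i += 1
            continue
        j = i
        while j < len(shared) and shared[j]:
            j += 1
        runs.append((i, j))
        i = j

    # Step 2: greedy left-to-right merging (an assumed strategy).
    merged: List[Tuple[int, int, int]] = []
    for start, end in runs:
        if merged:
            prev_start, prev_end, prev_errors = merged[-1]
            gap = start - prev_end
            errors = prev_errors + gap
            length = end - prev_start
            # At most max_gap consecutive errors and a rate of at most eps.
            if gap <= max_gap and errors <= eps * length:
                merged[-1] = (prev_start, end, errors)
                continue
        merged.append((start, end, 0))
    return merged
```

Under these assumptions, two runs of 120 and 80 shared SNPs separated by a single non-sharing SNP merge into one run of 201 SNPs with one error, whereas two runs of 10 SNPs separated by one SNP do not, because 1 error in 21 SNPs exceeds ɛ=0.01.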

Comparison with other methods of positive selection analysis

We compared our putative regions of natural selection (outlier values of Ōi (45) for each population) to those published using different population genetic methods (iHS,4, 6 XP-EHH,5, 6 CLR,9, 10 and CMS13) and HapMap data sets.

Gene ontology analysis

The BiNGO tool17 (http://www.psb.ugent.be/cbd/papers/BiNGO/) was used to perform a Gene Ontology (GO) analysis on genes residing in potential regions of selection from each population. A binomial test was used to compute P-values, and the Benjamini–Hochberg false discovery rate method was applied to account for multiple testing. We selected those GO terms that include at least four genes, with a P-value cut-off of 0.05 and a Benjamini–Hochberg false discovery rate q-value cut-off of 0.2. Furthermore, when multiple GO terms accounted for the same cluster of genes, we retained only the GO terms at the leaf nodes.
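The statistical filtering described above can be illustrated with a short, standard-library Python sketch. It is not BiNGO's code: the function names, the dictionary layout of `terms`, the use of an upper-tail binomial probability as the enrichment test, and the inclusive cut-offs are all assumptions made for illustration.

```python
from math import comb
from typing import Dict, List


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more annotated genes."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def select_terms(terms: List[Dict], min_genes: int = 4,
                 p_cut: float = 0.05, q_cut: float = 0.2) -> List[str]:
    """Keep GO terms passing the gene-count, P-value, and q-value cut-offs."""
    qvalues = benjamini_hochberg([t["p"] for t in terms])
    return [
        t["name"]
        for t, q in zip(terms, qvalues)
        if t["genes"] >= min_genes and t["p"] <= p_cut and q <= q_cut
    ]
```

Each entry of `terms` is assumed to carry a term name, the number of selected genes annotated to it, and its P-value; the leaf-node filtering step is not covered by this sketch.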

Results

We performed SGS with our error model in each of the three populations described above: ASN, CEU, and YRI. The mean physical lengths of the SGS distribution (basepairs determined from Ōi (45)) for these populations were 2.1 kb for ASN, 2.02 kb for CEU, and 1.83 kb for YRI. It is known that both CEU and ASN have experienced recent population bottlenecks, whereas the YRI population may have experienced less recent selective forces, and consequently the average SGS sharing within ASN and CEU is longer than that seen in YRI.

Outlier regions were identified as those residing in the extreme tail of the distribution for each population. Based on the shapes of the SGS length distributions, we determined SGS run length thresholds of 500 SNPs for ASN, 350 SNPs for CEU, and 100 SNPs for YRI, which correspond to approximately the top 0.1–0.2% of each distribution. Figure 1 shows the Ōi (45) SGS run lengths observed in the three populations. In total, 20 regions were defined as outliers (11 in ASN, 7 in CEU, and 2 in YRI). Sixteen of these regions have additional supporting evidence from the literature,4, 5, 6, 9, 10, 13, 15 and the remaining four are novel, including both of the YRI regions. Table 1 shows the details of all the regions we identified, including the population in which the region was identified, whether the region had previously been proposed, and all reference sequence genes that reside in the region. The centimorgan distance was calculated based on the recombination rates and distances provided by HapMap.
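The thresholding step can be sketched as a scan for contiguous stretches of SNPs whose averaged run length meets the population-specific cut-off. The thresholds are those stated above; the function name `outlier_regions`, the inclusive comparison, and the representation of a region as a half-open index interval are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Run length thresholds in SNPs, as given in the text.
THRESHOLDS: Dict[str, int] = {"ASN": 500, "CEU": 350, "YRI": 100}


def outlier_regions(values: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) SNP index intervals, end exclusive, with values >= threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(values):
        if value >= threshold:
            if start is None:
                start = i  # a new outlier region opens here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(values)))
    return regions
```

A call such as `outlier_regions(yri_values, THRESHOLDS["YRI"])`, where `yri_values` is a hypothetical list of averaged run lengths ordered along a chromosome, would return the candidate regions for that population.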

Figure 1

SGS observation in three HapMap populations. Y-axis is Ōi (45). (a) HapMap CEU data set. (b) HapMap ASN data set. (c) HapMap YRI data set. The color reproduction of this figure is available on the html full text version of the manuscript.

Table 1 Regions of extreme sharing identified by SGS

In what follows we assess the novel regions we identified as outliers in our SGS analysis. A region on chromosome 1, containing the gene PDE4DIP, is a novel region of potential selection that was identified in both the CEU and YRI populations, and hence this region may correspond to selection before the ‘out of Africa’ event. In addition, the PDE4DIP gene has been found to be associated with esophageal squamous cell carcinoma (SCC).18

A juxtacentromeric region on chromosome 21 that has previously been identified as a potential region of selection in CEU by XP-EHH5 was identified in both CEU and YRI by our SGS method. The common region across the two populations contains two genes: BAGE2 and BAGE. The BAGE (B melanoma antigen) gene family has been shown to be under selective pressure, with BAGE proteins considered as a potential birth site of novel genes.19 In our analysis, the region extended to the TPTE gene in the YRI population. Both TPTE and BAGE2 are associated with Robertsonian Down syndrome.20

The chromosome 7 region is a novel identification in ASN, but has been identified as a region with very strong positive selection in a study using European–American sequencing samples.21 Four genes in the region, EPHB6, TRPV5, TRPV6, and KEL, were suggested to be under demographical selection in European–Americans. The genes EPHB6 and KEL were identified as candidate genes for positive selection, according to evidence of an excessive amount of rare alleles and high frequency derived alleles. A skewed polymorphism was detected in TRPV6, which also suggests recent positive selection.22 Our results extend these findings to the ASN population. It is possible that the migration of modern humans out of Africa into Asia has been accompanied by genetic adaptations similar to those experienced by Europeans.

A total of 50 genes from regions of selection in the ASN population, 23 genes from CEU, and 4 genes from YRI were used for ontological analysis. It should be noted that there is a lack of power because of an inadequate number of genes in ontology categories for all three populations. Significant GO analysis results were only found in the ASN population, as is illustrated in Table 2.

Table 2 Gene Ontology Analysis among Genes in the Regions of Selection in ASN

Discussion

We have studied three diverse populations from the HapMap phase II data set with an SGS algorithm that incorporates an error model. On the basis of outlier regions, we detected 16 genomic regions that have previously been proposed as under the influence of natural selection, demonstrating the capacity of SGS and lending more weight to those regions. In addition, we have proposed four novel regions of potential selection, one of which was observed in two of the three populations studied. Because regions of selection likely contain functionally significant polymorphisms for human genes, the mapping of regions showing selection could aid in the effort to identify important genetic factors in recent human development or disease resistance. Also, variants that are not under direct selection, but are hitch-hiking because of their proximity to the selected site, will also occur on the shared segment, and may become increasingly important genetic factors in complex, polygenic diseases.

In the regions identified, we found genes from functional pathways that are known to be under selection, including pathways of immune response (KEL and P4HTM), tumor antigen (BAGE), and spermatogenesis (CYP26B1 and LAMB2), as well as sensory perception (COL11A1). Our ontological analysis shows that biological processes such as signal transduction, protein metabolism and modifications, and cell motility are enriched, similar to what has been reported in other selection studies.4, 13 Furthermore, among the genomic regions identified by SGS as potentially under selection, there are several genes that have been associated with complex disorders. Examples include four musculoskeletal disease genes (COL11A1, DAG1, DTNA, and SLC25A20) and genes associated with disorders of the oxidative phosphorylation system (MRPS16 and NDUFAF3). Several genes associated with brain development and neurological diseases were also found to be located in the SGS regions, including 14 genes shared among ASN samples (AMT, BSN, MST1, CELSR3, CYP26B1, EMX1, EPHB6, EVI5, GFI1, GPX1, LAMB2, RHOA, VSNL1, and WDR6), KIAA0319L and SLC25A46 in CEU, and one gene (PDE4DIP) found in both CEU and YRI. In addition, there are several cancer and other disease genes located in our regions of selection, such as a tumor suppressor gene (ANXA7), a colorectal cancer gene (MLL3), and genes associated with multiple sclerosis (GFI1 and EVI5), as well as genes or gene products that are associated with multiple cancers, such as GPX1, RHOA, and APEH. Moreover, the 3p21.31 region identified in ASN lies within a region associated with Crohn's disease. These observations are consistent with the hypothesis that genes involved in common disease could often be targets of selection.

Our SGS method is best suited to identifying complete, or near complete, selective sweeps, and therefore our results are most consistent with those from CLR and XP-EHH, which are also strongest for such sweeps. On the other hand, our results were less consistent with iHS, because the iHS method is designed to detect regions where the selected allele has not yet reached fixation or that have undergone balancing selection. We note that SNP ascertainment bias is an issue that hinders all methods from detecting regions of selection. Such bias in SNP selection leads to genotype data that deviate from the expected level of variability, distribution of allele frequencies, and levels of linkage disequilibrium. This issue will be addressed when sequence data are available for each population.

In conclusion, we have demonstrated that our SGS method is capable of identifying regions of selection using dense SNP data sets. Given that functional allelic variants are subject to purifying selection, this ubiquitous selection keeps functional alleles rare, and such rare variants are considered to be a driving force of complex traits. The ability of SGS to detect regions of selection can therefore aid the search for disease susceptibility loci.