ABSTRACT
The last ten years have seen exponential growth in direct-to-consumer genomics. One popular feature of these tests is the distant ancestral inference profile: a breakdown of the regions of the world where the test-taker's ancestors may have lived. While current methods and products generally focus on the more distant past (e.g., thousands of years ago), we have recently demonstrated that network analysis tools such as community detection can identify more recent ancestry. However, running community detection on a large network with potentially millions of nodes is not feasible in a live production environment where hundreds or thousands of new genotypes are processed every day. In this study, we describe a classification method that leverages network features to assign individuals to communities in a large network, where each community corresponds to a recent ancestry. We recently launched a beta version of this research as a new product feature at AncestryDNA.
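To make the idea of assigning a new individual from network features concrete, here is a minimal sketch. It is not the classifier described in the paper, whose details the abstract does not give. It assumes the new individual's connections to already-labeled nodes are available as weighted edges, and it replaces a trained classifier with a simple threshold on each community's share of connection weight. The function names, the `threshold` parameter, and the example labels are all hypothetical.

```python
from collections import defaultdict


def community_features(neighbors, membership):
    """Share of a new node's labeled edge weight that falls in each community.

    neighbors:  dict mapping node id -> edge weight to the new node (assumed input)
    membership: dict mapping node id -> community label for nodes already
                placed by an earlier, offline community-detection run
    """
    totals = defaultdict(float)
    for node, weight in neighbors.items():
        label = membership.get(node)
        if label is not None:  # skip neighbors with no known community
            totals[label] += weight
    total = sum(totals.values())
    if total == 0:
        return {}
    return {label: weight / total for label, weight in totals.items()}


def assign_communities(features, threshold=0.3):
    """Hypothetical decision rule: keep communities whose share meets the threshold."""
    return sorted(label for label, share in features.items() if share >= threshold)


# Toy example: three labeled neighbors and one unlabeled neighbor.
membership = {"a": "X", "b": "X", "c": "Y"}
neighbors = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 5.0}
features = community_features(neighbors, membership)
assigned = assign_communities(features)
```

The point of this shape is that the new node only needs a pass over its own edges, so the full network never has to be re-clustered when a new genotype arrives. In practice the per-community shares would more plausibly feed a trained classifier than a fixed cutoff.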
Estimation of Recent Ancestral Origins of Individuals on a Large Scale