Estimation of Recent Ancestral Origins of Individuals on a Large Scale

Authors:
Ross E. Curtis (AncestryDNA, Lehi, UT, USA)
Ahna R. Girshick (AncestryDNA, San Francisco, CA, USA)
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2017, Pages 1417–1425
https://doi.org/10.1145/3097983.3098042

Published: 13 August 2017

ABSTRACT

The last ten years have seen exponential growth in direct-to-consumer genomics. One popular feature of these tests is the report of a distant ancestral inference profile: a breakdown of the regions of the world where the test-taker's ancestors may have lived. While current methods and products generally focus on the more distant past (e.g., thousands of years ago), we have recently demonstrated that more recent ancestry can be identified by leveraging network analysis tools such as community detection. However, running community detection on a large network with potentially millions of nodes is not feasible in a live production environment where hundreds or thousands of new genotypes are processed every day. In this study, we describe a classification method that leverages network features to assign individuals to communities, corresponding to recent ancestry, in a large network. We recently launched a beta version of this research as a new product feature at AncestryDNA.
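To make the idea of assigning a new individual from network features concrete, here is a minimal sketch; it is not the method from the paper. It assumes each new sample comes with a dictionary of weighted connections to existing nodes, and that those nodes already carry community labels from an offline community-detection run. The function names, the weight-share feature, and the fixed threshold are all hypothetical; the threshold rule merely stands in for a trained classifier (the author tags mention random forest classification, which this sketch does not implement).

```python
from collections import defaultdict


def community_features(matches, labels, communities):
    """Share of a new sample's total match weight that goes to each community.

    matches:     dict of neighbor id -> edge weight (hypothetical input format)
    labels:      dict of neighbor id -> community label from an offline run
    communities: ordered list of community labels defining the feature order
    """
    totals = defaultdict(float)
    for neighbor, weight in matches.items():
        community = labels.get(neighbor)
        if community is not None:  # neighbors without a label add no community weight
            totals[community] += weight
    total_weight = sum(matches.values())
    if total_weight == 0:
        return [0.0 for _ in communities]
    return [totals[c] / total_weight for c in communities]


def assign(features, communities, threshold=0.5):
    """Stand-in decision rule: keep every community whose share meets the threshold."""
    return [c for c, share in zip(communities, features) if share >= threshold]


# Toy data (entirely made up for illustration).
communities = ["A", "B"]
labels = {"n1": "A", "n2": "A", "n3": "B"}
matches = {"n1": 3.0, "n2": 1.0, "n3": 1.0, "n4": 1.0}  # n4 has no community label

features = community_features(matches, labels, communities)
assigned = assign(features, communities)
print(assigned)  # ['A']
```

The point of the sketch is that the per-sample work touches only the new individual's own connections, so it avoids rerunning community detection over the whole network each time a genotype arrives.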

Supplemental Material

curtis_ancestral_origins.mp4 (mp4, 375.5 MB)

Index Terms

1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        1. Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees

Published in
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2017, 2240 pages
ISBN: 9781450348874
DOI: 10.1145/3097983

General Chairs: Stan Matwin (Dalhousie University), Shipeng Yu (LinkedIn), Faisal Farooq (IBM)

Copyright © 2017 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags
- applied machine learning
- community detection
- computational genomics
- random forest classification
- scalability of large systems

Qualifiers
- research-article

Acceptance Rates

KDD '17 Paper Acceptance Rate: 64 of 748 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
