1. Yaniv Erlich 2/3/16 Rule Your Genome, Democratize Health @dl1dl1 Roots Tech, 3 Feb 2016
@dinazielinski
@dl1dl1
Dina Zielinski
5.
If it takes a village to raise a child…
…it takes the world to help a child
with a genetic disorder
image: Victor Ngai
6.
feedback reciprocity
genealogy
crowd sourcing
data sharing
7.
https://dna.land
- Free to use
- Not for profit
- Run by scientists
12.
legalgenealogist.com
“…from the standpoint of the rules of
the road, there’s no reason not to
consider playing in the DNA.Land
playground.”
Judy G. Russell, JD
13.
feedback reciprocity
genealogy
crowd sourcing
data sharing
23.
imputation
You had a blue ca_ yesterday.
26 possible solutions
8 make actual words
b, n, p, r, t?
You had a blue ca_ on your head.
1 probable solution: p
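The word-completion analogy on this slide can be sketched in a few lines. This is only a toy illustration of the idea (enumerate every candidate, keep the valid ones, let context pick the most probable); the vocabulary and the context scores below are made-up assumptions, not data or code from DNA.Land.

```python
import string

# Hypothetical toy vocabulary: the eight three-letter "ca_" words assumed here.
VOCAB = {"cab", "cad", "cam", "can", "cap", "car", "cat", "caw"}

# Hypothetical context scores: how plausible each word is near a context word.
CONTEXT_SCORES = {
    "head": {"cap": 0.90, "cat": 0.05, "can": 0.03, "cab": 0.01, "car": 0.01},
}


def candidates(pattern):
    """Return all 26 ways to fill the single '_' in the pattern."""
    return [pattern.replace("_", letter, 1) for letter in string.ascii_lowercase]


def real_words(pattern, vocab=VOCAB):
    """Keep only the candidates that are actual words in the vocabulary."""
    return [word for word in candidates(pattern) if word in vocab]


def impute(pattern, context_word):
    """Pick the valid word with the highest score for the given context."""
    scores = CONTEXT_SCORES.get(context_word, {})
    return max(real_words(pattern), key=lambda word: scores.get(word, 0.0))


print(len(candidates("ca_")), len(real_words("ca_")), impute("ca_", "head"))
```

Genotype imputation follows the same logic, with surrounding variants playing the role of the sentence context.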
24.
Power to detect recent common ancestry between pairs of individuals
known to be related at varying degrees.
Chad D. Huff et al. Genome Res. 2011;21:768-774
26.
Relatives of relatives
Diagram: Dena (Dena@dna.land), Bruce (bruce@dna.land), Cherie (Cherie@dna.land)
Legend: Actual match; Relative of relative
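One way to think about "relatives of relatives" is as a two-hop lookup in a graph of matches. The sketch below is an assumption about how such a lookup could work, not DNA.Land's implementation, and the specific pairing of Dena, Bruce, and Cherie is hypothetical because the slide's diagram did not survive extraction.

```python
from collections import defaultdict


def relatives_of_relatives(matches, person):
    """Return people two hops away from `person` who are not direct matches."""
    neighbors = defaultdict(set)
    for a, b in matches:
        neighbors[a].add(b)
        neighbors[b].add(a)

    direct = set(neighbors[person])
    two_hops = set()
    for relative in direct:
        two_hops |= neighbors[relative]
    # Exclude the person themselves and anyone who is already an actual match.
    return two_hops - direct - {person}


# Hypothetical match list: Dena matches Bruce, and Bruce matches Cherie.
matches = [("Dena", "Bruce"), ("Bruce", "Cherie")]
print(relatives_of_relatives(matches, "Dena"))
```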
30.
DNA.Land early adopters
Plot; x-axis: days
31.
CeCe Moore, genetic genealogist
Carl Zimmer, science writer
Henry Louis Gates, Jr., historian/journalist
AJ Jacobs, journalist/author
32.
info@dna.land
We value your feedback!
Richard Aufrichtig
support specialist
33.
From:
To: info@dna.land
Subject: Match relationships
on my DNA match predictions I thought it might be helpful to you to know
my cousin and I are estimated to be cousins, and
we are, in fact, cousins once removed. We have a documented
paper trail with many cousin marriages, so it is a case of endogamy.
Thanks for your ongoing research!
From:
To: info@dna.land
Subject: Ethnic Feedback
Hey, I just wanted to give ethnic feedback to help improve the
ancestry algorithm, at least, if the ethnic data Ancestry gave me from
the Autosomal Test is any help.
The regions are as follows:
% Europe West
% Scandinavia
% Ireland
% Italy/Greece
% Iberian Peninsula
<1% Caucasus
From:
To: info@dna.land
Subject: Feedback
Both and I have published our data on gedmatch.com, and
ftdna puts us in the to cousin range with shared
cM, with a longest block of . [...]
I know that the different companies use different defaults of cMs and
other data for comparison. It will be interesting to know what you find
out in comparing our data.
Thank you very much.
34.
facebook.com/knowyourgenome/
35.
feedback reciprocity
genealogy
crowd sourcing
data sharing
36.
Social media paradigm for large pedigrees
>40 million public profiles
IRB approval
Geni approval
37.
genealogy
crowd sourcing
data sharing
38.
Cleaning the graph
Each union should contain up to 2 individuals.
Ideal: (diagram; legend: Union, Individual)
What we see (0.4%): biologically impossible situations…
- >2 parents
- Cycles
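The two impossible situations named on this slide (more than two parents, and cycles) can be detected with a short graph check. This is a minimal sketch assuming the pedigree is stored as (parent, child) pairs; that representation and the function name are illustrative assumptions, not the project's actual data model.

```python
from collections import defaultdict


def find_violations(parent_edges):
    """Return (individuals with >2 parents, whether any cycle exists)."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in parent_edges:
        parents[child].add(parent)
        children[parent].add(child)

    too_many_parents = {c for c, ps in parents.items() if len(ps) > 2}

    # Iterative depth-first search; reaching an "open" node means a cycle.
    state = {}  # node -> "open" (on the current path) or "done"
    has_cycle = False
    for start in set(parents) | set(children):
        if start in state:
            continue
        state[start] = "open"
        stack = [(start, iter(children.get(start, ())))]
        while stack:
            node, pending = stack[-1]
            for nxt in pending:
                if state.get(nxt) == "open":
                    has_cycle = True  # someone is their own ancestor
                elif nxt not in state:
                    state[nxt] = "open"
                    stack.append((nxt, iter(children.get(nxt, ()))))
                    break
            else:
                state[node] = "done"
                stack.pop()
    return too_many_parents, has_cycle
```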
39.
Cleaning the graph in three steps
Pre-graph → Cycle removal → Merging nodes → Removing illegal nodes → Graph
Example pedigree nodes: Mike, Ann, Al, Anni, Bert, Betty, Charlie, Ed, Fred, Bert, Charlie, Diane, Eddie, Frank, Victor, Brad, Chris, Sam, Hillary
40.
Cleaning the graph: removing illegal nodes
Pre-graph → Cycle removal → Merging nodes → Removing illegal nodes → Graph
Example pedigree nodes: Mike, Ann, Al, Bert, Betty, Charlie, Ed, Fred, Bert, Charlie, Diane, Eddie, Frank
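The three cleaning steps named on these slides (cycle removal, merging nodes, removing illegal nodes) could be chained roughly as follows. This is a toy sketch under stated assumptions: edges are (parent, child) pairs, duplicates are supplied as a hypothetical `alias` mapping, and "illegal" means more than two parents. The slides do not say how the actual pipeline implements each step.

```python
from collections import defaultdict


def remove_cycles(edges):
    """Drop every edge that would close a directed cycle (DFS back edges)."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    state = {}  # node -> "open" (on the DFS path) or "done"
    kept = []

    def visit(node):  # recursive for brevity; fine for toy-sized graphs only
        state[node] = "open"
        for child in children.get(node, []):
            if state.get(child) == "open":
                continue  # back edge: skip it instead of closing the cycle
            kept.append((node, child))
            if child not in state:
                visit(child)
        state[node] = "done"

    for node in sorted({n for edge in edges for n in edge}):
        if node not in state:
            visit(node)
    return kept


def merge_nodes(edges, alias):
    """Rewrite duplicate profiles to one id; `alias` maps duplicate -> kept id."""
    merged = {(alias.get(p, p), alias.get(c, c)) for p, c in edges}
    return sorted(edge for edge in merged if edge[0] != edge[1])


def remove_illegal(edges, max_parents=2):
    """Remove individuals with more than `max_parents` parents, and their edges."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    bad = {c for c, ps in parents.items() if len(ps) > max_parents}
    return [(p, c) for p, c in edges if p not in bad and c not in bad]


def clean(edges, alias):
    """Apply the three steps in the order shown on the slide."""
    return remove_illegal(merge_nodes(remove_cycles(edges), alias))
```

For example, with a made-up pre-graph in which Charlie is listed as Ann's parent (closing a cycle), Eddie duplicates Ed, and Fred has three parents, `clean` keeps only the consistent edges.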
41.
Can we obtain large family trees?
A family tree with 6000 people (legend: Individual, Marriage)
Family tree of 13 million people…
Figure dimensions: 1440 px, 900 px, ~1 million px
70,000 (0.5% of the data)
42.
Validation using genetic markers
                        Maternal line (mito)   Paternal line (y-chr)
Total edges (meioses)   1768                   324
Mismatches              5                      6
Error rate per edge     0.3%                   2.0%
Andreson, 2006
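The per-edge error rate in this table can be read as mismatches divided by total edges (meioses). The sketch below assumes exactly that plain ratio. It gives about 0.3% for the maternal line and about 1.9% for the paternal line, slightly below the 2.0% printed on the slide, so the slide's figure may reflect rounding or an adjustment this sketch does not model.

```python
def error_rate_per_edge(mismatches, total_edges):
    """Plain ratio of mismatching edges to all checked edges (meioses)."""
    if total_edges <= 0:
        raise ValueError("total_edges must be positive")
    return mismatches / total_edges


# Counts taken from the slide's table.
mito = error_rate_per_edge(5, 1768)
ychr = error_rate_per_edge(6, 324)
print(f"maternal (mito): {mito:.2%}, paternal (y-chr): {ychr:.2%}")
```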
44.
Longevity: further validation
Plot: age of death (10 to 100), 1910; series: HMD, Geni
45.
Validating life expectancy
Plot: Avg. lifespan (45 to 80) vs. year of death (1840 to 1990); panels: Oeppen et al., Science, 2002 and Our resource
R² = 0.96
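An R² like the one on this slide is commonly computed as the squared Pearson correlation between two series. The sketch below assumes that definition and assumes the two inputs are average lifespans per year of death from the reference curve and from the genealogy data; the slide does not state how its value was computed, and no real data is included here.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)
```

A value near 1 means the two curves rise and fall together year by year.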
46.
Use case: the genetics of longevity
MZ Twins
Sibs
2nd cousins
3rd-5th cousins
Avuncular
1st cousins
Relatives from
consanguineous
marriages
>1 million Geni profiles with date of birth and death
47.
Big data visualization: lifespan
Figure: lifespan vs. year of birth (1550 to 1950) and year of death (1600 to 2000); color scale: fraction of profiles/year [log10]
Figure: Comparing Geni to HMD; histograms and QQ plot of age of death (% profiles, # profiles)
48.
Environment: location
event + location    # events
BIRTH               7,352,478
RESIDENCES          1,667,895
DEATH               1,492,908
BURIAL              314,344
Examples:
“died in infancy, Upshur, (West) Virginia”
“Санкт-Петербург, Россия” (Saint Petersburg, Russia)
How to convert free text into GPS coordinates?
“Санкт-Петербург, Россия” → Yahoo! Geoparser → Lat: 59.9408, Long: 29.6728, Quality: 10
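The free-text-to-coordinates step can be pictured as a lookup that returns a latitude, a longitude, and a quality score, followed by a quality filter. The sketch below uses a tiny in-memory stand-in for the geoparsing service; it does not call or imitate the Yahoo! Geoparser API, and the `min_quality` threshold and all names are hypothetical. The one entry reuses the coordinates shown on the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GeoResult:
    lat: float
    lon: float
    quality: int


# Stand-in gazetteer (assumption): a real pipeline would query a geoparser.
TOY_GAZETTEER = {
    "Санкт-Петербург, Россия": GeoResult(59.9408, 29.6728, 10),
}


def toy_geocoder(text: str) -> Optional[GeoResult]:
    """Look the free text up in the toy gazetteer."""
    return TOY_GAZETTEER.get(text.strip())


def locate(
    text: str,
    geocoder: Callable[[str], Optional[GeoResult]] = toy_geocoder,
    min_quality: int = 10,  # hypothetical cutoff; the slide gives no threshold
) -> Optional[GeoResult]:
    """Return coordinates only when the geocoder is confident enough."""
    result = geocoder(text)
    if result is None or result.quality < min_quality:
        return None
    return result
```

Passing the geocoder in as a function keeps the filtering logic independent of whichever service is actually used.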
53.
Quantitative anthropology
Where is the love of your life? (plot; x-axis: Year of birth)
Who is the love of your life? (plot; x-axis: Year of birth)
~4th cousins
54.
genealogy
crowd sourcing
data sharing
55.
13,900 genomes and counting!
genealogy
crowd sourcing
data sharing
It takes more than a village
57.
Acknowledgements
DNA Land: Yaniv Erlich, Joe Pickrell, Assaf Gordon, Jie Yuan, Tristan Hayeck, Richard Munoz, Mary Wahl, Kevin Shi, Nathan Pearson, Robert Aboukhalil
Goldenhar Syndrome: Barak Marcus, Mona Sheikh, Balaji Srinivasan, Clement Chu, Melissa Gymrek, Dror Aizenbud
Funding: Burroughs Wellcome Career Award, Broad Institute SPARC Award, Andria and Paul Heafy, Whitehead Institute
Geni: Joanna Kaplanis, Assaf Gordon, Mary Wahl, Mickey Gershovits, Mona Sheikh, Barak Marcus, Pratheek Nagaraj, Alkes Price, Daniel MacArthur
Editor's notes
The reason I’m sharing this story is that this study is just one small-scale example of the importance of data sharing.
What is imputation?
- Common words are easier to impute. Rare words are hard, as with genetic variants.
- Success rates vary across populations.
DNA.Land is nothing without use
- Responses on social media are crucial to improving and refining our algorithms.
We are indebted to these early adopters and to anyone who sends feedback. You, the users, are our most valuable asset.
We’re listening!
1. relative matching: feedback from users helps us validate our relative-matching algorithms and also unearths interesting family structures
2. ancestry: feedback about the expected ethnicity helps us validate and improve the ancestry detection algorithm
3. segment sharing: some users even go as far as providing us with their results from other websites, and that is truly helpful in refining our pipeline parameters. We are very grateful for those.
Collecting data is not enough. How can we up our game and what can we do with the data?
Note: this was ONLY publicly available data, which is available online and whose use was approved by both Geni and our IRB.
We’ve implemented lessons we learned from previous work in DNA.Land that converged on these 3 topics.
Data is very noisy.
We need to 1. clean it and 2. validate it before we can draw any conclusions.
After we clean the data we have this enormous pedigree. Is it correct?
1st validation step = what is Obama’s Bacon number? 6th cousin twice removed
1. Record hyping
Cleaned data but is it correct?
Geni profile concordance with known geographic settlements
place of birth distances (log scale) between sibs, cousins, parent-child
Take-home: 5th cousins, <1000 km away. People don't move that much.
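Distances between places of birth, as mentioned in these notes, can be computed from latitude/longitude pairs with the standard haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; the notes do not say which distance measure the analysis actually used.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))
```

Applied to each pair of relatives, such distances can then be compared across relationship types (sibs, cousins, parent-child) on a log scale.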
We saw how to actually clean and validate crowd-sourced data from >40 million public profiles.
The data is scientifically usable.