J. B. Cole
Animal Improvement Programs Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350
john.cole@ars.usda.gov
Data Structures and Visualization
Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011 (2) Cole
Introduction
• We’re drowning in information
• Genetics are viewed as a commodity
• We need to get better data from
fewer cows
• Do we have the resources we need?
U.S. dairy population
Figure: number of U.S. dairy cows (millions) by year, 1940 to 2000.
We need to do more with less
• 47% of U.S. dairy cows are enrolled
in DHIA testing
• The Class III milk price is $17/cwt
• Grain prices are very high
 Corn averaged $6/bu in May
 Soybeans averaged $13/bu in May
• Enrollment and cow numbers are
unlikely to increase
Major topics
• Different sources of data
• Data source integration and quality
• Data mining models
• Visualization examples
• Computational resources
Data currently in national database
• Identification and registration
• Conformation scores
• Milk production and composition
• Fertility
• Longevity
• Some genotypes
What are big data?
Type of record                Number of records¹
Cows with lactation data              28,394,976
Lactations                            68,373,863
Individual test days                 508,574,732
Calving ease records                  20,770,758
Animals in pedigree file              58,893,009
Bull genotypes                            50,393
Cow genotypes                             70,687
¹Totals include animals from all breeds.
Data not routinely available
• Farm and herd management
 Geography and climate
 Housing systems
 Feed intake
• Milk composition
 Milk fats, proteins, vitamins, minerals
 Conductivity, lactose, MUN
• DNA data
 Cow SNP genotypes, DNA sequence data
Photo: NOAA
Data “trapped” on the farm
• Fertility
 Insemination information
 Use of estrus synchronization
• Cow health and longevity
 Body condition scores
 Birth weights and mature weights
 Disease occurrence data
Electronic milk meters
• Currently can provide—
 Milk yield
 Milking speed
 Electrical conductivity
• May possibly supply—
 Progesterone levels
 Milk temperature
 Fat and protein concentrations
Photo: afimilk
Other sources of data
• RFID tags have lower ID
error rates associated with
meter data
• Pedometers are useful for
detecting estrus, the
onset of calving, and
some early-stage
disease
Top: Allflex; Bottom: afimilk
Current sources of data
Diagram nodes: AIPL, CDCB, NAAB, PDCA, DHI, universities
AIPL Animal Improvement Programs Lab., USDA
CDCB Council on Dairy Cattle Breeding
DHI Dairy Herd Improvement (milk recording organizations)
NAAB National Association of Animal Breeders (AI)
PDCA Purebred Dairy Cattle Association (breed registries)
Sources of genomic data
Diagram nodes: AIPL; requesters (e.g., AI, breeds); dairy producers; DNA laboratories. Arrow label: samples.
Data source integration
• Incoming data from different sources
are checked against one another
• The AIPL edits system consists of
~64,000 SLOC
 Mostly C, some Fortran 90
• Data stored in a relational database
Typical edits
• Match birth date with dam’s calving date
• Compare with other sources (e.g. breed
association)
• Investigate maternal sibs born within 9
mo (may assume ET)
• Animals with IDs within 100 of each other and the same
sire, dam, and birth date are assumed to be twins
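The edits listed above could be sketched as simple record checks. This is a minimal illustration, not the AIPL edits system; the `Animal` record, its field names, the function names, and the 270-day stand-in for 9 months are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Animal:
    """Hypothetical pedigree record; field names are illustrative only."""
    animal_id: int
    sire_id: int
    dam_id: int
    birth: date


def birth_matches_calving(animal, dam_calvings):
    """The birth date should coincide with one of the dam's calving dates."""
    return animal.birth in dam_calvings


def close_maternal_sibs(a, b, max_days=270):
    """Maternal sibs born on different dates within about 9 months
    (270 days is an assumed cutoff) need investigation, e.g. for ET."""
    gap = abs((a.birth - b.birth).days)
    return a.dam_id == b.dam_id and 0 < gap <= max_days


def assumed_twins(a, b, max_id_gap=100):
    """IDs within 100 with the same sire, dam, and birth date."""
    return (
        a.animal_id != b.animal_id
        and abs(a.animal_id - b.animal_id) <= max_id_gap
        and (a.sire_id, a.dam_id, a.birth) == (b.sire_id, b.dam_id, b.birth)
    )


# Made-up records for illustration.
calf_1 = Animal(1001, 10, 20, date(2010, 3, 1))
calf_2 = Animal(1002, 10, 20, date(2010, 3, 1))
calf_3 = Animal(5000, 11, 20, date(2010, 8, 15))
```

Here `calf_1` and `calf_2` would be treated as twins, while `calf_1` and `calf_3` share a dam and were born too close together, so that pair would be flagged for review.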
How do we assess data quality?
• Consistency
 e.g., calving, progeny birth,
breeding, dry dates
• Parentage verification
• Electronic ID
• Within-herd heritability
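A consistency check of the kind listed above might compare event dates within one lactation. The sketch below assumes the expected order is calving, breeding, dry-off, then the next calving, and that the progeny's birth date should equal that next calving date; the function name and arguments are hypothetical.

```python
from datetime import date


def dates_consistent(calving, breeding, dry, next_calving, progeny_birth):
    """True when events fall in the assumed order and the progeny's
    birth date equals the next calving date."""
    in_order = calving < breeding < dry < next_calving
    return in_order and progeny_birth == next_calving


# Made-up dates for one cow.
ok = dates_consistent(
    calving=date(2010, 1, 10),
    breeding=date(2010, 4, 1),
    dry=date(2010, 11, 5),
    next_calving=date(2011, 1, 8),
    progeny_birth=date(2011, 1, 8),
)
```

Records that fail such a check would be candidates for review rather than automatic rejection, since the error could sit in any one of the dates.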
Data mining
• The discovery of useful, possibly
unexpected patterns in data
• Four principal tasks
 Association
 Clustering
 Classification
 Regression
Bonferroni’s principle
• You will find interesting patterns if
you look hard enough
• Not all relationships are legitimate
• You must have enough data to
support the questions you’re
asking
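A small calculation illustrates why looking hard enough always turns something up. The sketch below computes how many tests would look significant purely by chance, together with the Bonferroni-corrected per-test threshold, a related and commonly used adjustment. The marker count and the 0.05 level are illustrative assumptions, not values from an actual analysis.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests significant by chance when no effect is real."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_markers = 43_382  # illustrative number of independent tests
chance_hits = expected_false_positives(n_markers, 0.05)
per_test_alpha = bonferroni_threshold(n_markers, 0.05)
print(f"expected chance hits: {chance_hits:.0f}")
print(f"corrected per-test alpha: {per_test_alpha:.2e}")
```

With tens of thousands of tests, a couple of thousand "interesting" results are expected even when nothing real is present, which is the point of the principle.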
Association analysis
• Discover interesting relationships
among variables in large databases
 e.g., predicting protein function and
identifying SNP-disease associations
 Not statistical association analysis!
• Lots of algorithms, many based on
counting attributes
• Watch for false positives
 Measures co-occurrence, not causality
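Counting-based association analysis can be sketched with two standard measures: support (how often a pair of attributes occurs together) and confidence (how often B occurs given A). The function name, the 0.5 support cutoff, and the health-event records below are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Count items and item pairs, then report rules A -> B as
    (support, confidence) for pairs meeting the support cutoff."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical events recorded per cow (made-up data).
records = [
    {"mastitis", "high_scc"},
    {"mastitis", "high_scc", "lameness"},
    {"high_scc"},
    {"lameness"},
]
rules = pair_rules(records)
```

In this toy data the rule mastitis -> high_scc has confidence 1.0, but that only says the two were recorded together; it says nothing about which causes which.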
Clustering
• Place items into distinct groups
such that
 Items in a group are similar
 Items in one group are dissimilar to
those in other groups
• Hierarchical or partitional
approaches
Partitional clustering
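One widely used partitional method is k-means. The slide does not name a specific algorithm, so this is offered only as a representative sketch; the data, the choice of k, and the fixed starting centroids are assumptions.

```python
import math


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with the first k points as starting centroids
    (a simplification; real analyses usually use random restarts)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Made-up (milk yield in kg/day, somatic cell score) pairs.
cows = [(20.0, 2.0), (40.0, 5.0), (21.0, 2.5), (41.0, 5.5), (19.0, 1.5), (39.0, 4.5)]
labels, centroids = kmeans(cows, k=2)
```

Each cow ends up in exactly one group, which is what distinguishes a partitional result from a nested, hierarchical one.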
Hierarchical clustering
• Nested clusters organized into
hierarchical trees
• Data objects may belong to
multiple subsets
• Examples
 Relationships among species
 Evolutionary history of proteins
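The nesting can be seen in a minimal agglomerative sketch: every item starts as its own cluster, and the two closest clusters are merged repeatedly until one remains. Single linkage is assumed here as the distance rule, and the data are made up; the slides do not specify a method.

```python
import math


def single_linkage(points):
    """Merge the two closest clusters repeatedly, recording each merge as
    (members of first cluster, members of second cluster, distance)."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[x], points[y])
                    for x in clusters[a]
                    for y in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Made-up one-dimensional trait values for four items.
merges = single_linkage([(0.0,), (1.0,), (5.0,), (5.5,)])
```

The merge list is the tree: items 2 and 3 join first, then 0 and 1, and finally the two pairs join, so each item belongs to several nested subsets.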
• BFGL-Illumina: deep SNP discovery
 Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir
• BFGL: genome assemblies
 Nelore, Water Buffalo
• Pfizer: light SNP discovery
 Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
• Partners: deep SNP discovery
 N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard
Classification
• Training set used to develop a rule
for assigning individuals to classes
• Validation set used to assess the
accuracy of the classification rule
• Examples
 Identify cows with subclinical mastitis
 Mate assignment
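The training and validation roles can be illustrated with a one-nearest-neighbor classifier, chosen here only because it is short. The somatic cell scores and class labels are made-up data, and the function names are hypothetical.

```python
def nearest_neighbor(train, x):
    """Return the class of the training record closest to x (1-NN)."""
    _, label = min(train, key=lambda rec: abs(rec[0] - x))
    return label


def accuracy(train, validation):
    """Share of validation records whose class is predicted correctly."""
    hits = sum(nearest_neighbor(train, x) == y for x, y in validation)
    return hits / len(validation)


# Made-up (somatic cell score, status) records.
train = [(1.0, "healthy"), (2.0, "healthy"), (5.0, "subclinical"), (6.0, "subclinical")]
validation = [(1.5, "healthy"), (5.5, "subclinical"), (3.0, "subclinical"), (2.5, "healthy")]
validation_accuracy = accuracy(train, validation)
```

The rule is built from `train` alone; `validation` is held back so the accuracy estimate is not inflated by records the rule has already seen.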
Classification methods
• Bayesian belief networks
• Decision trees
• Nearest-neighbor classification
• Neural networks
• Rule-based classification
• Support vector machines
Decision tree classification
Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
Rule-based classification
• Classify records using a series of
“if…then” rules
• Rules come directly from the data,
or from other classification models
• e.g., if (PTA NM$ ≥ $800) and (EFI ≤
0.05) then (breed to cow)
• Easy to generate and interpret
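The example rule above translates almost directly into code. The function name and the fallback return value are assumptions; the slide gives only the "if" branch.

```python
def mating_rule(pta_nm_dollars, efi):
    """Encode the example rule:
    if (PTA NM$ >= $800) and (EFI <= 0.05) then (breed to cow)."""
    if pta_nm_dollars >= 800 and efi <= 0.05:
        return "breed to cow"
    return "no decision"  # assumed fallback; not specified in the rule


decision = mating_rule(pta_nm_dollars=850, efi=0.04)
```

A full rule-based classifier would hold an ordered list of such rules and apply the first one that matches.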
Regression models
• Prediction of real-valued outputs
• Given one or more attributes, we
can predict, for example—
 Breeding values
 Feed intake
 Milk and components yields
• Very mature analytical tools
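As a minimal sketch of predicting a real-valued output from one attribute, the code below fits a straight line by ordinary least squares. The milk yield and feed intake numbers are invented for the example, and practical models for these traits would be considerably richer.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data: daily milk yield (kg) and feed intake (kg dry matter).
milk = [20.0, 30.0, 40.0]
intake = [18.0, 22.0, 26.0]
intercept, slope = fit_line(milk, intake)
predicted_intake = intercept + slope * 35.0  # prediction for a 35 kg/day cow
```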
Visualization
• How do we present lots of numbers
in a compact form?
• “Graphical methods can retain the
information in the data.” ― Deming
• Complements numerical
techniques
 Tukey (1977), Tufte (1983, 1990,
1997, 2006), Cleveland (1985,
1993), Wickham (2009)
One image, millions of points
43,382 SNP solutions x 4,064 animals = 176,304,448 data points
Use size to denote importance
Colors differentiate among chromosomes, and marker sizes are proportional to effect sizes.
O-Style Haplotypes (chromosome 15)
Correlations among calving traits
Provide multiple cues
Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
Lines are differentiated by color and pattern.
Interstitial figures
Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
Computational capacity is abundant
Wikimedia Commons, Wgsimon, Transistor_Count_and_Moore's_Law_-_2011.svg
Supercomputer performance
• Cray-1 (1976): 136
megaFLOPS (10^6)
• Fujitsu K machine
(2011): 8.16
petaFLOPS (10^15)
• Commodity hardware
has also experienced
gains in performance
Top: Sherwin Gooch; Bottom: Riken
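The size of that gain is easier to appreciate as a ratio. The arithmetic below uses only the two figures quoted above; the implied doubling time is a simple average between two data points, not a fitted trend.

```python
import math

cray_1_flops = 136e6        # 136 megaFLOPS (1976)
k_machine_flops = 8.16e15   # 8.16 petaFLOPS (2011)

speedup = k_machine_flops / cray_1_flops
years = 2011 - 1976
# Average doubling time implied by the two data points.
doubling_time_years = years * math.log(2) / math.log(speedup)

print(f"speedup: {speedup:.1e}")
print(f"implied doubling time: {doubling_time_years:.2f} years")
```

The ratio is about 60 million, which corresponds to performance doubling roughly every 1.35 years over the 35-year span.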
Storage costs are plummeting
Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
Data storage technologies
• Storage costs are now as
low as $100/TB
 Quality costs!
• Solid state disks are
promising, but relatively
low-capacity
• What do you do about
backups?
Top: Snopes/IBM; Bottom: Tom’s Hardware
Memory is very cheap
Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
Random access memory
• RAM is still much
faster than disk (ns
vs. ms access times)
• A 64-bit OS can
address 16 EiB (about
18.4 EB), in theory
• How much can your
motherboard hold?
Top: Stan Yack; Bottom: Samsung
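The theoretical limit for 64-bit addressing is 2^64 bytes. The short calculation below expresses that in both binary (EiB) and decimal (EB) units; the variable names are only for illustration.

```python
address_space_bytes = 2 ** 64

in_eib = address_space_bytes / 2 ** 60   # binary exbibytes
in_eb = address_space_bytes / 10 ** 18   # decimal exabytes

print(f"{address_space_bytes:,} bytes")
print(f"{in_eib:.0f} EiB, or about {in_eb:.1f} EB")
```

In practice the usable limit is far lower, because processors, operating systems, and motherboards each impose their own ceilings.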
Software
• Complexity is increasing
 Parallelism is hard and debugging is
much harder
• Productive developers are expensive
and difficult to find
 A top programmer may be 10x as
productive as an average worker
Conclusions
• The more data we get, the more data
we want
• Relationships among traits may become
as important as individual traits
• Software may be more limiting than
hardware
Questions?

More Related Content

Similar to Data Structures and Visualization

Idcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckleIdcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckleEric Kansa
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisUniversity of Washington
 
Using the Semantic Web to Support Ecoinformatics
Using the Semantic Web to Support EcoinformaticsUsing the Semantic Web to Support Ecoinformatics
Using the Semantic Web to Support Ecoinformaticsebiquity
 
AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...
AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...
AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...Peter McQuilton
 
Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...
Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...
Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...Jeffrey Bewley
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchGreg Landrum
 
Finding and accessing human genome data with Repositive
Finding and accessing human genome data with RepositiveFinding and accessing human genome data with Repositive
Finding and accessing human genome data with RepositiveManuel Corpas
 
SciDataCon - How to increase accessibility and reuse for clinical and persona...
SciDataCon - How to increase accessibility and reuse for clinical and persona...SciDataCon - How to increase accessibility and reuse for clinical and persona...
SciDataCon - How to increase accessibility and reuse for clinical and persona...Fiona Nielsen
 
Final From journal on website
Final From journal on websiteFinal From journal on website
Final From journal on websiteMichael Clawson
 
Strata 2011 - Real world apps panel - IPUMS International
Strata 2011 - Real world apps panel - IPUMS InternationalStrata 2011 - Real world apps panel - IPUMS International
Strata 2011 - Real world apps panel - IPUMS InternationalPete Clark
 
Setting the stage with beginning data analyses
Setting the stage with beginning data analysesSetting the stage with beginning data analyses
Setting the stage with beginning data analyseshuebner14
 
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...GigaScience, BGI Hong Kong
 
Workshop finding and accessing data - fiona nadia charlotte - cambridge apr...
Workshop   finding and accessing data - fiona nadia charlotte - cambridge apr...Workshop   finding and accessing data - fiona nadia charlotte - cambridge apr...
Workshop finding and accessing data - fiona nadia charlotte - cambridge apr...Fiona Nielsen
 
Big Data Initiatives for Agroecosystems
Big Data Initiatives for AgroecosystemsBig Data Initiatives for Agroecosystems
Big Data Initiatives for AgroecosystemsCyndy Parr
 
Highly dimensional data_20160926
Highly dimensional data_20160926Highly dimensional data_20160926
Highly dimensional data_20160926Laura Clarke
 
ICG-11 - genomic data projects around the world - nov 5 2016
ICG-11 - genomic data projects around the world - nov 5 2016ICG-11 - genomic data projects around the world - nov 5 2016
ICG-11 - genomic data projects around the world - nov 5 2016Fiona Nielsen
 

Similar to Data Structures and Visualization (20)

Idcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckleIdcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckle
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and Analysis
 
Using the Semantic Web to Support Ecoinformatics
Using the Semantic Web to Support EcoinformaticsUsing the Semantic Web to Support Ecoinformatics
Using the Semantic Web to Support Ecoinformatics
 
AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...
AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...
AMIA Webinar - BioSharing - Mapping the landscape of standards in the life sc...
 
Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...
Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...
Potential for New Dairy Cattle Phenotypic Data from Automated Technology Meas...
 
Data 101: A Gentle Introduction
Data 101: A Gentle IntroductionData 101: A Gentle Introduction
Data 101: A Gentle Introduction
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Finding and accessing human genome data with Repositive
Finding and accessing human genome data with RepositiveFinding and accessing human genome data with Repositive
Finding and accessing human genome data with Repositive
 
SciDataCon - How to increase accessibility and reuse for clinical and persona...
SciDataCon - How to increase accessibility and reuse for clinical and persona...SciDataCon - How to increase accessibility and reuse for clinical and persona...
SciDataCon - How to increase accessibility and reuse for clinical and persona...
 
Hands-on Introduction to Machine Learning
Hands-on Introduction to Machine LearningHands-on Introduction to Machine Learning
Hands-on Introduction to Machine Learning
 
Final From journal on website
Final From journal on websiteFinal From journal on website
Final From journal on website
 
Strata 2011 - Real world apps panel - IPUMS International
Strata 2011 - Real world apps panel - IPUMS InternationalStrata 2011 - Real world apps panel - IPUMS International
Strata 2011 - Real world apps panel - IPUMS International
 
Setting the stage with beginning data analyses
Setting the stage with beginning data analysesSetting the stage with beginning data analyses
Setting the stage with beginning data analyses
 
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott...
 
A Stocktake of New Zealand’s Healthcare Datasets
A Stocktake of New Zealand’s Healthcare DatasetsA Stocktake of New Zealand’s Healthcare Datasets
A Stocktake of New Zealand’s Healthcare Datasets
 
Workshop finding and accessing data - fiona nadia charlotte - cambridge apr...
Workshop   finding and accessing data - fiona nadia charlotte - cambridge apr...Workshop   finding and accessing data - fiona nadia charlotte - cambridge apr...
Workshop finding and accessing data - fiona nadia charlotte - cambridge apr...
 
Big Data Initiatives for Agroecosystems
Big Data Initiatives for AgroecosystemsBig Data Initiatives for Agroecosystems
Big Data Initiatives for Agroecosystems
 
Highly dimensional data_20160926
Highly dimensional data_20160926Highly dimensional data_20160926
Highly dimensional data_20160926
 
SOC2002 Lecture 6
SOC2002 Lecture 6SOC2002 Lecture 6
SOC2002 Lecture 6
 
ICG-11 - genomic data projects around the world - nov 5 2016
ICG-11 - genomic data projects around the world - nov 5 2016ICG-11 - genomic data projects around the world - nov 5 2016
ICG-11 - genomic data projects around the world - nov 5 2016
 

More from John B. Cole, Ph.D.

Using genotypes to construct phenotypes for dairy cattle breeding programs an...
Using genotypes to construct phenotypes for dairy cattle breeding programs an...Using genotypes to construct phenotypes for dairy cattle breeding programs an...
Using genotypes to construct phenotypes for dairy cattle breeding programs an...John B. Cole, Ph.D.
 
If we would see further than others: research & technology today and tomorrow
If we would see further than others: research & technology today and tomorrowIf we would see further than others: research & technology today and tomorrow
If we would see further than others: research & technology today and tomorrowJohn B. Cole, Ph.D.
 
Using genotyping and whole-genome sequencing to identify causal variants asso...
Using genotyping and whole-genome sequencing to identify causal variants asso...Using genotyping and whole-genome sequencing to identify causal variants asso...
Using genotyping and whole-genome sequencing to identify causal variants asso...John B. Cole, Ph.D.
 
Genetic improvement programs for US dairy cattle
Genetic improvement programs for US dairy cattleGenetic improvement programs for US dairy cattle
Genetic improvement programs for US dairy cattleJohn B. Cole, Ph.D.
 
The hunt for a functional mutation affecting conformation and calving traits ...
The hunt for a functional mutation affecting conformation and calving traits ...The hunt for a functional mutation affecting conformation and calving traits ...
The hunt for a functional mutation affecting conformation and calving traits ...John B. Cole, Ph.D.
 
An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...John B. Cole, Ph.D.
 
An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...John B. Cole, Ph.D.
 
Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...
Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...
Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...John B. Cole, Ph.D.
 
Stillbirth, Longevity and Fertility Update
Stillbirth, Longevity and Fertility UpdateStillbirth, Longevity and Fertility Update
Stillbirth, Longevity and Fertility UpdateJohn B. Cole, Ph.D.
 
New tools for genomic selection in dairy cattle
New tools for genomic selection in dairy cattleNew tools for genomic selection in dairy cattle
New tools for genomic selection in dairy cattleJohn B. Cole, Ph.D.
 
Opportunities for genetic improvement of health and fitness traits
Opportunities for genetic improvement of health and fitness traitsOpportunities for genetic improvement of health and fitness traits
Opportunities for genetic improvement of health and fitness traitsJohn B. Cole, Ph.D.
 
Genomic selection and systems biology – lessons from dairy cattle breeding
Genomic selection and systems biology – lessons from dairy cattle breedingGenomic selection and systems biology – lessons from dairy cattle breeding
Genomic selection and systems biology – lessons from dairy cattle breedingJohn B. Cole, Ph.D.
 
Use of NGS to identify the causal variant associated with a complex phenotype
Use of NGS to identify the causal variant associated with a complex phenotypeUse of NGS to identify the causal variant associated with a complex phenotype
Use of NGS to identify the causal variant associated with a complex phenotypeJohn B. Cole, Ph.D.
 
Genomic evaluation of dairy cattle health
Genomic evaluation of dairy cattle healthGenomic evaluation of dairy cattle health
Genomic evaluation of dairy cattle healthJohn B. Cole, Ph.D.
 
Uso e valore economico dei test genomici in azienda
Uso e valore economico dei test genomici in aziendaUso e valore economico dei test genomici in azienda
Uso e valore economico dei test genomici in aziendaJohn B. Cole, Ph.D.
 
The use and economic value of genomic testing for calves on dairy farms
The use and economic value of genomic testing for calves on dairy farmsThe use and economic value of genomic testing for calves on dairy farms
The use and economic value of genomic testing for calves on dairy farmsJohn B. Cole, Ph.D.
 
Genomic evaluation of low-heritability traits: dairy cattle health as a model
Genomic evaluation of low-heritability traits: dairy cattle health as a modelGenomic evaluation of low-heritability traits: dairy cattle health as a model
Genomic evaluation of low-heritability traits: dairy cattle health as a modelJohn B. Cole, Ph.D.
 
New applications of genomic technology in the US dairy industry
New applications of genomic technology in the US dairy industryNew applications of genomic technology in the US dairy industry
New applications of genomic technology in the US dairy industryJohn B. Cole, Ph.D.
 

More from John B. Cole, Ph.D. (20)

Crv 2015 jbc
Crv 2015 jbcCrv 2015 jbc
Crv 2015 jbc
 
Using genotypes to construct phenotypes for dairy cattle breeding programs an...
Using genotypes to construct phenotypes for dairy cattle breeding programs an...Using genotypes to construct phenotypes for dairy cattle breeding programs an...
Using genotypes to construct phenotypes for dairy cattle breeding programs an...
 
2015 AGIL Update
2015 AGIL Update2015 AGIL Update
2015 AGIL Update
 
If we would see further than others: research & technology today and tomorrow
If we would see further than others: research & technology today and tomorrowIf we would see further than others: research & technology today and tomorrow
If we would see further than others: research & technology today and tomorrow
 
Using genotyping and whole-genome sequencing to identify causal variants asso...
Using genotyping and whole-genome sequencing to identify causal variants asso...Using genotyping and whole-genome sequencing to identify causal variants asso...
Using genotyping and whole-genome sequencing to identify causal variants asso...
 
Genetic improvement programs for US dairy cattle
Genetic improvement programs for US dairy cattleGenetic improvement programs for US dairy cattle
Genetic improvement programs for US dairy cattle
 
The hunt for a functional mutation affecting conformation and calving traits ...
The hunt for a functional mutation affecting conformation and calving traits ...The hunt for a functional mutation affecting conformation and calving traits ...
The hunt for a functional mutation affecting conformation and calving traits ...
 
An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...
 
An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...An updated version of lifetime net merit incorporating additional fertility t...
An updated version of lifetime net merit incorporating additional fertility t...
 
Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...
Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...
Genetic Evaluation of Stillbirth in US Holsteins Using a Sire-maternal Grands...
 
Stillbirth, Longevity and Fertility Update
Stillbirth, Longevity and Fertility UpdateStillbirth, Longevity and Fertility Update
Stillbirth, Longevity and Fertility Update
 
New tools for genomic selection in dairy cattle
New tools for genomic selection in dairy cattleNew tools for genomic selection in dairy cattle
New tools for genomic selection in dairy cattle
 
Opportunities for genetic improvement of health and fitness traits
Opportunities for genetic improvement of health and fitness traitsOpportunities for genetic improvement of health and fitness traits
Opportunities for genetic improvement of health and fitness traits
 
Genomic selection and systems biology – lessons from dairy cattle breeding
Genomic selection and systems biology – lessons from dairy cattle breedingGenomic selection and systems biology – lessons from dairy cattle breeding
Genomic selection and systems biology – lessons from dairy cattle breeding
 
Use of NGS to identify the causal variant associated with a complex phenotype
Use of NGS to identify the causal variant associated with a complex phenotypeUse of NGS to identify the causal variant associated with a complex phenotype
Use of NGS to identify the causal variant associated with a complex phenotype
 
Genomic evaluation of dairy cattle health
Genomic evaluation of dairy cattle healthGenomic evaluation of dairy cattle health
Genomic evaluation of dairy cattle health
 
Uso e valore economico dei test genomici in azienda
Uso e valore economico dei test genomici in aziendaUso e valore economico dei test genomici in azienda
Uso e valore economico dei test genomici in azienda
 
The use and economic value of genomic testing for calves on dairy farms
The use and economic value of genomic testing for calves on dairy farmsThe use and economic value of genomic testing for calves on dairy farms
The use and economic value of genomic testing for calves on dairy farms
 
Genomic evaluation of low-heritability traits: dairy cattle health as a model
Genomic evaluation of low-heritability traits: dairy cattle health as a modelGenomic evaluation of low-heritability traits: dairy cattle health as a model
Genomic evaluation of low-heritability traits: dairy cattle health as a model
 
New applications of genomic technology in the US dairy industry
New applications of genomic technology in the US dairy industryNew applications of genomic technology in the US dairy industry
New applications of genomic technology in the US dairy industry
 

Data Structures and Visualization

  • 1. Data Structures and Visualization
      J. B. Cole, Animal Improvement Programs Laboratory, Agricultural Research Service, USDA, Beltsville, MD 20705-2350, john.cole@ars.usda.gov
  • 2. Introduction (Really Big Data: Processing and Analysis of Very Large Datasets, ADSA/ASAS Joint Annual Meeting, July 2011)
      • We’re drowning in information
      • Genetics are viewed as a commodity
      • We need to get better data from fewer cows
      • Do we have the resources we need?
  • 3. U.S. dairy population
      [Figure: U.S. dairy population; x-axis: Year (1940 to 2000); y-axis: Cows (millions), 0 to 30]
  • 4. We need to do more with less
      • 47% of U.S. dairy cows are enrolled in DHIA testing
      • The Class III milk price is $17/cwt
      • Grain prices are very high
          - Corn averaged $6/bu in May
          - Soybeans averaged $13/bu in May
      • Enrollment and cow numbers are unlikely to increase
  • 5. Major topics
      • Different sources of data
      • Data source integration and quality
      • Data mining models
      • Visualization examples
      • Computational resources
  • 6. Data currently in national database
      • Identification and registration
      • Conformation scores
      • Milk production and composition
      • Fertility
      • Longevity
      • Some genotypes
  • 7. What are big data?
      Type of record              Number of records [1]
      Cows with lactation data    28,394,976
      Lactations                  68,373,863
      Individual test days        508,574,732
      Calving ease records        20,770,758
      Animals in pedigree file    58,893,009
      Bull genotypes              50,393
      Cow genotypes               70,687
      [1] Totals include animals from all breeds.
  • 8. Data not routinely available
      • Farm and herd management
          - Geography and climate
          - Housing systems
          - Feed intake
      • Milk composition
          - Milk fats, proteins, vitamins, minerals
          - Conductivity, lactose, MUN
      • DNA data
          - Cow SNP genotypes, DNA sequence data
      Photo: NOAA
  • 9. Data “trapped” on the farm
      • Fertility
          - Insemination information
          - Use of estrus synchronization
      • Cow health and longevity
          - Body condition scores
          - Birth weights and mature weights
          - Disease occurrence data
  • 10. Electronic milk meters
      • Currently can provide:
          - Milk yield
          - Milking speed
          - Electrical conductivity
      • May possibly supply:
          - Progesterone levels
          - Milk temperature
          - Fat and protein concentrations
      Photo: afimilk
  • 11. Other sources of data
      • RFID tags have lower ID error rates associated with meter data
      • Pedometers are useful for detecting estrus, the onset of calving, and some early-stage disease
      Photos: Allflex (top); afimilk (bottom)
  • 12. Current sources of data
      [Diagram nodes: AIPL, CDCB, NAAB, PDCA, DHI, Universities]
      • AIPL: Animal Improvement Programs Lab., USDA
      • CDCB: Council on Dairy Cattle Breeding
      • DHI: Dairy Herd Improvement (milk recording organizations)
      • NAAB: National Association of Animal Breeders (AI)
      • PDCA: Purebred Dairy Cattle Association (breed registries)
  • 13. Sources of genomic data
      [Diagram nodes: AIPL; Requester (e.g., AI, breeds); Dairy producers; DNA laboratories. Arrow label: samples]
  • 14. Data source integration
      • Incoming data from different sources are checked against one another
      • The AIPL edits system consists of ~64,000 SLOC
          - Mostly C, some Fortran 90
      • Data stored in a relational database
  • 15. Typical edits
      • Match birth date with dam’s calving
      • Compare with other sources (e.g., breed association)
      • Investigate maternal sibs born within 9 mo (may assume ET)
      • Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
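The last two edits on slide 15 can be sketched as simple pairwise checks. This is only an illustration, not the AIPL edits system (which the slides describe as mostly C with some Fortran 90): the record layout, the numeric IDs, the function names, and the 270-day stand-in for "9 mo" are all assumptions made for the example.

```python
from datetime import date

def likely_twins(a, b, max_id_gap=100):
    """Twin rule: IDs within 100, same sire, same dam, same birth date."""
    return (
        abs(a["id"] - b["id"]) <= max_id_gap
        and a["sire"] == b["sire"]
        and a["dam"] == b["dam"]
        and a["birth"] == b["birth"]
    )

def close_maternal_sibs(a, b, max_days=270):
    """Flag maternal sibs born on different dates less than ~9 months apart.

    Such pairs need investigation; embryo transfer (ET) is one explanation.
    The 270-day cut-off is an assumed approximation of "9 mo".
    """
    if a["dam"] != b["dam"]:
        return False
    gap_days = abs((a["birth"] - b["birth"]).days)
    return 0 < gap_days < max_days

# Hypothetical records, invented for illustration only.
calf_1 = {"id": 1001, "sire": "S1", "dam": "D1", "birth": date(2011, 3, 1)}
calf_2 = {"id": 1040, "sire": "S1", "dam": "D1", "birth": date(2011, 3, 1)}
calf_3 = {"id": 5000, "sire": "S2", "dam": "D1", "birth": date(2011, 7, 15)}

print(likely_twins(calf_1, calf_2))         # same parents and date, IDs 39 apart
print(close_maternal_sibs(calf_1, calf_3))  # same dam, born 136 days apart
```

Comparing every pair of records would be too slow at national scale; a production edit would more plausibly group records by dam first and compare only within groups.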
  • 16. How do we assess data quality?
      • Consistency
          - e.g., calving, progeny birth, breeding, dry dates
      • Parentage verification
      • Electronic ID
      • Within-herd heritability
  • 17. Data mining
      • The discovery of useful, possibly unexpected patterns in data
      • Four principal tasks
          - Association
          - Clustering
          - Classification
          - Regression
  • 18. Bonferroni’s principle
      • You will find interesting patterns if you look hard enough
      • Not all relationships are legitimate
      • You must have enough data to support the questions you’re asking
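One way to make the point of slide 18 concrete is to count how many "interesting" results pure chance would produce. The sketch below assumes independent tests at a fixed significance level and also shows the standard Bonferroni-corrected threshold; the slide itself does not prescribe either calculation. The number of tests is borrowed from the 43,382 SNP solutions mentioned on slide 30, and the 0.05 level is an assumption.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance 'discoveries' if no real effect exists."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / n_tests

n_snps = 43_382   # number of SNP solutions quoted on slide 30
alpha = 0.05      # assumed nominal significance level

chance_hits = expected_false_positives(n_snps, alpha)
per_test_alpha = bonferroni_threshold(n_snps, alpha)

print(f"expected chance hits: {chance_hits:.0f}")
print(f"Bonferroni per-test threshold: {per_test_alpha:.2e}")
```

With tens of thousands of tests, roughly two thousand markers would look significant at the nominal level by chance alone, which is the sense in which "you will find interesting patterns if you look hard enough."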
  • 19. Association analysis
      • Discover interesting relationships among variables in large databases
          - e.g., predicting protein function and identifying SNP-disease associations
          - Not statistical association analysis!
      • Lots of algorithms, many based on counting attributes
      • Watch for false positives
          - Measures co-occurrence, not causality
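The counting idea behind many association algorithms can be sketched with two common measures, support and confidence. The slide names no specific algorithm, so this is a generic illustration; the event names and records are invented toy data.

```python
from collections import Counter
from itertools import combinations

def pair_support(transactions):
    """Fraction of transactions in which each pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: count / n for pair, count in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)

# Hypothetical per-cow event sets, invented for illustration only.
records = [
    {"mastitis", "high_scc"},
    {"mastitis", "high_scc", "lameness"},
    {"high_scc"},
    {"lameness"},
]

support = pair_support(records)
rule_confidence = confidence(records, "high_scc", "mastitis")

print(support[("high_scc", "mastitis")])
print(rule_confidence)
```

Both numbers describe how often items occur together; neither says anything about which one causes the other.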
  • 20. Clustering
      • Place items into distinct groups such that
          - Items in a group are similar
          - Items in one group are dissimilar to those in other groups
      • Hierarchical or partitional approaches
  • 21. Partitional clustering
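Slide 21 shows only a figure, so as a stand-in here is a minimal sketch of k-means, one widely used partitional method (the slides do not say which method the figure used). It works on one-dimensional values with fixed starting centers to stay short; the data and function name are made up for illustration.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Very small k-means for 1-D data: assign to nearest center, then re-average."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements forming two obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])

print(centers)  # close to 1.0 and 5.0
print(groups)
```

Each value ends up in exactly one group, which is what distinguishes a partitional result from the nested clusters of a hierarchical one.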
  • 22. Hierarchical clustering
      • Nested clusters organized into hierarchical trees
      • Data objects may belong to multiple subsets
      • Examples
          - Relationships among species
          - Evolutionary history of proteins
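The nesting described on slide 22 can be illustrated with agglomerative clustering, sketched here with single linkage on one-dimensional points. The choice of single linkage and the toy data are assumptions; the slide does not name a linkage rule.

```python
def single_linkage(points):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical 1-D positions, invented for illustration only.
merges = single_linkage([1, 2, 6, 7, 20])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Read from first merge to last, the list is the tree: every point belongs to its own leaf, to each intermediate cluster that absorbs it, and finally to the root.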
  • 23. [Figure labels]
      • BFGL-Illumina Deep SNP Discovery: Angus, Holstein, Limousin, Jersey, Nelore, Brahman, Romagnola, Gir
      • BFGL Genome Assemblies: Nelore, Water Buffalo
      • Pfizer Light SNP Discovery: Angus, Holstein, Jersey, Hereford, Charolais, Simmental, Brahman, Wagyu
      • Partners Deep SNP Discovery: N’Dama, Sahiwal, Simmental, Hanwoo, Blonde d’Aquitaine, Montbeliard
  • 24. Classification
      • Training set used to develop a rule for assigning individuals to classes
      • Validation set used to assess the accuracy of the classification rule
      • Examples
          - Identify cows with subclinical mastitis
          - Mate assignment
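The training/validation split on slide 24 can be sketched with the simplest possible classifier: a single cut-off learned from the training set and then scored on separate validation records. The measurement (imagine a somatic cell count in thousands of cells/mL), the data, and the function names are all hypothetical.

```python
def fit_threshold(train):
    """Pick the cut-off that classifies the most training records correctly."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: sum((x >= t) == y for x, y in train))

def accuracy(threshold, records):
    """Fraction of records whose predicted class (x >= threshold) matches the label."""
    return sum((x >= threshold) == y for x, y in records) / len(records)

# (measurement, has_subclinical_mastitis) pairs, invented for illustration only.
train = [(50, False), (90, False), (150, False), (250, True), (400, True), (800, True)]
valid = [(70, False), (180, False), (300, True), (260, False)]

threshold = fit_threshold(train)
train_accuracy = accuracy(threshold, train)
valid_accuracy = accuracy(threshold, valid)

print(threshold, train_accuracy, valid_accuracy)
```

The rule is perfect on the data it was fitted to but misses one validation record, which is why accuracy is assessed on records the rule has not seen.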
  • 25. Classification methods
      • Bayesian belief networks
      • Decision trees
      • Nearest-neighbor classification
      • Neural networks
      • Rule-based classification
      • Support vector machines
  • 26. Decision tree classification
      Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
  • 27. Rule-based classification
      • Classify records using a series of “if…then” rules
      • Rules come directly from the data, or from other classification models
      • e.g., if (PTA NM$ ≥ $800) and (EFI ≤ 0.05) then (breed to cow)
      • Easy to generate and interpret
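The example rule on slide 27 translates almost directly into code. The thresholds come from the slide; the bull records and field names are hypothetical.

```python
def breed_to_cow(pta_nm_dollars, efi):
    """Rule from the slide: PTA NM$ >= $800 and EFI <= 0.05 means 'breed to cow'."""
    return pta_nm_dollars >= 800 and efi <= 0.05

# Hypothetical candidate bulls, invented for illustration only.
bulls = [
    {"name": "bull_A", "pta_nm": 850, "efi": 0.04},
    {"name": "bull_B", "pta_nm": 820, "efi": 0.07},
    {"name": "bull_C", "pta_nm": 790, "efi": 0.03},
]

selected = [b["name"] for b in bulls if breed_to_cow(b["pta_nm"], b["efi"])]
print(selected)
```

Only the bull that satisfies both conditions is selected, and the reason for every rejection can be read straight from the rule.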
  • 28. Regression models
      • Prediction of real-valued outputs
      • Given one or more attributes, we can predict, for example:
          - Breeding values
          - Feed intake
          - Milk and components yields
      • Very mature analytical tools
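As a minimal instance of predicting a real-valued output, the sketch below fits a one-predictor least-squares line in closed form. Real genetic evaluations use far richer models; the data here are invented and chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical attribute/response pairs, invented for illustration only.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 5

print(slope, intercept, prediction)
```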
  • 29. Visualization
      • How do we present lots of numbers in a compact form?
      • “Graphical methods can retain the information in the data.” (Deming)
      • Complements numerical techniques
          - Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)
  • 30. One image, millions of points
      43,382 SNP solutions × 4,064 animals = 176,304,448 data points
  • 31. Use size to denote importance
      Colors differentiate among chromosomes, and marker sizes are proportional to effect sizes.
  • 32. O-Style Haplotypes (chromosome 15)
  • 33. Correlations among calving traits
  • 34. Provide multiple cues
      Lines are differentiated by color and pattern.
      Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
  • 35. Interstitial figures
      Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
  • 36. Computational capacity is abundant
      WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
  • 37. Supercomputer performance
      • Cray-1 (1976): 136 megaFLOPS (10^6)
      • Fujitsu K machine (2011): 8.16 petaFLOPS (10^15)
      • Commodity hardware also has experienced gains in performance
      Photos: Sherwin Gooch (top); Riken (bottom)
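The two figures on slide 37 imply a speed-up of about 60 million-fold over 35 years. The short calculation below works that out, along with the doubling time it would correspond to if growth had been steadily exponential (an assumption made only for this illustration).

```python
import math

cray_1_flops = 136e6        # 136 megaFLOPS (1976), from the slide
k_machine_flops = 8.16e15   # 8.16 petaFLOPS (2011), from the slide
years = 2011 - 1976

speedup = k_machine_flops / cray_1_flops
doublings = math.log2(speedup)
doubling_time_years = years / doublings  # assumes steady exponential growth

print(f"speed-up: {speedup:.1e}")
print(f"implied doubling time: {doubling_time_years:.2f} years")
```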
  • 38. Storage costs are plummeting
      Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
  • 39. Data storage technologies
      • Storage costs are now as low as $100/TB
          - Quality costs!
      • Solid state disks are promising, but relatively low-capacity
      • What do you do about backups?
      Photos: Snopes/IBM (top); Tom’s Hardware (bottom)
  • 40. Memory is very cheap
      Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
  • 41. Random access memory
      • RAM is still much faster than disk (ns vs. ms access times)
      • A 64-bit OS can address 16 EiB (2^64 bytes, about 18.4 EB), in theory
      • How much can your motherboard hold?
      Photos: Stan Yack (top); Samsung (bottom)
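The theoretical limit of a flat 64-bit address space is 2^64 bytes. The sketch below converts that to binary (EiB) and decimal (EB) units and also shows the order-of-magnitude gap between nanosecond and millisecond access times; it assumes byte addressing and ignores the smaller limits that real processors and operating systems impose.

```python
address_space_bytes = 2 ** 64             # byte-addressable, full 64-bit pointers

exbibytes = address_space_bytes / 2 ** 60  # binary units (EiB)
exabytes = address_space_bytes / 10 ** 18  # decimal units (EB)

# A millisecond is a million nanoseconds, so "ns vs. ms" is roughly
# a million-fold difference in access time.
nanoseconds_per_millisecond = 10 ** 6

print(f"{address_space_bytes:,} bytes")
print(f"{exbibytes:.0f} EiB, or about {exabytes:.1f} EB")
print(f"ms/ns ratio: {nanoseconds_per_millisecond:,}")
```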
  • 42. Software
      • Complexity is increasing
          - Parallelism is hard and debugging is much harder
      • Productive developers are expensive and difficult to find
          - A top programmer may be 10x as productive as an average worker
  • 43. Conclusions
      • The more data we get, the more data we want
      • Relationships among traits may become as important as individual traits
      • Software may be more limiting than hardware
  • 44. Questions?