Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database
1. Calpont InfiniDB®
Accelerating Data Insights
Huge Data Analytics: Calpont InfiniDB Columnar DBMS Empowers New Research with The World’s First Searchable Genotype Database
Strata Conference 2012
Calpont Proprietary and Confidential
2. Today’s Agenda
• Introduction of today’s speakers
• What is InfiniDB?
• Announced today: InfiniDB 3
• Huge Data Analytics: InfiniDB Empowers New Research with The World’s First Searchable Genotype Database
• Questions
• More information and resources
InfiniDB® Scalable. Fast. Simple. 2 Copyright © 2012 Calpont. All Rights Reserved.
3. Today’s Presenters
Fernanda Foertter
HPC Administrator / Scientific Programmer
Genus plc
Jim Tommaney
Chief Technology Officer
Calpont Corporation
5. Calpont Corporation
• Company
o Privately held and backed
o Offices
Dallas (Headquarters)
Silicon Valley
• Business
o Scale-out MPP analytic database
o MySQL Columnar + Map Reduction
o Commercial Open Core model
• Products
o InfiniDB Enterprise
Forthcoming 4th major release
o InfiniDB Community
Modified Open Source license
Calpont Mission
To provide a highly scalable data platform that enables analytic business decisions as timely as customers and markets dictate.
7. What is InfiniDB?
Scalable Fast Simple
8. What is InfiniDB?
• Big Data Analytics Engine
• Full-Featured SQL
• Familiar MySQL Look and Feel
• Game Changing Performance
9. Focus on Analytics Workloads
InfiniDB is …
Engineered for large queries
Engineered for ad-hoc flexibility
Analytics, not OLTP
Unique combination of columnar + map-reduce
10. What is InfiniDB?
Scalable Fast Simple
11. InfiniDB – Two Tier Architecture
Purpose built for big data analytics.
• User Module (UM): understands SQL.
• Performance Module (PM): operates on data blocks.
[Diagram: Single Server or …]
12. InfiniDB Performance Foundations
The Power and Scale of Map-Reduce
plus
Transformational I/O Efficiency
13. Power and Scalability of Map-Reduce
Map ↓↓↓↓↓ Reduce ↑↑↑↑↑
SQL Operations are mapped to Performance Module threads
• Parallel/Distributed Data Access
• Parallel/Distributed Joins (Inner, Outer)
• Parallel/Distributed Sub-queries (From, Where, Select)
• Parallel/Distributed Group By, Distinct, and Aggregation
• Extensible with Parallel/Distributed User Defined Functions
Results are returned to User Module in Reduce Phase
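The map and reduce phases described on this slide can be sketched in a few lines of Python. This is only an illustration of the general pattern, not InfiniDB code: the partition layout, the function names (`map_partial_sum`, `reduce_partials`, `grouped_sum`) and the sample rows are all assumptions, and a thread pool stands in for Performance Module threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows held by one
# Performance Module (PM). Rows are (region, amount) pairs.
PARTITIONS = [
    [("north", 10), ("south", 5)],
    [("north", 7), ("east", 3)],
    [("south", 4), ("east", 1)],
]


def map_partial_sum(rows):
    """Map phase: one worker computes SUM(amount) GROUP BY region locally."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def reduce_partials(partials):
    """Reduce phase: the coordinator merges the per-partition partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


def grouped_sum(partitions):
    """Run the map phase in parallel, then reduce on the calling thread."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(map_partial_sum, partitions))
    return reduce_partials(partials)


if __name__ == "__main__":
    print(grouped_sum(PARTITIONS))
```

Each worker returns a small partial result rather than raw rows, so the coordinating tier (the User Module in the slide's terms) merges only a few values per partition.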
14. Power and Scalability of Map-Reduce
Map ↓↓↓↓↓ Reduce ↑↑↑↑↑
InfiniDB is not:
… a Hadoop-style map-reduce framework.
15. Power and Scalability of Map-Reduce
Map ↓↓↓↓↓ Reduce ↑↑↑↑↑
InfiniDB is:
… a custom-built, highly optimized map-reduce framework for queries.
16. Transformational I/O Efficiency
Techniques to Avoid Unnecessary I/O
o Vertical Partitioning: read only the columns required
o Horizontal Partitioning: focus on the rows required
o Just-in-time materialization
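A small Python sketch can make the three techniques concrete. It assumes a toy layout in which every column is stored separately and split into extents that expose their minimum and maximum values. The `Extent` class, the `TABLE` contents and `select_where_between` are hypothetical names for illustration and do not describe InfiniDB's actual storage format.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """A horizontal slice of one column, with min/max metadata (assumed layout)."""
    values: list

    @property
    def lo(self):
        return min(self.values)

    @property
    def hi(self):
        return max(self.values)


# Hypothetical table stored column by column; each column has three extents.
TABLE = {
    "year": [Extent([2004, 2005]), Extent([2006, 2007]), Extent([2008, 2009])],
    "amount": [Extent([10, 20]), Extent([30, 40]), Extent([50, 60])],
    "notes": [Extent(["a", "b"]), Extent(["c", "d"]), Extent(["e", "f"])],
}


def select_where_between(table, filter_col, lo, hi, project_col):
    """Return project_col values for rows where lo <= filter_col <= hi.

    Only the two named columns are touched (vertical partitioning), and
    extents whose min/max range cannot match are skipped (horizontal).
    """
    result, extents_read = [], 0
    for i, ext in enumerate(table[filter_col]):
        if ext.hi < lo or ext.lo > hi:
            continue  # no row in this extent can satisfy the predicate
        extents_read += 1
        # The projected column is fetched only for extents that survive.
        projected = table[project_col][i].values
        for row, value in enumerate(ext.values):
            if lo <= value <= hi:
                result.append(projected[row])
    return result, extents_read


if __name__ == "__main__":
    print(select_where_between(TABLE, "year", 2008, 2009, "amount"))
```

In this sketch the "notes" column is never read, and two of the three "year" extents are skipped from their metadata alone.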
17. Transformational I/O Efficiency
Techniques for Efficient I/O
o Columnar compression reduces I/O from disk
o Global data buffer cache can reduce disk I/O
o Real-time decompression accelerates reads from disk
o Avoidance of Random I/O
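As a rough illustration of why a column of similar values costs little I/O once compressed, the sketch below compresses a synthetic single-byte column with Python's standard `zlib` module. `zlib` is only a stand-in here; the slides do not say which algorithm InfiniDB uses, and the column contents and function names are made up for the example.

```python
import zlib

# Hypothetical column: one small integer code (0, 1 or 2) per row, one byte each.
COLUMN = bytes((i * 7 + i // 3) % 3 for i in range(10_000))


def compress_column(raw: bytes) -> bytes:
    """Compress a whole column before it is written to disk."""
    return zlib.compress(raw, 6)


def read_column(stored: bytes) -> bytes:
    """Decompress a column after it is read back from disk."""
    return zlib.decompress(stored)


if __name__ == "__main__":
    stored = compress_column(COLUMN)
    # Fewer bytes on disk means fewer bytes to read for a scan of this column.
    print(len(COLUMN), "bytes raw ->", len(stored), "bytes stored")
    assert read_column(stored) == COLUMN
```

The synthetic column is highly repetitive, so the ratio it shows is not representative of real data.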
18. Simple - Automatic Everything
• Vertical Partitioning
• Horizontal Partitioning
• Compression
• Compression Algorithm Selection
• Distribution of data across disk resources
• Distribution of work across CPU resources
19. InfiniDB
Scalable Fast Simple
21. InfiniDB 3: It is Now Possible...
InfiniDB 3
22. Today’s Presenters
Fernanda Foertter
HPC Administrator / Scientific Programmer
Genus plc
Jim Tommaney
Chief Technology Officer
Calpont Corporation
24. Genetic Evaluation
Breeding Values
26. Selection for Lean Growth
1980 2005
28. Halothane Gene (1991)
• Gene is associated with:
o High carcass yield (NN)
o Stress triggers hyperthermia
o Poor meat quality (Nn/nn)
29. DNA Marker Use
[Timeline, 1990 to 2009]
1991: HAL
1994: ESR
1998: RN & MC4R
1999: FUT1 & PRKAG3
2003: MIS
2004: Large-scale SNP discovery
Phase 1991 - 2002: Single genes, QTL, candidate genes
Later phase: Large-scale SNP discovery, genome scans, sequencing
30. Sudden Data Growth
[Chart: Porcine SNP Panel Density. Number of SNPs (axis 0 to 70,000) by year, 2004 to 2009]
31. Sudden Data Growth
Sample Collection
[Charts: Animals (cumulative), axis 0 to 16,000,000, and Tissue (cumulative), axis 0 to 3,500,000, by year, 1991 to 2011]
32. Genetic Evaluation
EBV economic weights
Lean Yield
Meat Quality
Robustness
Feed efficiency
Etc
Index = a1 × EBV1 + a2 × EBV2 + . . .
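The index formula is a weighted sum of estimated breeding values (EBVs), with one economic weight per trait. A minimal Python sketch follows; the trait names are taken from the slide, but every weight and EBV value is invented for illustration, and `selection_index` is a hypothetical helper rather than anything from the project.

```python
def selection_index(ebvs, weights):
    """Compute Index = a1 * EBV1 + a2 * EBV2 + ... for one animal.

    `weights` maps trait name -> economic weight (a_i);
    `ebvs` maps trait name -> estimated breeding value (EBV_i).
    """
    missing = set(weights) - set(ebvs)
    if missing:
        raise KeyError(f"no EBV for traits: {sorted(missing)}")
    return sum(weights[trait] * ebvs[trait] for trait in weights)


# Made-up numbers, purely to show the arithmetic.
WEIGHTS = {"lean_yield": 2.0, "meat_quality": 1.0, "feed_efficiency": 0.5}
ANIMAL_EBVS = {"lean_yield": 1.5, "meat_quality": -0.5, "feed_efficiency": 4.0}

if __name__ == "__main__":
    # 2.0 * 1.5 + 1.0 * (-0.5) + 0.5 * 4.0 = 4.5
    print(selection_index(ANIMAL_EBVS, WEIGHTS))
```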
35. Project: Genotyping DB
The Need
• Accumulating SNP chip data
• Difficulty searching through
• Next Gen Sequencing
• Cheaper SNP chips
• LOTS of animals
• Other projects needed the data
Other Considerations
• Store large data…BIG data
• Scalable
• Alternative to Oracle
• Minimally impact infrastructure
• Easy for scientists to use
36. What Do Vendors Provide for Genotype Data?
Nothing.
37. Think Outside the (Vendor’s) Box…
38. All Databases are Not Created Equal
39. All Vehicles are Not Created Equal
41. SNP Data
Animal ID   SNP1   SNP2   SNP3   …   SNP65K
1           0      1      2      1   2
2           1      1      0      0   0
3
4
5           1      2      2      0   2
…
XXXX
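A table shaped like this one can be held column by column, so that a question about one SNP touches only that SNP's values. The Python sketch below uses the rows visible on the slide (animals 3 and 4 have no genotypes shown, so they are stored as `None`). It assumes, without confirmation from the slides, that the codes 0/1/2 count copies of one allele; the function names are hypothetical.

```python
# Animal IDs and one list per SNP column, aligned by position.
# Values mirror the rows shown on the slide; None marks animals 3 and 4,
# for which the slide shows no genotypes.
ANIMAL_IDS = [1, 2, 3, 4, 5]
GENOTYPES = {
    "SNP1": [0, 1, None, None, 1],
    "SNP2": [1, 1, None, None, 2],
    "SNP3": [2, 0, None, None, 2],
}


def allele_frequency(column):
    """Frequency of the counted allele in one SNP column.

    Assumes each code (0, 1 or 2) is the number of copies of that allele
    carried by a diploid animal; missing genotypes are ignored.
    """
    called = [g for g in column if g is not None]
    if not called:
        raise ValueError("no genotypes called for this SNP")
    return sum(called) / (2 * len(called))


def animals_with(genotypes, animal_ids, snp, code):
    """IDs of animals whose genotype at `snp` equals `code`."""
    return [a for a, g in zip(animal_ids, genotypes[snp]) if g == code]


if __name__ == "__main__":
    print(allele_frequency(GENOTYPES["SNP1"]))
    print(animals_with(GENOTYPES, ANIMAL_IDS, "SNP3", 2))
```

Both functions read a single column, which is the access pattern a columnar layout favors when a table has tens of thousands of SNP columns.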
42. Single Research Cohort
What about selection and cohort comparisons?
43. Column Databases Make More Sense
46. Scales for a Fraction of the Cost
Compression: up to 75%
Speed vs RDBMS: 15X faster
Scalability: 100s of TB, parallel queries/ingest
Cost vs Oracle: 25%
48. Caution: Data multiplies in a BIG way
49. Conclusions
• It helps to have a deep understanding of the scientific problems being solved
• Have a good understanding of the data access pattern
• The tool should solve 80% of the highest-use patterns
• Use a combination of software and hardware knowledge to improve performance
• Think “out of the vendor box”, especially where research is cutting edge
• Take the lead in showing users new tools they may not even be aware they want or need
51. More Information on InfiniDB
Visit us at:
o www.Calpont.com
o www.InfiniDB.org
o Visit Booth #414 to register to win an iPad 3