High-level use-case description of one department of a hospital, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.
2. Presentation
• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
– Big Data Solution Overview
– Hive Table Schema
– Compression Performance
– Data Architecture in Hadoop
– Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?
3. Our Objectives
• Lead an end-of-study project in an industrial context
– Requirements elicitation
– Implement a “proof-of-concept” prototype
• Experiment with big data technologies
– Compare with RDBMS
4. Requirements and context
• Department of Medical Diagnostics
(medical test results DB, e.g. blood, urine, ...)
– Dr. Shaun Eintracht
• “Ad hoc” queries
• ETL queries
– Dr. Elizabeth Mac Namara
• “Business intelligence” requirements
• Real-time dashboard
• Department of Endocrinology
– Dr. Mark Trifiro
• Data mining
5. Project scope
• First iteration = improve ad-hoc queries
– Slow analytical queries and ETL (MS Access)
– Risk of “crashing” the production DB
– Some queries impossible to process
11. Data Architecture in Hadoop
• All big tables are pre-joined
– With specimen (1)
– Without specimen (2)
• Partitioned using two schemes
– Year-month (3)
– Year and Test (4)
• 4 different versions of the same data:
– stay_order_results_yearmonth
– stay_order_results_year_and_test
– stay_order_results_specimen_yearmonth
– stay_order_results_specimen_year_and_test
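The four variants above can be sketched as Hive DDL — a minimal sketch only, assuming Parquet storage and illustrative column names (the actual schema is not shown in the deck); the partition columns are what distinguish scheme (3) from scheme (4):

```sql
-- Sketch: column names are illustrative, not the actual hospital schema.
-- Scheme (3): one partition per year-month
CREATE TABLE stay_order_results_yearmonth (
  stay_id      BIGINT,
  order_id     BIGINT,
  test_code    STRING,
  result_value STRING
)
PARTITIONED BY (year_month STRING)   -- e.g. '2013-07'
STORED AS PARQUET;

-- Scheme (4): one partition per (year, test) pair
CREATE TABLE stay_order_results_year_and_test (
  stay_id      BIGINT,
  order_id     BIGINT,
  result_value STRING
)
PARTITIONED BY (year INT, test_code STRING)
STORED AS PARQUET;
```

In Hive, partition columns live outside the regular column list, so each partitioning scheme requires its own physical copy of the data — which is why the same pre-joined data ends up in four tables.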
13. Oracle Solution
• Same tables as source DB
– A big pre-joined table is not a good solution
• Techniques explored:
– Partitioning
• Partitions automatically created
– Compression
• Inefficient for joins
– Clustering
– Join multiple partitioned tables
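The "partitions automatically created" point matches Oracle's interval partitioning, which could look like the following — a sketch under assumed table and column names, not the prototype's actual DDL:

```sql
-- Sketch: Oracle interval partitioning; names are illustrative.
-- Oracle creates a new monthly partition automatically on first insert.
CREATE TABLE order_results (
  order_id     NUMBER,
  test_code    VARCHAR2(16),
  result_value VARCHAR2(64),
  result_date  DATE
)
PARTITION BY RANGE (result_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p0 VALUES LESS THAN (DATE '2010-01-01'));
```

Only the initial partition `p0` is declared; subsequent partitions appear as data arrives, so the DBA never has to pre-create them.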
14. Oracle Solution (continued)
• Avoid too many indexes on the big tables:
– They take a lot of memory
– They are slow to create
– The optimizer may not use them if a query touches more than 5% of the rows
15. Comparison: Hadoop Solution
• Pros
– Crunches massive amounts of data
– Scalability
– Free software
• Cons
– Needs a better UI and tuning
– Maintenance cost
– Requires ETL time to merge data into one table
– Big joins should be avoided
16. Comparison: Oracle Solution
• Pros
– Just need to create a slave DB (just?)
– Faster random lookups
– Easier to find expertise
• Cons
– Scales only up to a certain point
– Synchronization with the master DB:
• Rebuilding indexes would take hours
17. What are expensive queries?
• If possible, avoid these constructs on large result sets
– SELECT DISTINCT
– ORDER BY
– GROUP BY
– JOIN big table with another big table
• JOIN big table with multiple small tables should be OK
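The contrast above can be illustrated with two queries — illustrative only; the table names (`orders`, `results`, `test_codes`) are assumptions, except `stay_order_results_yearmonth`, which is one of the pre-joined tables from the Hadoop architecture slide:

```sql
-- Expensive: two big fact tables joined, then DISTINCT and a sort
-- over the whole result set.
SELECT DISTINCT o.patient_id, r.test_code
FROM orders o
JOIN results r ON r.order_id = o.order_id   -- big table x big table
ORDER BY r.test_code;

-- Cheaper: read the pre-joined table, prune to one partition, and
-- join only against a small lookup table.
SELECT r.stay_id, t.test_name
FROM stay_order_results_yearmonth r
JOIN test_codes t ON t.test_code = r.test_code  -- small dimension table
WHERE r.year_month = '2013-07';                 -- partition pruning
```

The second shape is exactly why the big tables were pre-joined and partitioned: the expensive big-to-big join is paid once at ETL time instead of on every query.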
18. Conclusion
• Recommendation to use a “classic” RDBMS
– The database fits on a single node
– Existing in-house expertise
– Acceptable performance with appropriate tuning
– Stop using MS Access
• Disadvantage: limited scalability
Editor's notes
Choosing Shaun's use case: smaller scale, immediate need, lets us test the technology.
The database contains the patients' specimen test analysis data together with the results. Running analytical queries against the production database is very slow and can interfere with its normal operation.
WE WILL NOT COVER: requirements elicitation.
25% faster with Snappy compression (5.5X compression ratio); Impala 80% faster than Oracle.
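The Snappy figure above corresponds to a documented Impala query option; enabling it when loading the pre-joined tables could look like this sketch (the target table name comes from the architecture slide; the SELECT is elided):

```sql
-- Sketch: compress Parquet data written by Impala with Snappy.
SET COMPRESSION_CODEC=snappy;

INSERT OVERWRITE TABLE stay_order_results_yearmonth
PARTITION (year_month)
SELECT ... ;  -- pre-join query elided; partition key must come last
```

The codec applies only to data written while the option is set, so it is typically issued once per ETL session before the load statements.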