MPP vs Hadoop

This is the presentation I gave at the Hadoop User Group Ireland meetup in Dublin. It covers the main ideas of MPP, Hadoop, and distributed systems in general, and how to choose the best option for you.

Published in: Data & Analytics

  1. MPP vs Hadoop
     Alexey Grishchenko
     HUG Meetup 28.11.2015
     Pivotal Confidential–Internal Use Only
  2. Agenda
     • Distributed Systems
     • MPP
     • Hadoop
     • MPP vs Hadoop
     • Summary
  4. Distributed Systems
     Avoid distributed systems for any problem that could potentially be solved with a non-distributed system.
  5. Distributed Systems
     • Consensus problem
       – Paxos
       – Raft
       – ZAB
       – etc.
     • Transaction consistency
       – 2PC
       – 3PC
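The two-phase commit (2PC) protocol named on slide 5 can be illustrated with a minimal in-memory sketch. The `Participant` class and the `two_phase_commit` function are hypothetical names introduced here, not part of the talk, and the sketch assumes no crashes, timeouts, or durable logging, which a real implementation would have to handle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    """Hypothetical in-memory participant (assumption: no crashes, no durable log)."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: record and return this participant's vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only if every vote was "yes", otherwise abort all.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction on every participant, which is what gives the protocol its all-or-nothing behavior.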
  6. Distributed Systems
     • CAP Theorem
  7. Distributed Systems
     http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
  8. Distributed Systems
     Reasons to use
     • Performance issues
       – More than 100,000 TPS
       – More than 4 GB/sec scan rate
       – More than 100,000 IOPS
     • Capacity issues
       – More than 50 TB of data
     • DR and Geo-Distribution
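The thresholds on slide 8 can be read as a simple checklist. The sketch below encodes them literally; the `Workload` fields and the `reasons_to_distribute` function are hypothetical names chosen for illustration, and the numbers are the slide's rules of thumb rather than hard limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical workload description; field names are assumptions."""
    tps: float = 0.0
    scan_gb_per_sec: float = 0.0
    iops: float = 0.0
    data_tb: float = 0.0
    needs_dr_or_geo: bool = False


def reasons_to_distribute(w: Workload) -> List[str]:
    # Each check mirrors one bullet from slide 8.
    reasons = []
    if w.tps > 100_000:
        reasons.append("performance: TPS")
    if w.scan_gb_per_sec > 4:
        reasons.append("performance: scan rate")
    if w.iops > 100_000:
        reasons.append("performance: IOPS")
    if w.data_tb > 50:
        reasons.append("capacity: data volume")
    if w.needs_dr_or_geo:
        reasons.append("DR / geo-distribution")
    return reasons
```

An empty result corresponds to the advice on slide 4: if none of the reasons applies, a non-distributed system is probably enough.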
  10. MPP: Main principles
     • Shared Nothing
     • Data Sharding
     • Data Replication
     • Distributed Transactions
     • Parallel Processing
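Two of the principles on slide 10, data sharding and parallel processing, can be sketched together. This is a single-process illustration under stated assumptions: rows are plain dictionaries, shards are chosen by a CRC32 hash of the key (one possible scheme, not one the talk prescribes), and `shard_for`, `distribute`, and `sharded_sum` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def distribute(rows, key_column, num_shards):
    # Shared nothing: every row lands on exactly one shard, chosen by its key.
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(str(row[key_column]), num_shards)].append(row)
    return shards


def sharded_sum(shards, value_column):
    # Each shard computes a partial result on its own data (run sequentially here;
    # a real MPP system would run these on separate servers in parallel).
    partials = {sid: sum(r[value_column] for r in rows) for sid, rows in shards.items()}
    # A coordinator merges the partial results into the final answer.
    return sum(partials.values())
```

Because every row with the same key maps to the same shard, joins and aggregations on that key can be computed locally before the partial results are merged.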
  11. MPP
  12. MPP: Works well for
     • Relational data
     • Batch processing
     • Ad hoc analytical SQL
     • Low concurrency
     • Applications requiring ANSI SQL
  13. MPP: Not the best choice for
     • Non-relational data
     • OLTP and event stream processing
     • High concurrency
     • 100+ server clusters
     • Non-analytical use cases
     • Geo-Distributed use cases
  15. Hadoop: Main Components
     • HDFS
     • YARN
     • MapReduce
     • HBase
     • Hive / Hive+Tez
  16. Hadoop: HDFS
     • Distributed filesystem
     • Block-level storage with big blocks
     • Non-updatable
     • Synchronous block replication
     • No built-in Geo-Distribution support
     • No built-in DR solution
  17. Hadoop: HDFS
  18. Hadoop: YARN
     • Cluster resource manager
     • Manages CPU and RAM allocation
     • Schedulers are pluggable
     • Can handle different resource pools
     • Supports both MR and non-MR workloads
  19. Hadoop: YARN
  20. Hadoop: MapReduce
     • Framework for distributed data processing
     • Two main operations: map and reduce
     • Data hits disk after “map” and before “reduce”
     • Scales to thousands of servers
     • Can process petabytes of data
     • Extremely reliable
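The map and reduce operations from slide 20 can be illustrated with an in-memory word count. This sketch only mirrors the programming model: `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names, nothing here uses the Hadoop API, and the intermediate pairs stay in memory, whereas Hadoop MapReduce writes them to disk between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    # Map: emit one (word, 1) pair per word in the input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key so each key reaches a single reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each `map_phase` call depends on one input record only and each `reduce_phase` call on one key only, both phases can be spread across many servers, which is where the scalability claimed on the slide comes from.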
  21. Hadoop: MapReduce
  22. Hadoop: HBase
     • Distributed key-value store
     • Data is sharded by key
     • Data is stored in sorted order
     • Stores multiple versions of the row
     • Easily scales
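The data model described on slide 22 (keys kept in sorted order, several versions per row) can be sketched with a toy single-node store. `ToyVersionedStore` is a hypothetical class for illustration only; it does not use or imitate the HBase client API, and the `max_versions=3` default is an arbitrary assumption.

```python
import bisect


class ToyVersionedStore:
    """Toy sorted key-value store with versioned values (illustration only)."""

    def __init__(self, max_versions=3):
        self._keys = []    # row keys, kept sorted
        self._cells = {}   # key -> list of (timestamp, value), newest first
        self.max_versions = max_versions

    def put(self, key, value, ts):
        if key not in self._cells:
            bisect.insort(self._keys, key)
            self._cells[key] = []
        versions = self._cells[key]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]   # drop versions beyond the limit

    def get(self, key, versions=1):
        # Return up to `versions` values, newest first.
        return [value for _, value in self._cells.get(key, [])[:versions]]

    def scan(self, start, stop):
        # Sorted keys make range scans cheap: two binary searches, then a slice.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._cells[k][0][1]) for k in self._keys[lo:hi]]
```

Keeping keys sorted is also what makes sharding by key range straightforward: a contiguous slice of the key list can be handed to another server.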
  23. Hadoop: HBase
  24. Hadoop: Hive
     • Query engine with SQL-like syntax
     • Translates HiveQL query to MR / Tez / Spark job
     • Processes HDFS data
     • Supports UDFs and UDAFs
  25. Hadoop: Hive
  26. Hadoop: Works well for
     • Write Once Read Many
     • 100+ server clusters
     • Both relational and non-relational data
     • High concurrency
     • Batch processing and analytical workload
     • Elastic scalability
  27. Hadoop: Not the best choice for
     • Write-heavy workloads
     • Small clusters
     • Analytical DWH cases
     • OLTP and event stream processing
     • Cost savings
  39. MPP vs Hadoop for Business

                                 MPP                   Hadoop
      Platform Openness          Mostly Closed         Open
      Hardware Options           Mostly Appliances     Commodity
      Vendor Lock-in             Typical               Not Common
      Technology Price           $200K – $10M          $50K – $500K
      Implementation Cost        Moderate              High
      Extensibility              Vendor-provided APIs  Open Source
      Supportability             Easy                  Complex
      Scalability (servers)      Up to 100 servers     Up to 5000 servers
      Scalability (data volume)  Up to 100-300 TB      Up to 100 PB
      Target Systems             DWH                   Purpose-Built Batch
      Target End Users           Business Analysts     Developers
  49. MPP vs Hadoop for Architect

                             MPP            Hadoop
      Query Optimization     Good           Poor to None
      Debugging              Easy           Very Hard
      Accessibility          SQL            Mainly Java
      DBA Skill Level        Low            High
      Single Job Redundancy  Low            High
      Query Latency          10-20 ms       10-20 sec
      Query Runtime          5-7 sec        10-15 mins
      Query Max Runtime      1-2 hours      1-2 weeks
      Min Collection Size    Megabytes      Gigabytes
      Max Concurrency        10-15 queries  70-100 jobs
  50. Agenda
     • Distributed Systems
     • MPP
     • Hadoop
     • MPP vs Hadoop
     • Examples
     • Summary
  51. Summary
     Use MPP for
     • Analytical DWH
     • Ad hoc analyst SQL queries and BI
     • Keep under 100 TB of data
     Use Hadoop for
     • Specialized data processing systems
     • Over 100 TB of data
  52. Questions?
  53. Built for the speed of business
