Hadoop DB

  1. HadoopDB: An Architectural Hybrid of MapReduce & DBMS Technologies for Analytical Workloads, Tilani Gunawardena
  2. Road Map • Motivation • Introduction • Desired Properties • Background & Shortfalls • HadoopDB • Benchmarks • Fault Tolerance • Conclusion • Related Work • References
  3. Introduction • Analyzing massive structured data on 1000s of shared-nothing nodes • Shared-nothing architecture: a collection of independent, possibly virtual machines, each with its own local disk and local main memory, connected together over a high-speed network • Approaches: • Parallel databases • Map/Reduce systems
  4. Desired Properties • Performance • A primary characteristic that commercial database systems use to distinguish themselves • Fault tolerance • Ability to run in heterogeneous environments • With an increasing number of nodes, it is difficult to keep the cluster homogeneous • Flexible query interface • Usually JDBC or ODBC plus a UDF mechanism • Both SQL and non-SQL interfaces are desirable
  5. Background: Parallel DBMS • Standard relational tables and SQL • Indexing, compression, caching, I/O sharing • Tables partitioned over nodes, transparently to the user • Meets the performance requirement, but needs a highly skilled DBA • Flexible query interfaces: UDF support varies across implementations • Fault tolerance: does not score so well • Assumption: failures are rare • Assumption: clusters of only dozens of nodes
  6. Background: MapReduce • Satisfies fault tolerance • Works in heterogeneous environments • Drawback: performance • Lacks many performance-enhancing techniques • Interfaces • M/R jobs can be written in multiple languages • SQL is not supported directly (excluding, e.g., Hive)
  7. MapReduce (Hadoop) • MapReduce is a programming model which specifies: • A map function that processes a key/value pair to generate a set of intermediate key/value pairs • A reduce function that merges all intermediate values associated with the same intermediate key • Hadoop is a MapReduce implementation for processing large data sets over 1000s of nodes • Maps and Reduces run independently of each other over blocks of data distributed across a cluster (a minimal code sketch follows below)
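The two functions described on this slide are easiest to see in code. Below is a minimal sketch of a Hadoop Map/Reduce pair in Java, using the classic word-count example; the class names and tokenization logic are illustrative and not taken from the slides.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: processes one input key/value pair (byte offset, line of text)
// and emits intermediate (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }
}

// Reduce: merges all intermediate values associated with the same key.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // (word, total count)
    }
}
```

Each Map task runs the mapper independently over one block of input; the framework then groups the intermediate pairs by key and hands each group to a reducer, exactly as the slide describes.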
  8. Background: MapReduce
  9. Differences between Parallel Databases and MapReduce?
  10. (figure slide)
  11. HadoopDB
  12. HadoopDB • Hadoop as the communication layer above multiple nodes running single-node DBMS instances • A fully open-source solution: • PostgreSQL as the database layer • Hadoop as the communication layer • Hive as the translation layer
  13. HadoopDB: what each layer contributes • RDBMS: careful layout of data, indexing, sorting, query optimization, compression • Hadoop: job scheduling, task coordination, parallelization
  14. Ideas • Main goal: achieve the properties described before • Connect multiple single-node database systems • Hadoop as the task coordination & network communication layer • Queries parallelized across the nodes using the MapReduce framework • Fault tolerant and able to work with heterogeneous nodes • Parallel database performance • Query processing done in the database engine
  15. Architecture Background • Data storage layer (HDFS) • Block-structured file system managed by a central NameNode • Files are broken into blocks and distributed across the cluster • Data processing layer (Map/Reduce framework) • Master/slave architecture • JobTracker and TaskTrackers
  16. HadoopDB Components • Database Connector • Catalog • Data Loader • Planner (SMS)
  17. Database Connector • Interface between the DBMS and the TaskTracker • Responsibilities: • Connect to the database • Execute the SQL query • Return the results as key-value pairs • Achieved goal: data sources appear to the framework much like data blocks in HDFS (see the sketch below)
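A minimal, self-contained sketch of the connector idea, assuming plain JDBC against a node-local PostgreSQL instance. The real HadoopDB connector wraps this logic in Hadoop's InputFormat/RecordReader abstractions; the connection URL, credentials, table, and query below are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified connector: connects to one node-local PostgreSQL instance,
// runs the SQL pushed down to that node, and turns each row into a
// (key, value) pair that a Map task could consume.
public class DatabaseConnectorSketch {

    public static List<Map.Entry<String, String>> fetch(
            String jdbcUrl, String user, String password, String sql) throws SQLException {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // Convention here: first column is the key, second column the value.
                records.add(new SimpleEntry<>(rs.getString(1), rs.getString(2)));
            }
        }
        return records;
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical single-node database holding one partition of a rankings table.
        List<Map.Entry<String, String>> rows = fetch(
                "jdbc:postgresql://localhost:5432/hadoopdb_chunk0", "hadoop", "secret",
                "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 10");
        rows.forEach(r -> System.out.println(r.getKey() + "\t" + r.getValue()));
    }
}
```

Because each Map task receives plain key-value records, the framework cannot tell whether they came from an HDFS block or from a node-local database, which is the "data sources look like data blocks" goal named on the slide.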
  18. Catalog • Maintains metainformation about the databases • Database location, driver class • Datasets in the cluster, replica locations, and partitioning properties • The catalog is stored as an XML file in HDFS • The plan is to deploy it as a separate service
  19. Data Loader • Responsibilities: • Globally partition the data on a given key • Break single-node data into smaller chunks • Bulk-load the chunks into the single-node databases • Two main components (see the sketch below): • Global Hasher: a Map/Reduce job that reads from HDFS and repartitions the data • Local Hasher: copies data from HDFS to the local file system
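A minimal sketch of the hash-repartitioning step the Global Hasher performs, written as a Hadoop Partitioner in Java. The class name is hypothetical, and it assumes the chosen partition key is emitted as the map output key.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative hash partitioner: routes every record with the same partition
// key to the same output partition, so that each single-node database ends up
// owning a disjoint hash range of the table.
public class GlobalHashPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // numPartitions = number of nodes (one database partition per node);
        // mask the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Records with the same key always land in the same partition; the Local Hasher then breaks that partition into chunks on its node, ready to be bulk-loaded into the local database.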
  20. SMS Planner • Extends Hive • Steps: • The parser transforms the query into an abstract syntax tree (AST) • Table schema information is fetched from the catalog • The logical plan generator creates a query plan • The optimizer breaks the plan up into Map and Reduce phases • An executable plan is generated as one or more MapReduce jobs • SMS tries to push as much work as possible down to the database layer
  21. (figure slide)
  22. Benchmarking • Environment: Amazon EC2 “large” instances • Each instance: 7.5 GB memory, 2 virtual cores, 850 GB storage, 64-bit Linux Fedora 8 • Systems • Hadoop: 256 MB data blocks, 1024 MB heap size, 200 MB sort buffer • HadoopDB: similar to the Hadoop configuration, PostgreSQL 8.2.5, no data compression • Vertica: used the cloud edition, all data compressed • DBMS-X: commercial parallel row-oriented database, run on EC2 (no cloud edition available)
  23. Benchmarking • Data used: HTTP log files, HTML pages, rankings • Sizes (per node): • 155 million user visits (~20 GB) • 18 million rankings (~1 GB) • Stored as plain text in HDFS
  24. Evaluating HadoopDB • Compare HadoopDB to: • Hadoop • Two parallel databases (Vertica, DBMS-X) • Features: • Performance: we expected HadoopDB to approach the performance of parallel databases • Scalability: we expected HadoopDB to scale as well as Hadoop • We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes
  25. Benchmark Tasks • Data loading • Grep task • Selection task • Aggregation task • Join task • UDF aggregation task • Fault tolerance and heterogeneous environment
  26. Data Load
  27. Query Results
  28. • Load: HadoopDB’s data loads are slower than Hadoop’s, but faster than the parallel databases’ • Runtime: • Structured data: HadoopDB is faster than Hadoop but slower than the parallel databases (HadoopDB’s performance is close to that of the parallel databases) • Unstructured data: HadoopDB’s performance matches Hadoop’s
  29. Scalability: Setup • Simple aggregation task (full table scan) • Data replicated across 10 nodes • Fault tolerance: kill a node halfway through the query • Fluctuation tolerance: slow down a node for the entire experiment
  30. Scalability: Results • HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks • Parallel databases restart the entire query on a node failure, or wait for the slowest node
  31. To Summarize • HadoopDB: a hybrid of DBMS and MapReduce technologies • HadoopDB is close in performance to parallel databases • HadoopDB is able to operate in truly heterogeneous environments and has the fault tolerance of Hadoop • It is free and open-source: http://hadoopdb.sourceforge.net
  32. Related Work • Pig project at Yahoo! • SCOPE project at Microsoft • Hive project
  33. Future Work • Integration with other open-source databases • Full automation of the loading and replication process • Dynamically adjusting fault-tolerance levels based on the failure rate
  34. Thank You!

Editor's Notes

  • Hadoop = open-source MapReduce. Parallel databases = shared-nothing RDBMSs. What parallel databases got right: data partitioning, indexing, parallel sorts, joins, aggregation. Interesting ideas in MR: very flexible, can handle almost any data type (records, arrays, images); runtime job scheduler & load balancing; fault tolerance and straggler handling.
  • It is impossible to get homogeneous performance across 100s/1000s of nodes, even if the nodes run on identical hardware or on identical virtual machines.
  • Scaling not performance
  • Parallel DBMS: best at ad-hoc analytical queries; substantially faster once the data is loaded, but loading the data takes considerably longer (and who wants to program parallel joins?). MapReduce: very well suited for extract-transform-load tasks; ease of use for complex analytic tasks.
  • Basic design idea: multiple, independent, single-node databases coordinated by Hadoop. Fault tolerance: the scheduling and job tracking implementation comes from Hadoop.
  • The NameNode maintains metadata about the size and location of blocks and their replicas. MapReduce framework: master-slave architecture; the master is a single JobTracker and the slaves are TaskTrackers. Each job is broken down into Map tasks and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing: locality by matching a TaskTracker to Map tasks that process data local to it, and load balancing by ensuring all available TaskTrackers are assigned tasks.
  • AST building; the semantic analyzer connects to the catalog; a DAG of relational operators is produced; the optimizer restructures the plan; the plan is converted into M/R jobs; the DAG of M/R jobs is serialized into an XML plan.
  • SMS planner extends Hive
  • After global partitioning
  • Small query and large query
  • HadoopDB distinguishes itself from many of the current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters.
