Daniel Abadi HadoopWorld 2010

Daniel Abadi's HadoopWorld 2010 Slides

Published in: Technology

  1. 1. MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi, Yale University, October 12th, 2010
  2. 2. Brief History of MapReduce
      • Pre-2004: used at Google for many data processing apps, including Web indexing
      • 2004: paper in academic conference not written in traditional academic style
      • 2004-2006: Implemented in Nutch
      • 2006-2008: Split off into Hadoop; significant usage at Yahoo; buzz increases
  3. 5. Controversy
      • Vast majority of the outrage was about the comparison of the systems
      • BUT:
          • The line between MapReduce and Hadoop (which comes with HDFS) was blurring
          • Hadoop can be used as an alternative to traditional DW implementations built using DBMS software
  4. 7. SIGMOD 2009 Paper
      • Benchmarked Hadoop vs. 2 parallel database systems
          • Compared across a variety of dimensions including performance and ease of use
          • Measured differences in load and query time for some common data processing tasks
          • Used Web analytics benchmark whose goal was to be representative of tasks that:
              • Both should excel at
              • Hadoop should excel at
              • Databases should excel at
  5. 8. Hardware Setup
      • 100 node cluster
      • Each node
          • 2.4 GHz Core 2 Duo Processors
          • 4 GB RAM
          • Two 250 GB SATA HDs (74 MB/sec sequential I/O)
      • Dual GigE switches, each with 50 nodes
          • 128 Gbit/sec fabric
      • Connected by a 64 Gbit/sec ring
  6. 9. Join Task
  7. 10. UDF Task: DBMS clearly doesn’t scale
      • Calculate PageRank over a set of HTML documents
      • Performed via a UDF
  8. 11. Benchmark Conclusions
      • Hadoop has many advantages
          • Load time much faster
          • Significantly easier to install, use
          • Better parallelization of UDFs
      • Hadoop is consistently less efficient for structured, relational data
          • Reasons both fundamental and non-fundamental
          • Needs better support for compression and direct operation on compressed data
          • Needs better support for indexing
          • Needs better support for co-partitioning of datasets
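The co-partitioning point on the slide above can be illustrated with a minimal sketch. It assumes two hypothetical toy tables (`rankings` and `visits`) and a hypothetical `hash_partition` helper; none of these names come from the slides or from Hadoop. When both datasets are partitioned with the same function on the join key, matching rows land on the same node and the join can run locally without a shuffle.

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key_index, num_nodes):
    """Assign each row to a node using a stable hash of its join key."""
    partitions = defaultdict(list)
    for row in rows:
        node = zlib.crc32(str(row[key_index]).encode()) % num_nodes
        partitions[node].append(row)
    return partitions

def local_join(left_rows, right_rows):
    """Hash join of two row lists on their first field, within one node."""
    index = defaultdict(list)
    for row in left_rows:
        index[row[0]].append(row)
    return [l + r[1:] for r in right_rows for l in index[r[0]]]

# Hypothetical toy tables: (url, rank) and (url, visitor_ip).
rankings = [("a.com", 10), ("b.com", 7), ("c.com", 3)]
visits = [("a.com", "1.1.1.1"), ("c.com", "2.2.2.2"), ("a.com", "3.3.3.3")]

NUM_NODES = 4
rank_parts = hash_partition(rankings, 0, NUM_NODES)
visit_parts = hash_partition(visits, 0, NUM_NODES)

# Both tables use the same partitioning function on the join key,
# so each node can join its own partitions independently.
joined = []
for node in range(NUM_NODES):
    joined.extend(local_join(rank_parts[node], visit_parts[node]))
```

Without co-partitioning, rows with the same key may sit on different nodes, so one of the datasets has to be repartitioned over the network before the join can start.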
  9. 12. Overall Conclusion
      • MapReduce/Hadoop and parallel databases are clearly complementary
      • Use MapReduce if you want to do:
          • ETL
          • Unstructured data processing
          • Deep analysis that is hard to express in SQL
      • Use parallel databases for:
          • Traditional data warehousing / data marts
          • Structured data processing expressible in SQL
      • Cloudera agrees!
  10. 19. We’re all in agreement, right?
  11. 20. But Wait!
      • Hadoop can do everything a parallel database can do
      • Hadoop has (something resembling) a SQL interface (Hive)
      • Many of Hadoop’s performance deficiencies not fundamental
          • Result of initial design for unstructured data
          • Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads
      • Hadoop is free and open source
          • (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)
  12. 21. People are using Hadoop as a DW
      • Facebook has 12PB data warehouse in Hadoop/Hive
          • Adding 10TB per day
      • Yahoo’s warehouse is the same order of magnitude
          • Recently switched to Hadoop
  13. 22. Fault Tolerance and Cluster Heterogeneity Results
      • Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
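To make the fault-tolerance contrast concrete, here is a toy model (an assumption for illustration, not code or measurements from either kind of system). It counts task executions when a single task fails once, under two hypothetical policies: restarting the whole query, as the slide says database systems do, and re-running only the failed task, in the style of MapReduce. Tasks are run one after another purely to keep the count simple.

```python
def run_with_query_restart(tasks, fails_once):
    """Restart the whole query from the first task whenever any task fails."""
    executions = 0
    pending_failures = set(fails_once)
    while True:
        failed = False
        for task in tasks:
            executions += 1
            if task in pending_failures:
                pending_failures.remove(task)  # it will succeed next time
                failed = True
                break
        if not failed:
            return executions

def run_with_task_retry(tasks, fails_once):
    """Re-run only the task that failed, keeping all completed work."""
    executions = 0
    pending_failures = set(fails_once)
    for task in tasks:
        executions += 1
        if task in pending_failures:
            pending_failures.remove(task)
            executions += 1  # the single retry succeeds
    return executions

tasks = list(range(10))
restart_cost = run_with_query_restart(tasks, fails_once={7})
retry_cost = run_with_task_retry(tasks, fails_once={7})
print(restart_cost, retry_cost)  # prints "18 11"
```

In this model the restart policy redoes all work completed before the failure, so the gap grows with query length and with the number of nodes that can fail.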
  14. 23. So …
      • Hadoop can do everything that parallel databases can do, but:
          • Has better fault tolerance
          • Adjusts better to runtime performance fluctuations
          • Is more open / cheaper
          • Has at least as good scalability (if not better)
      • If only we could fix those performance problems on structured data
          • HadoopDB!
  15. 24. HadoopDB
      • Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems
          • Flexible query interface (accepts both SQL and MapReduce)
          • Open source (built using open source components)
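The idea on the slide above, a MapReduce-style coordinator sitting on top of independent single-node databases, can be sketched as follows. This is not HadoopDB's code: in-memory SQLite databases stand in for the per-node database systems, and the `visits` table and the `make_node`, `map_phase`, and `reduce_phase` helpers are hypothetical names chosen for illustration. Each "map" pushes a SQL aggregation into its local database, and the "reduce" merges the partial results.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create one independent single-node database holding a data partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn

def map_phase(conn):
    """Push the aggregation down into the local database as SQL."""
    return conn.execute(
        "SELECT url, SUM(revenue) FROM visits GROUP BY url"
    ).fetchall()

def reduce_phase(partials):
    """Merge the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for url, revenue in partials:
        totals[url] += revenue
    return dict(totals)

# Two hypothetical nodes, each with its own partition of the data.
nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0)]),
    make_node([("a.com", 3.0), ("c.com", 4.0)]),
]
partials = [row for conn in nodes for row in map_phase(conn)]
totals = reduce_phase(partials)
```

The design point is that the local database does the structured work it is good at (scans, indexes, aggregation), while the coordinating layer only has to schedule tasks and combine small partial results.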
  16. 25. HadoopDB Architecture
  17. 26. TPC-H Benchmark Results
  18. 27. Fault Tolerance and Cluster Heterogeneity Results
  19. 28. HadoopDB: Current Status
      • Initial open source release over a year ago
          • A bunch of new code since then, but not yet put up online
          • This new code is available by request
      • Expect the next release to be in mid-2011
      • Money available for people who want to help with development (e-mail justin.borgman@yale.edu)
  20. 29. Invisible Loading
      • Data starts in HDFS
      • Data is immediately available for processing (immediate gratification paradigm)
      • Each MapReduce job causes data movement from HDFS to database systems
      • Data is incrementally loaded, sorted, and indexed
      • Query performance improves “invisibly”
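The steps on the slide above can be sketched as a small simulation. It is an illustrative assumption, not the actual implementation: the `InvisibleLoader` class, its `chunk_size` parameter, and the use of plain Python lists in place of HDFS files and database tables are all hypothetical. Every job returns complete results immediately and, as a side effect, moves one more chunk of raw data into the sorted, database-side store.

```python
class InvisibleLoader:
    """Toy model: each job migrates one chunk from raw storage to a sorted store."""

    def __init__(self, hdfs_rows, chunk_size):
        self.raw = list(hdfs_rows)    # stand-in for data still in HDFS
        self.loaded = []              # stand-in for sorted data in the database
        self.chunk_size = chunk_size

    def run_job(self, predicate):
        # Answer from both stores, so results are complete from the first job.
        result = [r for r in self.loaded if predicate(r)]
        result += [r for r in self.raw if predicate(r)]
        # Side effect of the job: load and sort one more chunk.
        chunk = self.raw[:self.chunk_size]
        self.raw = self.raw[self.chunk_size:]
        self.loaded = sorted(self.loaded + chunk)
        return sorted(result)

    def fraction_loaded(self):
        total = len(self.raw) + len(self.loaded)
        return len(self.loaded) / total if total else 1.0

loader = InvisibleLoader([5, 3, 8, 1, 9, 2], chunk_size=2)
progress = []
for _ in range(3):
    answer = loader.run_job(lambda r: r > 2)
    progress.append(loader.fraction_loaded())
```

After three jobs all six rows sit in the sorted store, and every job returned the same complete answer along the way. A real system would scan the loaded portion through an index rather than a list comprehension, which is where the "invisible" speedup would come from.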
  21. 30. Conclusions
      • MapReduce and parallel databases are definitely complementary
      • MapReduce and parallel databases are definitely competitive
      • HadoopDB is awesome