HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

02/28/11. Xiao Qin, Department of Computer Science and Software Engineering, Auburn University. http://www.eng.auburn.edu/~xqin [email_address]. Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, and Sanjay Ghemawat (Google, Inc.).

Motivation

Map/Reduce

Distributed Grep (diagram: very big data is cut into four data splits; grep runs on each split and produces matches; cat concatenates them into all matches)
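The flow in this diagram can be sketched on a single machine. This is a minimal illustration, not Hadoop code: the names `split_data`, `grep_split`, and `distributed_grep` are hypothetical, a thread pool stands in for the cluster, and in-memory lists stand in for the data splits.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_data(lines, num_splits):
    """Cut the input into roughly equal, contiguous splits."""
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep_split(pattern, split):
    """Run grep on one split and return its matching lines."""
    regex = re.compile(pattern)
    return [line for line in split if regex.search(line)]


def distributed_grep(pattern, lines, num_splits=4):
    """Grep every split in parallel, then concatenate (cat) the matches."""
    splits = split_data(lines, num_splits)
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        # pool.map keeps the splits in input order, so matches stay ordered.
        per_split = list(pool.map(lambda s: grep_split(pattern, s), splits))
    return [line for matches in per_split for line in matches]
```

Calling `distributed_grep("hadoop", lines)` returns the matching lines in their original order, because each split is contiguous and the per-split results are concatenated in split order.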

Distributed Word Count (diagram: very big data is cut into four data splits; count runs on each split; merge combines the per-split counts into the merged count)
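The count-then-merge flow can be sketched in a few lines. This is a single-process illustration under simple assumptions: the function names are hypothetical, words are separated by whitespace, and each split is a list of text lines.

```python
from collections import Counter


def count_split(split):
    """Count the words in one split (one 'count' box in the diagram)."""
    return Counter(word for line in split for word in line.split())


def merge_counts(partial_counts):
    """Merge the per-split counts into one table (the 'merge' box)."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # update() adds counts for a Counter
    return total


def word_count(splits):
    """Count every split independently, then merge the results."""
    return merge_counts(count_split(split) for split in splits)
```

Because each split is counted independently, the `count_split` calls could run on different nodes; only the small per-split tables have to travel to the merge step.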

Map Reduce (diagram: very big data, MAP, partitioning function, REDUCE, result)

Map in Lisp (Scheme) (annotations: unary operator, binary operator)

Map/Reduce à la Google

Count words in docs

Count, Illustrated

Grep

Reverse Web-Link Graph
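A reverse web-link graph, as described in the MapReduce paper by Dean and Ghemawat, maps each link to a (target, source) pair and then groups the sources by target. The sketch below illustrates that idea in one process; the function names and the sample `pages` dictionary are hypothetical.

```python
from collections import defaultdict


def map_links(source, targets):
    """Map: emit one (target, source) pair per outgoing link."""
    return [(target, source) for target in targets]


def reduce_links(pairs):
    """Reduce: collect all sources that link to the same target."""
    grouped = defaultdict(list)
    for target, source in pairs:
        grouped[target].append(source)
    return dict(grouped)


# Hypothetical input: page name -> pages it links to.
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
pairs = [pair for src, tgts in pages.items() for pair in map_links(src, tgts)]
reverse_graph = reduce_links(pairs)
```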

Model is Widely Applicable
MapReduce programs in the Google source tree (chart)
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector / host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...

Implementation Overview

Execution

Job Processing (diagram: a JobTracker and TaskTracker 0 through TaskTracker 5 processing a "grep" job)

Task Granularity & Pipelining

MapReduce outside Google
            Master       Slave
MapReduce   jobtracker   tasktracker
DFS         namenode     datanode

Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. Download the software at: http://www.eng.auburn.edu/~xqin/software/hdfs-hc. The HDFS-HC tool was described in our paper "Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters" by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.

Hadoop Overview (J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2004)

One time setup

Hadoop Distributed File System (http://lucene.apache.org/hadoop)

Motivational Example (figure: timeline in minutes; Node A (fast) runs 1 task/min, Node B (slow) is 2x slower, Node C (slowest) is 3x slower)

The Native Strategy (figure: timeline in minutes for Node A, Node B, and Node C; bar labels: 3 tasks, 2 tasks, 6 tasks; phases: loading, transferring, processing)

Our Solution: Reducing data transfer time (figure: timeline in minutes for Node A’, Node B’, and Node C’; bar labels: 3 tasks, 2 tasks, 6 tasks; phases: loading, transferring, processing)

Preliminary Results: impact of data placement on the performance of Grep

Challenges

Measure Computing Ratios (figure: timeline; Node A runs 1 task/min, Node B is 2x slower, Node C is 3x slower)

Steps to Measure Computing Ratios
1. Run the application on each node with the same size of data and collect each node's response time individually.
2. Set the ratio of the node with the shortest response time to 1 and set the ratios of the other nodes relative to it.
3. Calculate the least common multiple of these ratios.
4. Compute the portion of each node.

Node     Response time (s)   Ratio   # of file fragments   Speed
Node A   10                  1       6                     Fastest
Node B   20                  2       3                     Average
Node C   30                  3       2                     Slowest
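The four steps can be reproduced with the numbers from the table. This is a minimal sketch, not the HDFS-HC implementation: the function names are hypothetical, and it assumes every ratio is a whole number, as in this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def computing_ratios(response_times):
    """Step 2: the shortest response time gets ratio 1; others scale from it."""
    fastest = min(response_times.values())
    return {node: Fraction(t, fastest) for node, t in response_times.items()}


def fragment_portions(ratios):
    """Steps 3 and 4: take the LCM of the ratios, then divide it by each ratio."""
    if any(r.denominator != 1 for r in ratios.values()):
        raise ValueError("this sketch handles whole-number ratios only")
    lcm = reduce(lambda a, b: a * b // gcd(a, b),
                 (r.numerator for r in ratios.values()), 1)
    return {node: lcm // r.numerator for node, r in ratios.items()}


# Step 1 is the measurement itself; these are the response times from the table.
times = {"Node A": 10, "Node B": 20, "Node C": 30}
ratios = computing_ratios(times)       # A: 1, B: 2, C: 3
portions = fragment_portions(ratios)   # A: 6, B: 3, C: 2
```

The least common multiple of 1, 2, and 3 is 6, so Node A receives 6/1 = 6 fragments, Node B 6/2 = 3, and Node C 6/3 = 2, matching the "# of file fragments" column.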

Initial Data Distribution (figure: the namenode distributes the twelve fragments of File1, labeled 1 to 9 and a to c, across datanodes A, B, and C; portion 3:2:1)
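A portion such as 3:2:1 can be turned into a concrete placement of the twelve fragments. The sketch below is illustrative only: `distribute_fragments` is a hypothetical helper, the mapping of the weights 3, 2, and 1 to nodes A, B, and C is an assumption, and handing out contiguous runs of fragments is one simple choice among many.

```python
def distribute_fragments(fragments, portions):
    """Assign fragments to nodes in proportion to their portion weights."""
    total_weight = sum(portions.values())
    if len(fragments) % total_weight != 0:
        raise ValueError("fragment count must be a multiple of the total weight")
    per_unit = len(fragments) // total_weight
    placement, start = {}, 0
    for node, weight in portions.items():
        end = start + weight * per_unit
        placement[node] = fragments[start:end]  # contiguous run for this node
        start = end
    return placement


fragments = list("123456789abc")       # the 12 fragments of File1
portions = {"A": 3, "B": 2, "C": 1}    # assumed mapping of 3:2:1 to nodes
placement = distribute_fragments(fragments, portions)
```

With these assumptions, node A holds six fragments, node B four, and node C two.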

Data Redistribution (figure: namenode with fragments 1 to 9 and a to c; datanodes A, B, and C; lists L1 and L2; portion 3:2:1)

Sharing Files among Multiple Applications

Experimental Environment
Five nodes in a heterogeneous Hadoop cluster

Node     CPU Model           CPU (Hz)   L1 Cache (KB)
Node A   Intel Core 2 Duo    2*1G=2G    204
Node B   Intel Celeron       2.8G       256
Node C   Intel Pentium 3     1.2G       256
Node D   Intel Pentium 3     1.2G       256
Node E   Intel Pentium 3     1.2G       256

Grep and WordCount

Computing ratio for two applications
Computing ratios of the five nodes with respect to the Grep and WordCount applications

Computing Node   Ratios for Grep   Ratios for WordCount
Node A           1                 1
Node B           2                 2
Node C           3.3               5
Node D           3.3               5
Node E           3.3               5

Response time of Grep and WordCount on each node (figure). Observations: application dependence, data size independence.

Impact of data placement on performance of Grep

Impact of data placement on performance of WordCount

Conclusion

Future Work

Fellowship Program, Samuel Ginn College of Engineering at Auburn University

http://www.eng.auburn.edu/programs/grad-school/fellowship-program/

Download the presentation slides http://www.slideshare.net/xqin74 Google: slideshare Xiao Qin
