10. Hadoop principles and MapReduce
- HDFS: fault-tolerant, high-bandwidth clustered storage
- Automatically and transparently routes around failure
- Master (NameNode) – slave architecture
- Speculatively executes redundant tasks if certain nodes are detected to be slow
- Moves compute to data: lower latency, lower bandwidth use
12. MapReduce
- Patented by Google: a "programming model… for processing and generating large data sets"
- Allows such programs to be "automatically parallelized and executed on a large cluster"
- Works with structured and unstructured data
- The Map function processes a key/value pair to generate a set of intermediate key/value pairs
- The Reduce function merges all intermediate values associated with the same intermediate key
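The model above can be sketched as a toy single-process engine: map every input pair, group the intermediate pairs by key (the "shuffle"), then reduce each group. This is an illustrative sketch only, not the Hadoop API; the function names `map_reduce`, `map_fn`, and `reduce_fn` are assumptions for the example.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, shuffle (group by key), then reduce."""
    intermediate = defaultdict(list)
    # Map phase: each (key, value) input may emit many intermediate pairs.
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # The grouping above plays the role of the shuffle.
    # Reduce phase: merge all values that share the same intermediate key.
    return {ikey: reduce_fn(ikey, ivalues)
            for ikey, ivalues in intermediate.items()}
```

A real cluster runs the map and reduce phases on many machines and moves the intermediate pairs over the network; the data flow, however, is exactly this.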
14. Example: count word occurrences
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
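The same word count can be written as runnable Python; this is a single-process sketch, and the name `word_count` is an assumption for the example.

```python
from collections import defaultdict

def word_count(documents):
    """documents: dict of name -> contents. Runs map and reduce in one process."""
    intermediate = defaultdict(list)
    # Map: emit (word, 1) for every word in every document.
    for name, contents in documents.items():
        for word in contents.split():
            intermediate[word].append(1)
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in intermediate.items()}
```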
15. Example: distributed grep
map(String key, String value):
  // key: document name
  // value: document contents
  for each line in value:
    if line.match(pattern):
      EmitIntermediate(key, line);

reduce(String key, Iterator values):
  // key: document name
  // values: a list of lines
  for each v in values:
    Emit(v);
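A minimal Python sketch of the grep example, assuming documents fit in memory; the name `distributed_grep` is hypothetical, and in the real distributed version the map side runs on each document's local replica.

```python
import re
from collections import defaultdict

def distributed_grep(documents, pattern):
    """documents: dict of name -> contents. Returns name -> matching lines."""
    matches = defaultdict(list)
    # Map: emit (document name, line) for every line matching the pattern.
    for name, contents in documents.items():
        for line in contents.splitlines():
            if re.search(pattern, line):
                matches[name].append(line)
    # Reduce is the identity: it simply passes the matched lines through.
    return dict(matches)
```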
16. Example: URL access frequency
map(String key, String value):
  // key: log name
  // value: log contents
  for each line in value:
    EmitIntermediate(URL(line), "1");

reduce(String key, Iterator values):
  // key: a URL
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
17. Example: reverse web-link graph
map(String key, String value):
  // key: source document name
  // value: document contents
  for each link in value:
    EmitIntermediate(link, key);

reduce(String key, Iterator values):
  // key: a target link
  // values: a list of sources
  for each v in values:
    source_list.add(v);
  Emit(AsPair(key, source_list));
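The link-reversal example as a runnable single-process sketch; `reverse_link_graph` is an assumed name, and the input is pre-parsed into a source-to-targets dict rather than raw document contents.

```python
from collections import defaultdict

def reverse_link_graph(pages):
    """pages: dict of source URL -> list of target URLs.
    Returns target URL -> sorted list of sources linking to it."""
    intermediate = defaultdict(list)
    # Map: for each link in a source page, emit (target, source).
    for source, targets in pages.items():
        for target in targets:
            intermediate[target].append(source)
    # Reduce: collect the full source list for each target.
    return {target: sorted(sources)
            for target, sources in intermediate.items()}
```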
19. Execution optimization
- Locality
  - Network bandwidth is scarce
  - Compute on local copies, which are distributed by HDFS
- Task granularity
  - Ratio of map (M) to reduce (R) tasks
  - Ideally M and R should be much larger than the number of cluster machines
  - Typically M is chosen so there is one task per 64 MB block of input; R is a small multiple of the machine count
  - E.g. M = 200,000 and R = 5,000 for 2,000 machines
- Backup tasks
  - Re-execute the in-progress work of 'straggling' workers
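The task-granularity rule above can be made concrete with a little arithmetic; `task_counts` is a hypothetical helper, and the 2.5× reduce multiple is an assumption chosen so the defaults reproduce the slide's example figures (these are not Hadoop defaults).

```python
def task_counts(input_bytes, block_bytes=64 * 2**20, machines=2000,
                reduce_multiple=2.5):
    """Hypothetical sizing helper: one map task per input block, and
    reduce tasks as a small multiple of the machine count."""
    m = -(-input_bytes // block_bytes)  # ceiling division: one map task per block
    r = int(machines * reduce_multiple)
    return m, r
```

With 12.8 TB of input (200,000 blocks of 64 MB) on 2,000 machines, this yields the slide's M = 200,000 and R = 5,000.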
20. Refinements
- Partitioning function
  - How to distribute the intermediate results to reduce workers?
  - Default: hash(key) mod R
  - E.g. hash(Hostname(URL)) mod R
- Combiner function
  - Partial merging of data before the reduce step
  - Saves bandwidth
  - E.g. the many <the, 1> pairs in word counting
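Both refinements fit in a few lines of Python; the names `default_partition` and `combine_counts` are assumptions for this sketch, not framework API.

```python
from collections import Counter

def default_partition(key, R):
    # Default partitioning function: hash(key) mod R picks the reduce worker.
    return hash(key) % R

def combine_counts(pairs):
    """Combiner sketch: partially merge <key, count> pairs on the map node
    before they cross the network, so many <the, 1> pairs collapse into
    a single <the, n> pair."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())
```

Partitioning by hash(Hostname(URL)) instead of hash(URL) would send all URLs from the same host to the same reduce worker.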
21. Refinements
- Ordering: process values in order and produce ordered results
- Strong typing: strongly typed input/output values
- Skip bad records: skip records that consistently cause failures
- Counters: shared counters that may be updated from any map/reduce worker
Editor's notes
HDFS takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node is eight cores with 16 GB RAM and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.