Large Scale Data with Hadoop Galen Riley and Josh Patterson Presented at DevChatt 2010
Agenda Thinking at Scale Hadoop Architecture Distributed File System MapReduce Programming Model Examples
Data is Big The Data Deluge (2/25/2010) “Eighteen months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold.” http://www.economist.com/opinion/displaystory.cfm?story_id=15579717
Data is Big Sensor data collection 128 sensors 37 GB/day 10 bytes/sample, 30 per second Increasing 10x by 2012 http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source
Disks are Slow Disk Seek, Data Transfer Reading Files Disk seek for every access Buffered reads, locality: still seeking every disk page
Disks are Slow 10ms seek, 10MB/s transfer 1TB file, 100-byte records, 10KB pages: 10B entries, 1B pages, 1GB of updates Seek for each update, 1000 days Seek for each page, 100 days Transfer entire TB, 1 day
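A rough back-of-envelope sketch of how these orders of magnitude fall out of the stated drive parameters, assuming "seek for each update" means one seek per 100-byte record:

// Back-of-envelope check of the "Disks are Slow" numbers (a sketch, with assumptions noted).
public class DiskMath {
    public static void main(String[] args) {
        double seekSec = 0.010;            // 10 ms per seek
        double xferBytesPerSec = 10e6;     // 10 MB/s transfer
        double fileBytes = 1e12;           // 1 TB file
        double records = 1e10;             // 10B entries of 100 bytes
        double pages = 1e9;                // 1B pages of 10 KB
        double day = 86400.0;

        // One seek per record: ~1,160 days (the "1000 days" order of magnitude)
        System.out.printf("seek per record: %.0f days%n", records * seekSec / day);
        // One seek per page: ~116 days (the "100 days" order of magnitude)
        System.out.printf("seek per page:   %.0f days%n", pages * seekSec / day);
        // Stream the entire terabyte at transfer speed: ~1.2 days
        System.out.printf("full transfer:   %.1f days%n", fileBytes / xferBytesPerSec / day);
    }
}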
Disks are Slow IDE drive – 75 MB/sec, 10ms seek SATA drive – 300MB/s, 8.5ms seek SSD – 800MB/s, 2 ms “seek” (1TB = $4k!) 
// Sidetrack Observation: transfer speed improves at a greater rate than seek speed Improvement by treating disks like tapes Seek as little as possible in favor of sequential reads: operate at transfer speed http://weblogs.java.net/blog/2008/03/18/disks-have-become-tapes
An Idea: Parallelism 1 drive – 75 MB/sec 16 days for 100TB 1000 drives – 75 GB/sec 22 minutes for 100TB
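The same arithmetic, sketched for the scan times above (taking 100TB as 10^14 bytes):

// Scan-time scaling with drive count (a sketch of the arithmetic behind the slide).
public class ScanTime {
    public static void main(String[] args) {
        double totalBytes = 100e12;             // 100 TB
        double perDriveBytesPerSec = 75e6;      // 75 MB/s per drive
        System.out.printf("1 drive:     %.1f days%n",
                totalBytes / perDriveBytesPerSec / 86400);          // ~15.4 days
        System.out.printf("1000 drives: %.1f minutes%n",
                totalBytes / (1000 * perDriveBytesPerSec) / 60);    // ~22 minutes
    }
}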
A Problem: Parallelism is Hard Issues Synchronization Deadlock Limited bandwidth Timing issues Apples v. Oranges, but… MPI Data distribution, communication between nodes done manually by the programmer Considerable effort achieving parallelism compared to actual processing
A Problem: Reliability Computers are complicated Hard drive Power supply Overheating
A Problem: Reliability 1 Machine 3 years mean time between failures 1000 Machines 1 day mean time between failures
Requirements Backup; Reliable: partial failure, graceful decline rather than full halt; Data recoverability: if a node fails, another picks up its workload; Node recoverability: a fixed node can rejoin the group without a full group restart; Scalability: adding resources adds load capacity; Easy to use
Hadoop: Robust, Cheap, Reliable Apache project, open source Designed for commodity hardware Can lose whole nodes and not lose data Includes MapReduce programming model
Why Commodity Hardware? Single large computer systems are expensive and proprietary High initial costs, plus lock-in with vendor Existing methods do not work at petabyte-scale Solution: Scale “out” instead of “up”
Hadoop Distributed File System Throughput Good, Latency Bad Data Coherency Write-once, read-many access model  Files are broken up into blocks Typically 64MB or 128MB block size Each replicated on multiple DataNodes on write Intelligent Client Client can find location of blocks Client accesses data directly from DataNode
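A minimal sketch of that intelligent-client behavior using the org.apache.hadoop.fs API: the client asks the NameNode for block locations, then streams block data directly from the DataNodes. The path and configuration values here are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up the default filesystem (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The NameNode returns block locations; block data is then read
        // directly from the DataNodes that hold the replicas.
        Path path = new Path("/example/input.txt");   // placeholder path
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}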
Source: http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile&do=get&target=hdfs_dhruba.pdf
HDFS: Performance Robust in the face of multiple machine failures through aggressive replication of data blocks High Performance Checksum of 100 TB in 10 minutes, ~166 GB/sec Built to house petabytes of data
MapReduce Simple programming model that abstracts parallel programming complications away from data processing logic Made popular at Google, drives their processing systems, used on 1000s of computers in various clusters Hadoop provides an open source version of MR
MapReduce Data Flow
Using MapReduce MapReduce is a programming model for efficient distributed computing It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output Input | Map | Shuffle & Sort | Reduce | Output Efficiency comes from streaming through data (reducing seeks) and pipelining A good fit for a lot of applications: log processing, web index building
Hadoop In The Field Yahoo Facebook Twitter Commercial support available from Cloudera
Hadoop In Your Backyard openPDC project at TVA, http://openpdc.codeplex.com Cluster is currently: 20 nodes, 200TB of physical drive space Used for: cheap, redundant storage and time series data mining
Examples – Word Count Hello, World! Map Input: foo foo bar Output all words in a dataset as: { key, value } {“foo”, 1}, {“foo”, 1}, {“bar”, 1} Reduce Input: {“foo”, (1, 1)}, {“bar”, (1)} Output: {“foo”, 2}, {“bar”, 1}
Word Count: Mapper

// Needs: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Tokenize the line and emit {word, 1} for every token.
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count: Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
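A driver that wires these two classes together might look roughly like the following, using the old org.apache.hadoop.mapred API the examples are written in; the input and output paths are command-line arguments, and MapClass and Reduce are assumed to be nested inside the WordCount class.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    // MapClass and Reduce from the previous two slides are assumed to be
    // nested classes of WordCount.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);   // optional local aggregation on the map side
        conf.setReducerClass(Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory (must not exist)

        JobClient.runJob(conf);
    }
}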
Examples – Stock Analysis Input dataset:
Symbol,Date,Open,High,Low,Close
GOOG,2010-03-19,555.23,568.00,557.28,560.00
YHOO,2010-03-19,16.62,16.81,16.34,16.44
GOOG,2010-03-18,564.72,568.44,562.96,566.40
YHOO,2010-03-18,16.46,16.57,16.32,16.56
Interested in biggest delta for each stock
Examples – Stock Analysis Map Output 	{“GOOG”, 10.72}, 	{“YHOO”, 0.47}, 	{“GOOG”, 5.48}, 	{“YHOO”, 0.25} Reduce Input: {“GOOG”, (10.72, 5.48)},{“YHOO”, (0.47, 0.25)} Output:{“GOOG”, 10.72},{“YHOO”, 0.47}
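One plausible mapper and reducer for this example, sketched in the same old mapred API, assuming the delta is High minus Low and the input CSV is well formed with no header row (imports mirror the word count example, plus org.apache.hadoop.io.DoubleWritable):

public static class DeltaMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, DoubleWritable> output,
                  Reporter reporter) throws IOException {
    // Symbol,Date,Open,High,Low,Close
    String[] fields = value.toString().split(",");
    String symbol = fields[0];
    double high = Double.parseDouble(fields[3]);
    double low = Double.parseDouble(fields[4]);
    output.collect(new Text(symbol), new DoubleWritable(high - low));
  }
}

public static class MaxDeltaReducer extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  public void reduce(Text key, Iterator<DoubleWritable> values,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    // Keep the largest daily delta seen for this symbol.
    double max = Double.NEGATIVE_INFINITY;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new DoubleWritable(max));
  }
}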
Examples – Time Series Analysis Map: {pointId, Timestamp + 30s of data} Reduce: Data mining! Classify samples based on training dataset Output samples that fall into interesting categories, index in database
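The time series job is only outlined on the slide; one plausible shape for it, with the record layout and the classifier left as hypothetical placeholders, might be:

public static class WindowMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    // Hypothetical record layout: pointId, windowTimestamp, 30s of samples...
    String[] fields = value.toString().split(",", 3);
    String windowKey = fields[0] + ":" + fields[1];   // pointId plus window start
    output.collect(new Text(windowKey), new Text(fields[2]));
  }
}

public static class ClassifyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    while (values.hasNext()) {
      String window = values.next().toString();
      String label = classify(window);          // hypothetical classifier trained offline
      if (!"normal".equals(label)) {
        output.collect(key, new Text(label));   // only emit the "interesting" windows
      }
    }
  }

  private String classify(String window) {
    return "normal";   // placeholder; a real job would apply the trained model here
  }
}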
Other Stuff Compatibility with Amazon Elastic Compute Cloud (EC2) Hadoop Streaming: MapReduce with anything that uses stdin/stdout HBase, distributed column-store database Pig, data analysis (transforms, filters, etc.) Hive, data warehousing infrastructure Mahout, machine learning algorithms
Parting Thoughts “We don't have better algorithms than anyone else. We just have more data.” Peter Norvig Artificial Intelligence: A Modern Approach Chief scientist at Google
Contact Galen Riley http://galenriley.com @TotallyGreat Josh Patterson http://jpatterson.floe.tv @jpatanooga


Editor's notes

  1. So everyone knows what data processing is, but what do we mean by “scale” ?
  2. Simply- Data. Is. Big.… So this is the trend. The amount of data we can collect is increasing exponentially, and most companies aren’t capable of handling it. Patterson likes to call this “the data tsunami.” Let’s talk about a real example of this…
  3. Okay, so data is big. No big deal, I’ve got a processor with four cores that will chew through anything. However, the speed of my application is constrained by the speed at which I can get data. It is not going to fit in memory, so it’s going to be living on a hard disk. This brings us to problem number 2.
  4. Hard drive speed comes from two numbers. Disk seek time: the time to move the read head on a drive to where the data is stored. Data transfer: the speed that I can get information off the disk. Hard drives are wonderful because they are random access devices, and I can get data anywhere off the disk any time I want it just by seeking and reading. Seeking takes a while, though. Fortunately, I can take advantage of locality and read a page of data into a buffer. Let's look at an example…
  5. This example has nice round numbers. I have a fictional hard drive that has a 10ms seek time and a 10 MB/second transfer speed. On it, there's 1TB of data, made up of 100-byte records in 10K pages. That's 10 billion entries over a billion pages, and I want to update 10% of this data set – a gig. … So seeks are slow, but I can transfer the whole file in a single day. Again, my application is always bound by the slowest piece in the pipe, so I get the most benefit by speeding that part up – the disk.
  6. Here are some real drives. I grabbed the specs that are advertised on Newegg. … Solid state drives are expensive though, $4k per terabyte. I can't afford to buy a new one every month for my sensor collection.
  7. … With this observation in mind, let's consider treating a hard disk (a random access device) like tape (a sequential device). … So we get closer to 1 day instead of a thousand.
  8. I bet a lot of you know where this is going already – parallelism! So let's say I've got one of those IDE drives from earlier. I can sequentially process 100TB of data in 16 days. Alright, let's get a thousand of them and run them in parallel – 75 gigs per second, and I'm done in 22 minutes. Alright, parallel processing solves our problem, let's call it a day!
  9. There’s an issue, though. Parallelism is really really hard.…And that’s not the only problem, either.
  10. Even if you have an OS that you know won’t crash, and code that won’t kill it either, reliability is still an issue because hardware fails.
  11. So let’s buy some expensive fault-tolerant hardware…
  12. A system that is robust in the face of machine failure. A platform that allows multiple groups to collaborate. A solution that scales linearly with respect to cost. A vision that will not lock us into a single vendor over time.
  13. Alright, now we're going to do some examples. I find it is most useful to look at what happens to the data instead of what the code looks like. Let's review MR, but think about the data. So I have a big file that I want to process. It is split up in blocks and spread over my cluster. I start a job and Hadoop initiates a bunch of map tasks; this processing occurs where the data exists already. The mapper reads in a part of the file, and emits several key/value pairs. These are collected, sorted into buckets based on key, and each bucket goes to a reduce task. Each reducer processes a bucket and outputs the result. Of course, I can chain these steps together. Word count is the ‘Hello World’ of MapReduce. I'm interested in the frequency of words in a dataset. … Word count isn't entirely silly, by the way. Consider the suggestions that pop up when you start a Google search. What you see is a list of search strings that people use frequently. Think of it as phrase count instead of word count.
  14. I've got some code here, but I'm going to skip going over it in detail. The slides will be available if you want to pore over it. We talked about MR being accessible for a programmer when compared to an MPI approach, and this is the entire map class for word count.
  15. Here's another example to illustrate that my map process can do more than just read data in and push it back out. Here's a file with information about stock prices – the ticker symbol, a date, the open price, the high and low prices for the day, and what it closed at. Since we're talking about big data sets here, I want you to imagine that it's got every stock for the last 50 years and there's not enough room on my slide to include it all. I'm interested in volatility or something, so I want the biggest change in price for a particular stock. Let's look at the data.
  16. My mapper reads in a record, filters out the information I’m not interested in (date and open/close prices), and emits the delta for each day.
  17. I think that collecting data without doing anything interesting with it is a big sin. So, here's a business case for someone in the room, perhaps. Say you want to grep through some server logs that you've been collecting forever but never got around to doing anything with. Amazon EC2 supports Hadoop, so you can run your job without having to buy any hardware at all. And a list of stuff that is built on top of Hadoop. … You don't have to write your jobs in Java. I know that I love Python, and I bet you do too. … We'll be contributing some of our time series stuff to the Mahout project.
  18. So let’s conclude with a quote from Peter Norvig that I think justifies our entire presentation.