Big Data Hadoop Rescue

Big Data is Here – Hadoop to the Rescue! Shay Sofer, AlphaCSP

On Today's Menu...
Today we will:
- Understand what Big Data is
- Get to know Hadoop
- Experience some MapReduce magic
- Persist very large files
- Learn some nifty tricks

[Data is Everywhere] Facts and Numbers
- IDC: “Total data in the universe: 1.2 zettabytes” (May 2010)
- 1 ZB = 1 trillion gigabytes (or 1,000,000,000,000,000,000,000 bytes = 10^21 bytes)
- 60% growth from 2009
- By 2020 we will reach 35 ZB

[Data is Everywhere] Facts and Numbers (chart; source: www.idc.com)

[Data is Everywhere] Facts and Numbers
- 234M web sites
  - 7M new sites in 2009
- New York Stock Exchange: 1 TB of data per day
- Web 2.0
  - 147M blogs (and counting…)
  - Twitter: ~12 TB of data per day

[Data is Everywhere] Facts and Numbers - Facebook
- 500M users
- 40M photos per day
- More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month

[Data is Everywhere] Why are you here?
- Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools
- Where and how do we store this information?
- How do we perform analyses on such large datasets?

[Data is Everywhere] Scale-up vs. Scale-out
- Scale-up: adding resources to a single node in a system, typically CPUs or memory added to a single computer
- Scale-out: adding more nodes to a system, e.g. adding a new computer with commodity hardware to a distributed software application

[Hadoop] What is Hadoop?
- A framework for writing and running distributed applications that process large amounts of data
- Runs on large clusters of commodity hardware
  - A cluster with hundreds of machines is standard
- Inspired by Google’s architecture: MapReduce and GFS

[Hadoop] Why Hadoop?
- Robust: handles failures of individual nodes
- Scales linearly
- Open source
  - A top-level Apache project

[Hadoop] Facebook (Again…)
- Facebook holds the largest known Hadoop storage cluster in the world
- 2000 machines
- 12 TB per machine (some have 24 TB)
- 32 GB of RAM per machine
- Total of more than 21 petabytes (1 petabyte = 1024 terabytes)

[Hadoop] History
- 2002: Apache Nutch, an open source web search engine founded by Doug Cutting
- 2004: Google’s GFS & MapReduce papers published
- 2006: Cutting joins Yahoo!, forms Hadoop
- 2008: Hadoop hits web scale, being used by Yahoo! for web indexing
- 2008: Sorting 1 TB in 62 seconds
- 2010: Creating the longest Pi yet

[MapReduce] Definition
- A programming model for processing and generating large data sets
- Introduced by Google
- Parallel processing of the map/reduce operations

MapReduce – The Story of Sam
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

- Sam believed “An apple a day keeps a doctor away” (pictured: Mother, Sam, an apple)

- Sam thought of “drinking” the apple

- Sam applied his invention to all the fruits he could find in the fruit basket
- A list of values mapped into another list of values, which gets reduced into a single value

- Sam got his first job for his talent in making juice
- Fruits: large data and a list of values for output

- But Sam had just ONE [image missing] and ONE [image missing]

- Sam implemented a parallel version of his innovation
- Each map input: list of <key, value> pairs, e.g. fruits (<a, ·>, <o, ·>, <p, ·>, …), where each value (·) was a picture on the slide
- Map
- Each map output: list of <key, value> pairs (<a’, ·>, <o’, ·>, <p’, ·>, …)
- Grouped by key (shuffle)
- Each reduce input: <key, value-list>, e.g. <a’, (· …)>
- Reduce
- Reduced into a list of values

[MapReduce] First Map, Then Reduce
- Mapper: takes a series of key/value pairs, processes each, and generates output key/value pairs
  (k1, v1) → list(k2, v2)
- Reducer: iterates through the values that are associated with a specific key and generates output
  (k2, list(v2)) → list(k3, v3)
- The Mapper takes the input data, filters it, and transforms it into something the Reducer can aggregate over

[MapReduce] Hadoop Data Types
- Hadoop comes with a number of predefined classes
  - BooleanWritable
  - ByteWritable
  - LongWritable
  - Text, etc.
- Supports pluggable serialization frameworks
  - Apache Avro

[MapReduce] Input / Output Formats
- TextInputFormat / TextOutputFormat
- KeyValueTextInputFormat
- SequenceFile: a Hadoop-specific compressed binary file format, optimized for passing data between two MapReduce jobs

Word Count – The Mapper

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // Reused for every token to avoid allocating a new Text per word
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, …) {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, new IntWritable(1));   // emit <word, 1>
        }
      }
    }

Input:  <K1, Hello World Bye World>
Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>

Word Count – The Reducer

    public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, …) {
        int sum = 0;
        // Add up all the 1s emitted by the mappers for this word
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Input:  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Output: <Hello, 1> <World, 2> <Bye, 1>

Word Count – The Driver

    public static void main(String[] args) throws IOException {
      JobConf job = new JobConf(WordCount.class);

      // Types of the keys and values the job writes out
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      job.setMapperClass(MapClass.class);
      job.setReducerClass(ReduceClass.class);

      // Input and output locations come from the command line
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      // job.setInputFormat(KeyValueTextInputFormat.class);

      JobClient.runJob(job);   // submits the job and waits for it to finish
    }
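The data flow of the word count job can be traced without a cluster. The following is only a plain-Java sketch of the three phases (map, shuffle, reduce); it uses no Hadoop classes, and the class and method names (`WordCountFlowSketch`, `map`, `shuffle`, `reduce`, `run`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map -> shuffle -> reduce. Not Hadoop code.
final class WordCountFlowSketch {

    // Map: one input line becomes a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Shuffle: collect all values that share a key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: (word, list of counts) becomes a single total for that word.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(word, ones)));
        return result;
    }

    public static void main(String[] args) {
        // prints {Bye=1, Hello=1, World=2}
        System.out.println(run(List.of("Hello World Bye World")));
    }
}
```

On a real cluster the map calls run on many nodes at once and the shuffle moves data across the network; here everything happens in one process, which is the main simplification.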

[MapReduce] Hadoop @ Last.FM
- Music discovery website
- Scrobbling / streaming via radio
- 40M unique visitors per month
- Over 40M scrobbles per day
- Each scrobble creates a log line

[MapReduce] Hadoop @ Last.FM
- Goal: create a “Unique listeners per track” chart
- Sample listening data (table not recoverable)

Unique Listens - Mapper

    public void map(LongWritable position, Text rawLine,
                    OutputCollector<IntWritable, IntWritable> output,
                    Reporter reporter) throws IOException {

      int scrobbles, radioListens;   // assume they are initialized -
      IntWritable trackId, userId;   // for verbosity

      // if track somehow is marked with zero plays - ignore
      if (scrobbles <= 0 && radioListens <= 0) {
        return;
      }

      // output user id against track id
      output.collect(trackId, userId);
    }

Unique Listens - Reducer

    public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                       OutputCollector<IntWritable, IntWritable> output,
                       Reporter reporter) throws IOException {

      Set<Integer> usersSet = new HashSet<Integer>();

      // add all userIds to the set, duplicates removed
      while (values.hasNext()) {
        IntWritable userId = values.next();
        usersSet.add(userId.get());
      }

      // output: trackId -> number of unique listeners per track
      output.collect(trackId, new IntWritable(usersSet.size()));
    }
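The unique-listeners job can be checked on a handful of records before it goes to a cluster. This is a plain-Java sketch of the same logic, not Last.FM's code: the `Listen` class and its field layout are assumptions standing in for a parsed log line, and the grouping map stands in for Hadoop's shuffle.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration of the "unique listeners per track" job. Not Hadoop code.
final class UniqueListenersSketch {

    // Hypothetical stand-in for one parsed log line.
    static final class Listen {
        final int userId;
        final int trackId;
        final int scrobbles;
        final int radioListens;

        Listen(int userId, int trackId, int scrobbles, int radioListens) {
            this.userId = userId;
            this.trackId = trackId;
            this.scrobbles = scrobbles;
            this.radioListens = radioListens;
        }
    }

    static Map<Integer, Integer> uniqueListeners(List<Listen> log) {
        // Map + shuffle: emit (trackId, userId) for played tracks, grouped by trackId.
        Map<Integer, List<Integer>> usersByTrack = new TreeMap<>();
        for (Listen l : log) {
            if (l.scrobbles <= 0 && l.radioListens <= 0) {
                continue; // track marked with zero plays: ignore
            }
            usersByTrack.computeIfAbsent(l.trackId, k -> new ArrayList<>()).add(l.userId);
        }

        // Reduce: a set removes duplicate user ids; its size is the listener count.
        Map<Integer, Integer> listenersPerTrack = new TreeMap<>();
        usersByTrack.forEach((trackId, userIds) -> {
            Set<Integer> unique = new HashSet<>(userIds);
            listenersPerTrack.put(trackId, unique.size());
        });
        return listenersPerTrack;
    }

    public static void main(String[] args) {
        List<Listen> log = List.of(
            new Listen(1, 100, 1, 0),
            new Listen(1, 100, 0, 2),   // same user, same track: counted once
            new Listen(2, 100, 1, 0),
            new Listen(3, 200, 0, 0),   // zero plays: ignored
            new Listen(2, 200, 0, 1));
        // prints {100=2, 200=1}
        System.out.println(uniqueListeners(log));
    }
}
```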

[MapReduce] Chaining
- Complex tasks sometimes need to be broken down into subtasks
- Output of the previous job goes as input to the next job
  job-a | job-b | job-c
- Simply launch the driver of the 2nd job after the 1st
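The idea of chaining, where the output of one job is the input of the next, can be sketched in plain Java with function composition. This is an illustration only: the two "jobs" are ordinary in-memory methods with made-up names, whereas on Hadoop each would be a separate MapReduce job whose driver is launched after the previous one finishes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java illustration of job-a | job-b. Not Hadoop code.
final class JobChainSketch {

    // job-a: lines of text -> number of occurrences per word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // job-b: word counts -> how many distinct words occur exactly n times.
    static Map<Integer, Integer> countFrequencies(Map<String, Integer> wordCounts) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int n : wordCounts.values()) {
            histogram.merge(n, 1, Integer::sum);
        }
        return histogram;
    }

    // The chain: job-b consumes exactly what job-a produced.
    static Map<Integer, Integer> runChain(List<String> lines) {
        Function<List<String>, Map<String, Integer>> jobA = JobChainSketch::countWords;
        Function<Map<String, Integer>, Map<Integer, Integer>> jobB = JobChainSketch::countFrequencies;
        return jobA.andThen(jobB).apply(lines);
    }

    public static void main(String[] args) {
        // Hello and Bye occur once, World occurs twice: prints {1=2, 2=1}
        System.out.println(runChain(List.of("Hello World Bye World")));
    }
}
```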

[MapReduce] Hadoop Streaming
- Hadoop supports other languages via an API called Streaming
- Use UNIX commands as mappers and reducers
- Or use any script that processes a line-oriented data stream from STDIN and outputs to STDOUT
  - Python, Perl, etc.

[MapReduce] Hadoop Streaming

    $ hadoop jar hadoop-streaming.jar \
        -input input/myFile.txt \
        -output output.txt \
        -mapper myMapper.py \
        -reducer myReducer.py

HDFS Hadoop Distributed File System

[HDFS] Distributed FileSystem
- A large dataset can and will outgrow the storage capacity of a single physical machine
- Partition it across separate machines: distributed filesystems
- Network-based, hence complex
- What happens when a node fails?

[HDFS] HDFS - Hadoop Distributed FileSystem
- Designed for storing very large files, running on clusters of commodity hardware
- Highly fault-tolerant (via replication)
- A typical file is gigabytes to terabytes in size
- High throughput

[HDFS] Hadoop’s Building Blocks
- Running Hadoop = running a set of daemons on different servers in your network
  - NameNode
  - DataNode
  - Secondary NameNode
  - JobTracker
  - TaskTracker

Topology of a Hadoop Cluster (diagram)
- Secondary NameNode
- NameNode
- JobTracker
- 4 × (DataNode + TaskTracker)

[HDFS] The NameNode
- HDFS has a master/slave architecture; the NameNode acts as the master
- Single NameNode per HDFS
- Keeps track of:
  - How the files are broken into blocks
  - Which nodes store those blocks
  - The overall health of the filesystem
- Memory and I/O intensive

[HDFS] The DataNode
- Each slave machine will host a DataNode daemon
- Serves read/write/delete requests from the NameNode
- Manages the storage attached to the nodes
- Sends a periodic heartbeat to the NameNode

[HDFS] Fault Tolerance - Replication
- Failure is the norm rather than the exception
- Detection of faults and quick, automatic recovery
- Each file is stored as a sequence of blocks (default: 64 MB each)
- The blocks of a file are replicated for fault tolerance
- Block size and number of replicas are configurable per file
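The block and replica numbers translate into simple arithmetic. The sketch below works it through for a 1 GB file with the 64 MB block size from the slide; the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state it), and the class and method names are illustrative only.

```java
// Back-of-the-envelope block arithmetic. Not Hadoop code.
final class BlockMathSketch {

    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024 * MB;
        long blockSize = 64 * MB;   // default block size from the slide
        int replication = 3;        // assumed replication factor

        System.out.println(blockCount(oneGb, blockSize));               // 16 blocks
        System.out.println(blockCount(oneGb, blockSize) * replication); // 48 block replicas
        System.out.println(rawBytes(oneGb, replication) / MB);          // 3072 MB of raw storage
    }
}
```

A file that is one byte larger than 1 GB needs a 17th block, but that last block only occupies as much disk space as the data it holds.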

[HDFS] Secondary NameNode
- Assistant daemon that should be on a dedicated node
- Takes snapshots of the HDFS metadata
- Doesn’t receive real-time changes
- Helps minimize downtime in case the NameNode crashes

[HDFS] JobTracker
- One per cluster, on the master node
- Receives job requests submitted by the client
- Schedules and monitors MapReduce jobs on TaskTrackers

[HDFS] TaskTracker
- Runs map and reduce tasks
- Sends progress reports to the JobTracker

[HDFS] Working with HDFS
- Via file commands

    $ hadoop fs -mkdir /user/chuck
    $ hadoop fs -put hugeFile.txt
    $ hadoop fs -get anotherHugeFile.txt

- Programmatically (HDFS API)

    FileSystem hdfs = FileSystem.get(new Configuration());
    FSDataOutputStream out = hdfs.create(filePath);
    while (...) {
      out.write(buffer, 0, bytesRead);
    }

[Tips & Tricks] Tip #1: Hadoop Configuration Types

[Tips & Tricks] Tip #2: JobTracker UI
- Monitoring events in the cluster can prove to be a bit more difficult
- Web interface for our cluster
- Shows a summary of the cluster
- Details about the jobs that are currently running, completed, and failed

[Tips & Tricks] WebTracker UI (screenshot)

[Tips & Tricks] Tip #3: IsolationRunner – Hadoop’s Time Machine
- Digging through logs or… running the exact same scenario again with the same input on the same node?
- IsolationRunner can rerun the failed task to reproduce the problem
- Attach a debugger
- keep.failed.task.files = true

[Tips & Tricks] Tip #4: Compression
- Output of the map phase (which will be shuffled across the network) can be quite large
- Built-in support for compression
- Different codecs: gzip, bzip2, etc.
- Transparent to the developer

    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
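Why compressing map output pays off can be seen with the JDK's own gzip streams. This is a sketch under assumptions: it uses `java.util.zip` rather than Hadoop's `GzipCodec`, and the sample data (10,000 identical `World\t1` records) is invented and far more repetitive than real map output, so the ratio it shows is optimistic.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration of gzip on map-output-like records, using only the JDK.
final class MapOutputCompressionSketch {

    // Compresses a byte array with gzip.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Restores the original bytes, as the reduce side would before processing.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Invented sample: the kind of "word<TAB>1" lines a word count mapper emits.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("World\t1\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

The trade-off is CPU time for compressing and decompressing against fewer bytes sent across the network during the shuffle.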

[Tips & Tricks] Tip #5: Speculative Execution
- A node can experience a slowdown, thus slowing down the entire job
- If a task is identified as “slow”, it will be scheduled to run on another node in parallel
- As soon as one finishes successfully, the others will be killed
- An optimization, not a feature
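The "first successful attempt wins, the rest are killed" rule can be imitated on a single machine with the JDK's `ExecutorService.invokeAny`. This is only an analogy, not how Hadoop schedules tasks: here both attempts start at the same time, whereas Hadoop launches the second attempt only after the first is identified as slow. The node names and sleep times are invented.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Single-machine analogy for speculative execution. Not Hadoop code.
final class SpeculativeExecutionSketch {

    // A hypothetical task attempt: sleeps to simulate node speed, then reports where it ran.
    static Callable<String> attempt(String node, long millis) {
        return () -> {
            Thread.sleep(millis);
            return "task finished on " + node;
        };
    }

    // Runs all attempts in parallel and returns the result of one that completed successfully.
    static String runSpeculatively(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            // invokeAny cancels the attempts that have not completed once it has a result.
            return pool.invokeAny(attempts);
        } catch (Exception e) {
            throw new IllegalStateException("no attempt finished successfully", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String winner = runSpeculatively(List.of(
            attempt("slow-node", 2_000),
            attempt("fast-node", 10)));
        System.out.println(winner);
    }
}
```

Because both attempts compute the same result, it does not matter which one wins; the duplicate work is the price paid for not waiting on a straggler.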

[MapReduce] Tip #6: DataJoin Package
- Input can come from 2 (or more) different sources
- Hadoop has a contrib package called datajoin
- Generic framework for performing reduce-side joins
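The general idea of a reduce-side join can be sketched in plain Java. This does not use the datajoin package or mirror its API; it only illustrates the technique: the map phase tags each record with its source, the shuffle groups records by join key, and the reduce phase combines records from the two sources that share a key. The sample data and all names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a reduce-side inner join. Not the datajoin API.
final class ReduceSideJoinSketch {

    // Each input record is {joinKey, value}. Each output line is "key,leftValue,rightValue".
    static List<String> join(List<String[]> left, List<String[]> right) {
        // Map + shuffle: tag every record with its source ("L" or "R") and group by join key.
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (String[] r : left) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add(new String[] {"L", r[1]});
        }
        for (String[] r : right) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add(new String[] {"R", r[1]});
        }

        // Reduce: for each key, pair every left value with every right value.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : grouped.entrySet()) {
            for (String[] l : group.getValue()) {
                if (!l[0].equals("L")) {
                    continue;
                }
                for (String[] r : group.getValue()) {
                    if (r[0].equals("R")) {
                        joined.add(group.getKey() + "," + l[1] + "," + r[1]);
                    }
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
            new String[] {"amy", "22"},
            new String[] {"bob", "31"});
        List<String[]> pages = List.of(
            new String[] {"amy", "/home"},
            new String[] {"amy", "/news"},
            new String[] {"carl", "/home"});
        // prints [amy,22,/home, amy,22,/news]
        System.out.println(join(users, pages));
    }
}
```

Keys that appear in only one source (bob and carl in the sample) produce no output, which makes this an inner join.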

Hadoop in the Cloud Amazon Web Services

[Hadoop in the Cloud] Cloud Computing and AWS
- Cloud computing: shared resources and information are provided on demand
- Rent a cluster rather than buy it
- The best-known infrastructure for cloud computing is Amazon Web Services (AWS)
  - Launched in July 2002

Hadoop in the Cloud – Core Services
- Elastic Compute Cloud (EC2)
  - A large farm of VMs that a user can rent and use to run a computer application
  - Wide range of instance types to choose from (price varies)
- Simple Storage Service (S3): online storage for persisting MapReduce data for future use
- Hadoop comes with built-in support for EC2 and S3

    $ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>

EC2 Data Flow (diagram labels: Our Data, HDFS, EC2, MapReduce Tasks)

EC2 & S3 Data Flow (diagram labels: Our Data, S3, HDFS, EC2, MapReduce Tasks)

[Hadoop-Related Projects] Pig
- Thinking at the level of Map, Reduce, and job chaining instead of simple data flow operations is non-trivial
- Pig simplifies Hadoop programming
- Provides a high-level data processing language: Pig Latin
- Used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay, etc.
- Problem: given a Users file and a Pages file, find the top 5 most visited pages by users aged 18-25

Pig Latin – Data Flow Language

    Users = LOAD 'users.csv' AS (name, age);
    Fltrd = FILTER Users BY age >= 18 AND age <= 25;
    Pages = LOAD 'pages.csv' AS (user, url);
    Jnd   = JOIN Fltrd BY name, Pages BY user;
    Grpd  = GROUP Jnd BY url;
    Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
    Srtd  = ORDER Smmd BY clicks DESC;
    Top5  = LIMIT Srtd 5;
    STORE Top5 INTO 'top5sites.csv';
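To see what the Pig Latin script computes, the same steps can be followed on in-memory data with Java streams. This is only a sketch for comparison: it is not how Pig executes the script, the inputs are assumed to be a name-to-age map and a list of {user, url} pairs, and the tie-break by URL is an addition to make the output deterministic.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// In-memory equivalent of the Pig Latin data flow. Not Pig or Hadoop code.
final class TopPagesSketch {

    static List<String> topPages(Map<String, Integer> ageByUser, List<String[]> visits, int limit) {
        // FILTER: keep users aged 18-25.
        Set<String> youngUsers = ageByUser.entrySet().stream()
            .filter(e -> e.getValue() >= 18 && e.getValue() <= 25)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());

        // JOIN + GROUP + COUNT: count visits per url, for visits made by those users.
        Map<String, Long> clicksByUrl = visits.stream()
            .filter(v -> youngUsers.contains(v[0]))
            .collect(Collectors.groupingBy(v -> v[1], Collectors.counting()));

        // ORDER ... DESC + LIMIT: most visited first, ties broken by url.
        return clicksByUrl.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> ages = Map.of("amy", 22, "bob", 31, "cat", 19);
        List<String[]> visits = List.of(
            new String[] {"amy", "/home"},
            new String[] {"amy", "/news"},
            new String[] {"cat", "/home"},
            new String[] {"bob", "/sports"},
            new String[] {"bob", "/sports"},
            new String[] {"bob", "/sports"});
        // bob is outside the age range, so /sports is not counted: prints [/home, /news]
        System.out.println(topPages(ages, visits, 5));
    }
}
```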

[Hadoop-Related Projects] Hive
- A data warehousing package built on top of Hadoop
- SQL-like queries on large datasets

[Hadoop-Related Projects] HBase
- Hadoop database for random read/write access
- Uses HDFS as the underlying file system
- Supports billions of rows and millions of columns
- Facebook chose HBase as a framework for their new version of “Messages”

[Hadoop-Related Projects] Cloudera
- A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with backports

[Hadoop-Related Projects] Mahout
- Machine learning algorithms for Hadoop
- Coming up next.. (-:

[Summary] Last words
- Big Data can and will cause serious scalability problems for your application
- MapReduce for analysis, distributed filesystem for storage
- Hadoop = MapReduce + HDFS, and much more
- AWS integration is easy
- Lots of documentation

References
- Hadoop in Action / Chuck Lam
- Hadoop: The Definitive Guide, 2nd Edition / Tom White (O’Reilly)
- Apache Hadoop Documentation
- Hadoop @ Last.FM Presentation
- MapReduce in Simple Terms / Saliya Ekanayake
- Amazon Web Services
