Big Data is Here – Hadoop to the Rescue! Shay Sofer, AlphaCSP
Today we will: Understand what Big Data is Get to know Hadoop Experience some MapReduce magic Persist very large files Learn some nifty tricks On Today's Menu...
Data is Everywhere
IDC: “Total data in the universe: 1.2 Zettabytes” (May, 2010) 1 ZB = 1 Trillion Gigabytes (or: 1,000,000,000,000,000,000,000 bytes = 10^21) 60% Growth from 2009 By 2020 – we will reach 35 ZB Facts and Numbers Data is Everywhere
Facts and Numbers Data is Everywhere Source: www.idc.com
234M Web sites 7M New sites in 2009 New York Stock Exchange – 1 TB of data per day Web 2.0 147M Blogs (and counting…) Twitter – ~12 TB of data per day Facts and Numbers Data is Everywhere
500M users 40M photos per day More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month Facts and Numbers - Facebook Data is Everywhere
Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools Where and how do we store this information? How do we perform analyses on such large datasets? Why are you here? Data is Everywhere
Scale-up Vs. Scale-out Data is Everywhere
Scale-up : Adding resources to a single node in a system, typically involving the addition of CPUs or memory to a single computer Scale-out : Adding more nodes to a system. E.g. Adding a new computer with commodity hardware to a distributed software application Scale-up Vs. Scale-out Data is Everywhere
Introducing…Hadoop!
A framework for writing and running distributed applications that process large amounts of data. Runs on large clusters of commodity hardware A cluster with hundreds of machines is standard Inspired by Google’s architecture: MapReduce and GFS What is Hadoop? Hadoop
Robust - Handles failures of individual nodes Scales linearly Open source  A top-level Apache project Why Hadoop? Hadoop
Hadoop
Facebook holds the largest known Hadoop storage cluster in the world 2000 machines 12 TB per machine (some have 24 TB) 32 GB of RAM per machine Total of more than 21 Petabytes (1 Petabyte = 1024 Terabytes) Facebook (Again…) Hadoop
History Hadoop 2002: Apache Nutch – open source web search engine founded by Doug Cutting 2004: Google’s GFS & MapReduce papers published 2006: Cutting joins Yahoo!, forms Hadoop 2008: Hadoop hits web scale, being used by Yahoo! for web indexing 2008: Sorting 1 TB in 62 seconds 2010: Creating the longest Pi yet
Hadoop
IDE Plugin Hadoop
Hadoop and MapReduce
A programming model for processing and generating large data sets Introduced by Google  Parallel processing of the map/reduce operations Definition MapReduce
Sam believed “An apple a day keeps a doctor away” MapReduce – The Story of Sam Mother Sam An Apple Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
Sam thought of “drinking” the apple MapReduce – The Story of Sam Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
Sam applied his invention to all the fruits he could find in the fruit basket MapReduce – The Story of Sam A list of values mapped into another list of values, which gets reduced into a single value Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam Sam got his first job for his talent in making juice Fruits Large data and a list of values for output
But, Sam had just ONE and ONE Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam Sam implemented a parallel version of his innovation Each map input: a list of <key, value> pairs Fruits (<a, v>, <o, v>, <p, v>, …) Map Each map output: a list of <key, value> pairs (<a’, v’>, <o’, v’>, <p’, v’>, …) Grouped by key (shuffle) Each reduce input: <key, value-list> e.g. <a’, (v1, v2, …)> Reduce Reduced into a list of values Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
Mapper - Takes a series of key/value pairs, processes each and generates output key/value pairs: (k1, v1) → list(k2, v2) Reducer - Iterates through the values that are associated with a specific key and generates output: (k2, list(v2)) → list(k3, v3) The Mapper takes the input data, filters and transforms it into something the Reducer can aggregate over First Map, Then Reduce MapReduce
MapReduce (diagram: Input → Map → Shuffle → Reduce)
Hadoop comes with a number of predefined Writable classes BooleanWritable ByteWritable LongWritable Text, etc… Supports pluggable serialization frameworks Apache Avro Hadoop Data Types MapReduce
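To make the Writable contract concrete, here is a minimal sketch of a custom value type; the TrackPlays name and its fields are illustrative, not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom type pairing a track id with a play count.
// Keys would additionally need to implement WritableComparable.
public class TrackPlays implements Writable {
  private int trackId;
  private long plays;

  public void write(DataOutput out) throws IOException {
    out.writeInt(trackId);   // serialize the fields in a fixed order
    out.writeLong(plays);
  }

  public void readFields(DataInput in) throws IOException {
    trackId = in.readInt();  // deserialize in the same order
    plays = in.readLong();
  }
}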
TextInputFormat / TextOutputFormat KeyValueTextInputFormat SequenceFile - A Hadoop-specific compressed binary file format. Optimized for passing data between 2 MapReduce jobs Input / Output Formats MapReduce
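As a rough sketch of the SequenceFile API (the path is illustrative), writing and then reading back key/value pairs looks like this:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/tmp/pairs.seq"); // illustrative path

// Write key/value pairs
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
writer.append(new IntWritable(1), new Text("first record"));
writer.close();

// Read them back in order
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
IntWritable key = new IntWritable();
Text value = new Text();
while (reader.next(key, value)) {
  System.out.println(key + "\t" + value);
}
reader.close();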
Word Count – The Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

Input: <k1, "Hello World Bye World"> → Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Word Count – The Reducer

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Input: <Hello, 1> <World, 1> <Bye, 1> <World, 1> → Output: <Hello, 1> <World, 2> <Bye, 1>
Word Count – The Driver

public static void main(String[] args) throws IOException {
  JobConf job = new JobConf(WordCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(ReduceClass.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  // job.setInputFormat(KeyValueTextInputFormat.class);
  JobClient.runJob(job);
}
Music discovery website Scrobbling / Streaming via radio 40M unique visitors per month Over 40M scrobbles per day Each scrobble creates a log line Hadoop @ Last.FM MapReduce
Goal : Create a “Unique listeners per track” chart Sample listening data MapReduce
Unique Listens - Mapper

public void map(LongWritable position, Text rawLine,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter) throws IOException {
  int scrobbles, radioListens;   // assume they are parsed from rawLine
  IntWritable trackId, userId;   // likewise, omitted for brevity
  // if the track somehow is marked with zero plays - ignore it
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }
  // output user id against track id
  output.collect(trackId, userId);
}
Unique Listens - Reducer

public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {
  Set<Integer> usersSet = new HashSet<Integer>();
  // add all userIds to the set, duplicates removed
  while (values.hasNext()) {
    IntWritable userId = values.next();
    usersSet.add(userId.get());
  }
  // output: trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(usersSet.size()));
}
Complex tasks will sometimes need to be broken down into subtasks The output of the previous job goes as input to the next job job-a | job-b | job-c Simply launch the driver of the 2nd job after the 1st (sketched below) Chaining MapReduce
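A minimal sketch of such a chain, assuming two hypothetical driver classes FirstStep and SecondStep; JobClient.runJob() blocks, so each job starts only after the previous one finishes:

JobConf jobA = new JobConf(FirstStep.class);
FileInputFormat.addInputPath(jobA, new Path("input"));
FileOutputFormat.setOutputPath(jobA, new Path("intermediate"));
JobClient.runJob(jobA); // blocks until job-a completes

JobConf jobB = new JobConf(SecondStep.class);
FileInputFormat.addInputPath(jobB, new Path("intermediate")); // job-a's output
FileOutputFormat.setOutputPath(jobB, new Path("final"));
JobClient.runJob(jobB); // runs only after job-a succeeded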
Hadoop supports other languages via an API called Streaming Use UNIX commands as mappers and reducers Or use any script that processes a line-oriented data stream from STDIN and outputs to STDOUT Python, Perl etc. Hadoop Streaming MapReduce
$ hadoop jar hadoop-streaming.jar \
    -input input/myFile.txt \
    -output output.txt \
    -mapper myMapper.py \
    -reducer myReducer.py

Hadoop Streaming MapReduce
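The streaming contract is just line-oriented text: read records from STDIN, emit key TAB value on STDOUT. To stay with the Java used elsewhere in these slides, here is a sketch of a word-count mapper obeying that contract (a Python or Perl script would do the same job):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String token : line.split("\\s+")) {
        if (!token.isEmpty()) {
          System.out.println(token + "\t1"); // key TAB value
        }
      }
    }
  }
}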
HDFS Hadoop Distributed File System
A large dataset can and will outgrow the storage capacity of a single physical machine Partition it across separate machines – Distributed FileSystems Network based - complex What happens when a node fails? Distributed FileSystem HDFS
Designed for storing very large files running on clusters of commodity hardware Highly fault-tolerant (via replication) A typical file is gigabytes to terabytes in size High throughput HDFS - Hadoop Distributed FileSystem HDFS
Running Hadoop = Running a set of daemons on different servers in your network NameNode DataNode Secondary NameNode JobTracker TaskTracker Hadoop’s Building Blocks HDFS
Topology of a Hadoop Cluster Secondary NameNode NameNode JobTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker
HDFS has a master/slave architecture ; The NameNode acts as the master Single NameNode per HDFS Keeps track of : How the files are broken into blocks Which nodes store those blocks The overall health of the filesystem Memory and I/O intensive The NameNode HDFS
Each slave machine will host a DataNode daemon Serves read/write/delete requests from the NameNode Manages the storage attached to the nodes  Sends a periodic Heartbeat to the NameNode The DataNode HDFS
Failure is the norm rather than the exception Detection of faults and quick, automatic recovery Each file is stored as a sequence of blocks (default: 64MB each) The blocks of a file are replicated for fault tolerance Block size and replicas are configurable per file (see the sketch below) Fault Tolerance - Replication HDFS
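A sketch of setting those per-file knobs through the FileSystem API (the path and numbers are illustrative):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/data/hugeFile.txt"); // illustrative path

// Override the defaults for this one file:
// 4 KB write buffer, 2 replicas, 128 MB blocks.
FSDataOutputStream out =
    fs.create(path, true, 4096, (short) 2, 128 * 1024 * 1024L);

// Or change the replication factor of an existing file.
fs.setReplication(path, (short) 3);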
HDFS
Topology of a Hadoop Cluster Secondary NameNode NameNode JobTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker
Assistant daemon that should be on a dedicated node Takes snapshots of the HDFS metadata Doesn’t receive real time changes Helps minimize downtime in case the NameNode crashes Secondary NameNode HDFS
Topology of a Hadoop Cluster Secondary NameNode NameNode JobTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker
One per cluster - on the master node Receives job requests submitted by the client Schedules and monitors MapReduce jobs on TaskTrackers JobTracker HDFS
Run map and reduce tasks Send progress reports to the JobTracker TaskTracker HDFS
Via file commands:
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -put hugeFile.txt
$ hadoop fs -get anotherHugeFile.txt

Programmatically (HDFS API):
FileSystem hdfs = FileSystem.get(new Configuration());
FSDataOutputStream out = hdfs.create(filePath);
byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) { // 'in' is an InputStream over the local source file
  out.write(buffer, 0, bytesRead);
}
out.close();

Working with HDFS HDFS
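Reading mirrors writing; a short sketch using the same API (IOUtils here is Hadoop's org.apache.hadoop.io.IOUtils):

FSDataInputStream in = hdfs.open(filePath);
IOUtils.copyBytes(in, System.out, 4096, false); // stream the file contents to stdout
in.close();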
Tips & Tricks
Tip #1: Hadoop Configuration Types Tips & Tricks
Monitoring events in the cluster can prove to be a bit difficult The JobTracker exposes a web interface for our cluster Shows a summary of the cluster Details about the jobs that are currently running, completed and failed Tip #2: JobTracker UI Tips & Tricks
JobTracker Web UI (screenshot) Tips & Tricks
Digging through logs or… running the exact same scenario again, with the same input, on the same node? IsolationRunner can rerun the failed task to reproduce the problem Attach a debugger Set keep.failed.task.files=true (sketched below) Tip #3: IsolationRunner – Hadoop’s Time Machine Tips & Tricks
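The same switch can be flipped from the driver; a one-line sketch:

JobConf conf = new JobConf(WordCount.class);
conf.setKeepFailedTaskFiles(true); // equivalent to keep.failed.task.files=true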
The output of the map phase (which will be shuffled across the network) can be quite large Built-in support for compression Different codecs: gzip, bzip2 etc. Transparent to the developer conf.setCompressMapOutput(true); conf.setMapOutputCompressorClass(GzipCodec.class); Tip #4: Compression Tips & Tricks
A node can experience a slowdown, thus slowing down the entire job If a task is identified as “slow”, it will be scheduled to run on another node in parallel As soon as one finishes successfully, the others will be killed An optimization – not a feature (toggles sketched below) Tip #5: Speculative Execution Tips & Tricks
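Speculative execution is on by default and can be toggled per job; a minimal sketch:

JobConf conf = new JobConf(WordCount.class);
conf.setMapSpeculativeExecution(true);     // speculate slow map tasks
conf.setReduceSpeculativeExecution(false); // but not reduce tasks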
Input can come from 2 (or more) different sources Hadoop has a contrib package called datajoin Generic framework for performing reduce-side join Tip #6: DataJoin Package MapReduce
Hadoop in the Cloud Amazon Web Services
Cloud computing - shared resources and information are provided on demand Rent a cluster rather than buy it The best known infrastructure for cloud computing is Amazon Web Services (AWS) Launched in July 2002 Cloud Computing and AWS Hadoop in the Cloud
Elastic Compute Cloud (EC2) A large farm of VMs that a user can rent and use to run a computer application Wide range of instance types to choose from (price varies) Simple Storage Service (S3) – Online storage for persisting MapReduce data for future use Hadoop comes with built-in support for EC2 and S3 $ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>
EC2 Data Flow HDFS EC2 MapReduce Tasks Our Data
EC2 & S3 Data Flow S3 Our Data HDFS EC2 MapReduce Tasks
Hadoop-Related Projects
Thinking at the level of Map, Reduce and job chaining instead of simple data flow operations is non-trivial Pig simplifies Hadoop programming Provides a high-level data processing language: Pig Latin Being used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay etc. Problem: a Users file & a Pages file. Find the top 5 most visited pages by users aged 18-25 Pig Hadoop-Related Projects
Pig Latin – Data Flow Language

Users = LOAD 'users.csv' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages.csv' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites.csv';
A data warehousing package built on top of Hadoop SQL-like queries on large datasets  Hive Hadoop-Related Projects
Hadoop database for random read/write access Uses HDFS as the underlying file system Supports billions of rows and millions of columns Facebook chose HBase as a framework for their new version of “Messages” HBase Hadoop-Related Projects
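A rough sketch against the classic HBase client API (the table and column names are illustrative, and exact class names vary across HBase versions):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "messages"); // illustrative table name

// Write one cell: row key, column family, qualifier, value.
Put put = new Put(Bytes.toBytes("user42"));
put.add(Bytes.toBytes("msg"), Bytes.toBytes("subject"), Bytes.toBytes("Hi"));
table.put(put);

// Random read of the same cell.
Get get = new Get(Bytes.toBytes("user42"));
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("subject"));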
A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with patches and backports Cloudera Hadoop-Related Projects
Machine learning algorithms for Hadoop Coming up next..  (-: Mahout Hadoop-Related Projects
Big Data can and will cause serious scalability problems for your application MapReduce for analysis, a distributed filesystem for storage Hadoop = MapReduce + HDFS, and much more AWS integration is easy Lots of documentation Last words Summary
Hadoop in Action / Chuck Lam Hadoop: The Definitive Guide, 2nd Edition / Tom White (O’Reilly) Apache Hadoop Documentation Hadoop @ Last.FM Presentation MapReduce in Simple Terms / Saliya Ekanayake Amazon Web Services References
Thank you!
Editor's notes

  1. Let's say we have a huge log file that we need to analyze, or a large amount of data that we need to store – what shall we do? Add more CPUs or memory to a single machine (node)? Or add more nodes to the system?
  2. Talk about commodity hardware. "Commodity" hardware is hardware that is easily and affordably available. A device that is said to use "commodity hardware" is one that uses components that were previously available or designed and are thus not necessarily unique to that device. Unfortunately, at some point there won't be a big enough machine available for the larger data sets. More importantly, the high-end machines are not cost effective for many applications. For example, a machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Spreading and dividing the data between many machines will provide a much higher throughput – a distributed software application.
  3. 1. Because it runs on commodity hardware 2. With 2x machines it will run close to 2x faster 4. Talk about it being mature and very popular (then move to next slide)
  4. http://wiki.apache.org/hadoop/PoweredBy Stories: New York Times – converted 11 million articles from TIFF images to PDF Twitter – we use Hadoop to store and process tweets and log files
  5. http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html
  6. Cutting is also the founder of Apache Lucene, a popular text search library
  7. Although we've spoken about keys and values, we have yet to discuss their types. Speak about why Java serialization is bad, about pluggable frameworks and about using the predefined Hadoop types. http://tmrp.javaeye.com/blog/552696
  8. Those 2 static classes will reside in a single file. Those inner classes are independent – during job execution the Mapper and Reducer are replicated and run in various nodes in different JVMs
  9. Executing 2 jobs manually is possible, but it's more convenient to automate it. The input of the 2nd will be the output of the first. JobClient.runJob() is blocking.
  10. Useful for writing simple short programs that are rapidly developed in scriptsOr writing programs that can take advantage of non-Java libraries
  11. Useful for writing simple short programs that are rapidly developed in scriptsOr writing programs that can take advantage of non-Java libraries
  12. It becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network-based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
  13. Let's say you have 100 TB in a file – HDFS abstracts the complexity and gives you the illusion that you're dealing with a single file. Very large files: "very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
  14. Some of them exist only on one server and some across all servers
  15. The most important daemon. Make sure that the server hosting the NameNode will not store any data locally or perform any computations for a MapReduce program
  16. Constantly reports to the namenode – informs the namenode of which blocks it is currently storing. Datanodes also poll the namenode to provide information regarding local changes as well as receiving instructions to create, move or delete blocks
  17. all blocks in a file except the last block are the same size
  18. One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster. The namenode meanwhile merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.
  19. One per cluster. Takes snapshots by communicating with the namenode at intervals specified by configuration
  20. Relaunch, possibly on a different node – up to the configured number of retries
  21. If the JobTracker fails to receive a message from a TaskTracker, it assumes failure and submits the task to other nodes
  22. -help to look for help
  23. Local mode is to assist debugging and creating the logic. Usually the local mode and the pseudo mode will work on a subset of data. Pseudo is a "cluster of one". Switching is easy
  24. If there are bugs that sometimes cause a task to hang or slow down, then relying on speculative execution to avoid these problems is unwise and won't work reliably, since the same bugs are likely to affect the speculative task
  25. You may have a few large data processing jobs that occasionally take advantage of hundreds of nodes, but those same nodes will sit idle the rest of the time. You may be new to Hadoop and want to get familiar with it first before investing in a dedicated cluster. You may own a startup that needs to conserve cash and wants to avoid the capital expense of a Hadoop cluster. In these and other situations, it makes more sense to rent a cluster of machines rather than buy it. You can rent computing and storage services from AWS on demand as your requirement scales. As of this writing, renting a compute unit with the equivalent power of a 1.0 GHz 32-bit Opteron with 1.7 GB RAM and 160 GB disk storage costs $0.10 (varies if it's a Windows or Unix instance) per hour. Using a cluster of 100 such machines for an hour will cost a measly $10!
  26. Supported operating systems on EC2 include more than six variants of Linux, plus Windows Server and OpenSolaris. Other images include one of the operating systems plus pre-installed software, such as database server, Apache HTTP server, Java application server, and others. AWS offers preconfigured images of Hadoop running on Linux.
  27. Load users, load pages. Filter by age. Join by name. Group on URL. Count clicks, sort clicks, get top 5
  28. 1. Under the covers Pig turns the transformations into a series of MapReduce jobs, but we as programmers are unaware of this, which allows us to focus on the data rather than the nature of execution 2. Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. A Pig Latin program is a step-by-step set of operations on an input, whereas SQL statements are a set of statements taken together to produce output 3. A script file – a PigPen plugin for Eclipse also exists
  29. HBase – uses HDFS as the underlying file system. Supports billions of rows and millions of columns