This document discusses big data and Hadoop. It begins by explaining how data is growing exponentially and defining what big data is. It then introduces Hadoop as an open-source framework for storing and processing big data across clusters of commodity hardware. The rest of the document provides details on the key components of Hadoop, including HDFS for distributed storage, MapReduce for distributed processing, and various related projects like Pig, Hive and HBase that build on Hadoop.
1. Big Data is Here – Hadoop to the Rescue! Shay Sofer, AlphaCSP
2. Today we will: Understand what is BigData Get to know Hadoop Experience some MapReduce magic Persist very large files Learn some nifty tricks On Today's Menu...
4. IDC : “Total data in the universe : 1.2 Zettabytes” (May, 2010) 1ZB = 1 Trillion Gigabytes (or: 1,000,000,000,000,000,000,000 bytes = 1021) 60% Growth from 2009 By 2020 – we will reach 35 ZB Facts and Numbers Data is Everywhere
6. 234M Web sites 7M New sites in 2009 New York Stock Exchange – 1 TB of data per day Web 2.0 147M Blogs (and counting…) Twitter – ~12 TB of data per day Facts and Numbers Data is Everywhere
7. 500M users 40M photos per day More than 30billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month Facts and Numbers - Facebook Data is Everywhere
8. Big dataare datasets that grow so large that they become awkward to work with using on-hand database management tools Where and how do we store this information? How do we perform analyses on such large datasets? Why are you here? Data is Everywhere
10. Scale-up : Adding resources to a single node in a system, typically involving the addition of CPUs or memory to a single computer Scale-out : Adding more nodes to a system. E.g. Adding a new computer with commodity hardware to a distributed software application Scale-up Vs. Scale-out Data is Everywhere
12. A framework for writing and running distributed applications that process large amount of data. Runs on large clusters of commodity hardware A cluster with hundreds of machine is standard Inspired by Google’s architecture : MapReduce and GFS What is Hadoop? Hadoop
13. Robust - Handles failures of individual nodes Scales linearly Open source A top-level Apache project Why Hadoop? Hadoop
15. Facebook holds the largest known Hadoop storage cluster in the world 2000 machines 12 TB per machine (some has 24 TB) 32 GB of RAM per machine Total of more than 21 Petabytes (1 Petabyte = 1024 Terabytes) Facebook (Again…) Hadoop
16. History Hadoop Apache Nutch – Open Source web search engine founded by Doug Cutting Cutting joins Yahoo!, forms Hadoop Sorting 1 TB in 62 seconds 2004 2006 2008 2008 2002 2010 Google’s GFS & MapReduce papers published Creating the longest Pi yet Hadoop hits web scale, being used by Yahoo! for web indexing
20. A programming model for processing and generating large data sets Introduced by Google Parallel processing of the map/reduce operations Definition MapReduce
21. Sam believed “An apple a day keeps a doctor away” MapReduce – The Story of Sam Mother Sam An Apple Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
22.
23.
24.
25. But, Sam had just ONE and ONE Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
26. MapReduce – The Story of Sam Sam Implemented a parallelversion of his innovation Each map input: list of <key, value> pairs Fruits (<a, > , <o, > , <p ,> , …) Map Each map output: list of <key, value> pairs (<a’ , > , <o’, v > , <p’ , > , …) Grouped by key (shuffle) Each reduce input: <key, value-list> e.g. <a’, ( …)> Reduce Reduced into a list of values Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
27. Mapper- Takes a series of key/value pairs, processes each and generates output key/value pairs (k1, v1) list(k2, v2) Reducer- Iterates through the values that are associated with a specific key and generate output (k2, list (v2)) list(k3, v3) The Mapper takes the input data, filters and transforms into something The Reducercan aggregate over First Map, Then Reduce MapReduce
29. Hadoop comes with a number of predefined classes BooleanWritable ByteWritable LongWritable Text, etc… Supports pluggable serialization frameworks Apache Avro Hadoop Data Types MapReduce
30. TextInputFormat / TextOutputFormat KeyValueTextInputFormat SequenceFile - A Hadoopspecific compressed binary file format. Optimized for passing data between 2 MapReduce jobs Input / Output Formats MapReduce
31. publicstaticclass MapClass extends MapReduceBase privateText word = new Text(); publicvoid map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, …){ String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while(itr.hasMoreTokens()){ word.set(itr.nextToken()); output.collect(word,newIntWritable(1)); } } } Word Count – The Mapper implements Mapper<LongWritable,Text,Text,IntWritable> < Hello, 1> < World, 1> < Bye, 1> < World, 1> <K1,Hello World Bye World>
34. Music discovery website Scrobbling / Streaming VIA radio 40M unique visitors per month Over 40M scrobbles per day Each scrobble creates a log line Hadoop @ Last.FM MapReduce
35.
36. Goal : Create a “Unique listeners per track” chart Sample listening data MapReduce
37. publicvoid map(LongWritable position, Text rawLine, OutputCollector<IntWritable,IntWritable> output, Reporter reporter) throwsIOException { intscrobbles, radioListens; // assume they are initialized - IntWritabletrackId,userId; // for verbosity // if track somehow is marked with zero plays - ignore if (scrobbles <= 0 && radioListens <= 0) { return; } // output user id against track id output.collect(trackId, userId); } Unique Listens - Mapper
38. publicvoid reduce(IntWritabletrackId, Iterator<IntWritable> values, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throwsIOException { Set<Integer> usersSet = newHashSet<Integer>(); // add all userIds to the set, duplicates removed while (values.hasNext()) { IntWritableuserId = values.next(); usersSet.add(userId.get()); } // output: trackId -> number of unique listeners per track output.collect(trackId, newIntWritable(usersSet.size())); } Unique Listens - Reducer
39. Complex tasks will sometimes be needed to be broken down to subtasks Output of the previous job goes as input to the next job job-a | job-b | job-c Simply launch the driver of the 2nd job after the 1st Chaining MapReduce
40. Hadoop supports other languages via API called Streaming Use UNIX commands as mappers and reducers Or use any script that processes line-oriented data stream from STDIN and outputs to STDOUT Python, Perl etc. Hadoop Streaming MapReduce
43. A large dataset can and will outgrow the storage capacity of a single physical machine Partition it across separate machines – Distributed FileSystems Network based - complex What happens when a node fails? Distributed FileSystem HDFS
44. Designed for storing very large files running on clusters on commodity hardware Highly fault-tolerant (via replication) A typical file is gigabytes to terabytes in size High throughput HDFS - Hadoop Distributed FileSystem HDFS
45. Running Hadoop = Running a set of daemons on different servers in your network NameNode DataNode Secondary NameNode JobTracker TaskTracker Hadoop’s Building Blocks HDFS
46. Topology of a Hadoop Cluster Secondary NameNode NameNode JobTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker
47. HDFS has a master/slave architecture ; The NameNode acts as the master Single NameNode per HDFS Keeps track of : How the files are broken into blocks Which nodes store those blocks The overall health of the filesystem Memory and I/O intensive The NameNode HDFS
48. Each slave machine will host a DataNode daemon Serves read/write/delete requests from the NameNode Manages the storage attached to the nodes Sends a periodic Heartbeat to the NameNode The DataNode HDFS
49. Failure is the norm rather than exception Detection of faults and quick, automatic recovery Each file is stored as a sequence of blocks (default: 64MB each) The blocks of a file are replicated for fault tolerance Block size and replicas are configurable per file Fault Tolerance - Replication HDFS
51. Topology of a Hadoop Cluster Secondary NameNode NameNode JobTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker
52. Assistant daemon that should be on a dedicated node Takes snapshots of the HDFS metadata Doesn’t receive real time changes Helps minimizing downtime incase the NameNode crashes Secondary NameNode HDFS
53. Topology of a Hadoop Cluster Secondary NameNode NameNode JobTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode TaskTracker
54. One per cluster - on the master node Receives job request submitted by the client Schedules and monitors MapReduce jobs on TaskTrackers JobTracker HDFS
55. Run map and reduce tasks Send progress reports to the JobTracker TaskTracker HDFS
56. VIA file commands $ hadoopfs -mkdir /user/chuck $ hadoopfs -put hugeFile.txt $ hadoopfs -get anotherHugeFile.txt Programmatically (HDFS API) FileSystem hdfs = FileSystem.get(new Configuration()); FSDataOutStream out = hdfs.create(filePath); while(...){ out.write(buffer,0,bytesRead); } Working with HDFS HDFS
59. Monitoring events in the cluster can prove to be a bit more difficult Web interface for our cluster Shows a summary of the cluster Details about list of jobs there are currently running, completed and failed Tip #2: JobTracker UI Tips & Tricks
61. Digging through logs or…. Running again the exact same scenario with the same input on the same node? IsolationRunner can rerun the failed task to reproduce the problem Attach a debugger Keep.failed.tasks.file= true Tip #3: IsolationRunner – Hadoop’s Time Machine Tips & Tricks
62. Output of the map phase (which will be shuffled across the network) can be quite large Built in support for compression Different codecs : gzip, bzip2 etc Transparent to the developer conf.setCompressMapOutput(true); conf.setMapOutputCompressorClass(GzipCodec.class); Tip #4: Compression Tips & Tricks
63. A node can experience a slowdown, thus slowing down the entire job If a task is identified as “slow”, it will be scheduled to run in another node in parallel As soon as one finishes successfully, the others will be killed An optimization – not a feature Tip #5: Speculative Execution Tips & Tricks
64. Input can come from 2 (or more) different sources Hadoop has a contrib package called datajoin Generic framework for performing reduce-side join Tip #6: DataJoin Package MapReduce
66. Cloud computing - Shared resources and information are provided on demand Rent a cluster rather than buy it The best known infrastructure for cloud computing is Amazon Web Services (AWS) Launched at July 2002 Cloud Computing and AWS Hadoop in the Cloud
67. Elastic Compute Cloud (EC2) A large farm of VMs where a user can rent and use them to run a computer application Wide range on instance types to choose from (price varies) Simple Storage Service (S3) – Online storage for persisting MapReduce data for future use Hadoop comes with built in support for EC2 and S3 $ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves> Hadoop in the Cloud – Core Services
71. Thinking in the level of Map, Reduce and job chaining instead of simple data flow operations is non-trivial Pig simplifies Hadoop programming Provides high-level data processing language : Pig Latin Being used by Yahoo! (70% of production jobs), Twitter, LinkedIn, EBay etc.. Problem: Users file & Pages file. Find top 5 most visited pages by users aged 18-25 Pig Hadoop-Related Projects
72. Users = LOAD ‘users.csv’ AS (name, age); Fltrd = FILTER Users BYage >= 18 AND age <= 25; Pages = LOAD ‘pages.csv’ AS (user, url); Jnd = JOIN Fltrd BY name, Pages BY user; Grpd = GROUP Jnd BY url; Smmd = FOREACH Grpd GENERATEgroup, COUNT(Jnd) AS clicks; Srtd = ORDER Smmd BY clicks DESC; Top5 = LIMIT Srtd 5; STORE Top5 INTO ‘top5sites.csv’; Pig Latin – Data Flow Language
73. A data warehousing package built on top of Hadoop SQL-like queries on large datasets Hive Hadoop-Related Projects
74. Hadoop database for random read/write access Uses HDFS as the underlying file system Supports billions of rows and millions of columns Facebook chose HBase as a framework for their new version of “Messages” HBase Hadoop-Related Projects
75. A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with and backports Cloudera Hadoop-Related Projects
77. Big Data can and will cause serious scalability problems to your application MapReduce for analysis, Distributed filesystem for storage Hadoop = MapReduce + HDFS and much more AWS integration is easy Lots of documentation Last words Summary
78. Hadoop in Action / Chuck Lam Hadoop: The Definitive Guide, 2nd Edition / Tom White (O’reilly) Apache Hadoop Documentation Hadoop @ Last.FM Presentation MapReduce in Simple Terms / SaliyaEkanayake Amazon Web Services References
79. Thank you!
Notes de l'éditeur
Lets say we have a huge log file that we need to analyzeOr that we have large amount of data that we need to store – what shall we do?Add more CPUs or memory to single machine (node?)Or add more nodes to the system?
Talk about commodity hardware"Commodity" hardware is hardware that is easily and affordably available. A device that is said to use "commodity hardware" is one that uses components that were previously available or designed and are thus not necessarily unique to that device. Unfortunately, at some point there won’t be a big enoughmachine available for the larger data sets. More importantly, the high-end machinesare not cost effective for many applications. For example, a machine with four timesthe power of a standard PC costs a lot more than putting four such PCs in a cluster.Spreading and dividing the data between many machines will provide a much higher throughput - distributed software application
1. Because it runs on commodity hardware2. With X2 machines it will run close to X2 faster4. Talk about it being mature and very popular (then move to next slide)
http://wiki.apache.org/hadoop/PoweredByStories: New york times – converted 11 million articles from TIFF images to PDFTwitter - We use Hadoop to store and process tweets, log files
Cutting is also the founder of Apache Lucene, popular text search library
Although we’ve spoken about key and values we have yet to discuss their typesSpeak about why java serialization is bad , about pluggable frameworks and about using predefined hadoop.http://tmrp.javaeye.com/blog/552696
Those 2 static classes will reside in a single file. Those inner classses are independent – during job execution the Mapper and Reducer are replicated and run in various nodes in different JVMs
Executing 2 jobs manually is possible, but its more convenient to automate itThe input of the 2nd will be the output of the firstJobClient.runJob() is blocking.
Useful for writing simple short programs that are rapidly developed in scriptsOr writing programs that can take advantage of non-Java libraries
Useful for writing simple short programs that are rapidly developed in scriptsOr writing programs that can take advantage of non-Java libraries
it becomes necessary to partition it across a number of separate machines. Filesystems that managethe storage across a network of machines are called distributed filesystems. Since theyare network-based, all the complications of network programming kick in, thus makingdistributed filesystems more complex than regular disk filesystems. For example, oneof the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
Lets say you have a 100TB in a file – HDFS abstracts the complexity and give you the illusion that you're dealing with a single fileVery large files“Very large” in this context means files that are hundreds of megabytes, gigabytes,or terabytes in size. There are Hadoop clusters running today that store petabytesof data.*Streaming data accessHDFS is built around the idea that the most efficient data processing pattern is awrite-once, read-many-times pattern. A dataset is typically generated or copiedfrom source, then various analyses are performed on that dataset over time. Eachanalysis will involve a large proportion, if not all, of the dataset, so the time to readthe whole dataset is more important than the latency in reading the first record.Commodity hardwareHadoop doesn’t require expensive, highly reliable hardware to run on. It’s designedto run on clusters of commodity hardware (commonly available hardware availablefrom multiple vendors†) for which the chance of node failure across the cluster ishigh, at least for large clusters. HDFS is designed to carry on working without anoticeable interruption to the user in the face of such failure
Some of them exist only on one server and some across all servers
The most important daemonMake sure that the server hosting the NameNode will not store any data locally or perform any computations for a MapReduce program
Constantly reports to the namenode – informs the namenode of which blocks it is currently storingDatanodes also poll the namenode to provide information regarding local changes as well as receiving instructions to create move or delete blocks
all blocks in a file except the last block are the same size
One important aspect of this design is that the client contacts datanodes directly toretrieve data and is guided by the namenode to the best datanode for each block. Thisdesign allows HDFS to scale to a large number of concurrent clients, since the datatraffic is spread across all the datanodes in the cluster. The namenode meanwhile merelyhas to service block location requests (which it stores in memory, making them veryefficient) and does not, for example, serve data, which would quickly become a bot-tleneck as the number of clients grew.
One per clusterTakes snapshots by communicating with the namenode at interval specified by configuration
Relaunch possibly on a different node – up to num of retries
If JobTrackerfails to receive a message from a TaskTracker it assumes failure and submit the task to other nodes
-help to look for help
Local mode is to assist debugging and create the logicUsually the local mode and the Pseudo do mode will work on a subset of dataPseudo is “cluster of one”Switching is easy
If there are bugs that sometimes cause a task to hand or slow down then relying on the speculative execution to avoid these problems is unwise and won’t work reliably since the same bugs are likely to affect the speculative task
You may have a few large data processing jobs that occasionally take advantageof hundreds of nodes, but those same nodes will sit idle the rest of the time.You may be new to Hadoop and want to get familiar with it first before investing ina dedicated cluster. You may own a startup that needs to conserve cash and wantsto avoid the capital expense of a Hadoop cluster. In these and other situations, itmakes more sense to rent a cluster of machines rather than buy it.You can rent computing and storage services from AWS on demand as your requirement scales.As of this writing, renting a compute unit with the equivalent power of a 1.0 GHz32-bit Opteron with 1.7 GB RAM and 160 GB disk storage costs $0.10(varies if its windows or Unix instance) per hour. Using acluster of 100 such machines for an hour will cost a measly $10!
Supported operating systems onEC2 include more than six variants of Linux, plus Windows Server and OpenSolaris.Other images include one of the operating systems plus pre-installed software, such asdatabase server, Apache HTTP server, Java application server, and others. AWS offerspreconfigured images of Hadoop running on Linux,
Load users, Load pagesFilter by ageJoin by nameGroup on URLCount click, sort clicks , get top 5
1.Under the covers pig turns the tranformations into a series of MapReduce jobs but we as programmersAre unaware of this which allows us to focus on the data rather than the nature of execution2. Pig Latin is a data flow pgoramming language where SQL is a declarative programming languate. Ping Latin program is a tep by step set of perations on an input SQL statements is a a set of statements taken together and produce output3. A script file – also pluginPigPen for eclipse exists
HBase – uses HDFS as the underlying file system. Supports of billions of rows and millions of columns
HBase – uses HDFS as the underlying file system. Supports of billions of rows and millions of columns
HBase – uses HDFS as the underlying file system. Supports of billions of rows and millions of columns
HBase – uses HDFS as the underlying file system. Supports of billions of rows and millions of columns