This document provides an overview of big data and Hadoop: what big data is, current market trends and challenges, approaches to solving big data problems (distributed computing, NoSQL, and Hadoop), and an introduction to HDFS and the MapReduce framework for distributed storage and processing of large datasets.
1. Big Data and Hadoop Overview
Saurabh Khanna
Mob: +91-8147644946
2. Agenda
Introduction to Big Data
Current market trends and challenges of Big Data
Approach to solve Big Data Problems
Introduction to Hadoop
HDFS & Map Reduce
Hadoop Cluster Introduction & Creation
Hadoop Ecosystems
3. Introduction to Big Data
“Big data is a collection of large and complex data sets that are difficult to
process using on-hand database management tools. The challenges include capture,
storage, search, sharing, analysis, and visualization.”
Or
Big data is the realization of greater business intelligence by storing, processing, and
analyzing data that was previously ignored due to the limitations of traditional data
management technologies. It is commonly characterized by the 3 V's: volume, velocity, and variety.
6. Big Data Source (2/2)
The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
9. Current market trends & challenges of Big Data (1/2)
We’re generating more data than ever
• Financial transactions
• Sensor networks
• Server logs
• Analytics
• e-mail and text messages
• Social media
And we’re generating data faster than ever
• Automation
• Ubiquitous internet connectivity
• User-generated content
For example, every day
• Twitter processes 340 million messages
• Amazon S3 storage adds more than one billion objects
• Facebook users generate 17.7 billion comments and “Likes”
10. Data is value and we must process it to extract that value
This data has many valuable applications
• Marketing analysis
• Demand forecasting
• Fraud detection
• And many, many more…
Data Access is the Bottleneck
• Although we can process data quickly, accessing it is slow, and this
is true for both reads and writes. For example:
• Reading a single 3 TB disk takes almost four hours
• We cannot process the data until we have read it
• We're limited by the speed of a single disk
• We'll see Hadoop's solution in a few moments
Disk performance has also improved over the last 15 years, but unfortunately transfer
rates haven't kept pace with capacity:
Year | Capacity (GB) | Cost per GB (USD) | Transfer rate (MB/s) | Time to read full disk
1997 | 2.1           | $157              | 16.6                 | 126 seconds
2004 | 200           | $1.05             | 56.5                 | 59 minutes
2012 | 3,000         | $0.05             | 210                  | 3 hours, 58 minutes
Current market trends & challenges of Big Data (2/2)
11. Approach to solve the Big Data problem
The pain areas explained previously lead to the problems below:
• Large-scale data storage
• Large-scale data analysis
We have the following approaches to solving Big Data problems:
• Option 1 - Distributed Computing
• Option 2 - NoSQL
• Option 3 - Hadoop
12. Distributed Computing – Option 1
Typical processing pattern
Step 1: Copy input data from storage to compute node
Step 2: Perform necessary processing
Step 3: Copy output data back to storage
This works fine with relatively small amounts of data (that is, where step 2 dominates
the overall runtime), but we have a few problems with this approach at scale:
• More time is spent copying data than actually processing it
• Getting data to the processors becomes the bottleneck
• This grows worse as more compute nodes are added
• They're competing for the same bandwidth
• Compute nodes become starved for data
• It is not fault tolerant
13. NoSQL - Option 2
NoSQL (commonly referred to as "Not Only SQL") represents a completely different
framework of databases that allows for high-performance, agile processing of information
at massive scale. In other words, it is a database infrastructure that is very well
adapted to the heavy demands of big data.
NoSQL refers to non-relational, or at least non-SQL, database solutions such
as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB,
and many others.
NoSQL centers around the concept of distributed databases, where unstructured data
may be stored across multiple processing nodes, and often across multiple servers.
This distributed architecture allows NoSQL databases to be horizontally scalable; as data
continues to explode, just add more hardware to keep up, with no slowdown in
performance.
The NoSQL distributed database infrastructure has been the solution to handling some
of the biggest data warehouses on the planet, for the likes of Google, Amazon, and the
CIA.
14. Hadoop - Option 3
Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model; any data will fit
Hadoop was started to improve the scalability of Apache Nutch
• Nutch is an open-source web search engine.
15. Main Big Data Technologies
Hadoop
• Low-cost, reliable scale-out architecture
• Distributed computing
• Proven success in Fortune 500 companies
• Exploding interest
NoSQL Databases
• Huge horizontal scaling and high availability
• Highly optimized for retrieval and appending
• Types
• Document stores
• Key-value stores
• Graph databases
Analytic RDBMS
• Optimized for bulk-load and fast aggregate query workloads
• Types
• Column-oriented
• MPP
• OLTP
• In-memory
16. Hadoop?
“Apache Hadoop is an open-source software framework for storage and large-scale
processing of data sets on clusters of commodity hardware. Hadoop is an
Apache top-level project being built and used by a global community of contributors
and users.”
Two Google whitepapers had a major influence on this effort:
• The Google File System (storage)
• MapReduce (processing)
17. Design principles of Hadoop
Invented by Yahoo (Doug Cutting)
• Process internet-scale data (search the web, store the web)
• Save costs - distribute the workload on a massively parallel system built with large numbers
of inexpensive computers
A new way of storing and processing the data:
• Let the system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Relatively inexpensive hardware ($2-4K)
• Reliability provided through replication
• Large files preferred over small
Bring processing to the data!
Hadoop = HDFS + MapReduce infrastructure
18. What is Hadoop used for?
Search: Yahoo, Amazon, Zvents
Log processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
Recommendation systems: Facebook
Data warehousing: Facebook, AOL
Video and image analysis: New York Times, Eyealike
19. Hadoop Users
Banking and financial
• JPMorgan Chase
• Bank of America
• Commonwealth Bank of Australia
Telecom
• China Mobile Corporation
Retail
• eBay
• Amazon
Manufacturing
• IBM
• Adobe
Web & Digital Media
• Facebook
• Twitter
• LinkedIn
• New York Times
20. Why Hadoop?
Handle partial hardware failures without going down:
• If a machine fails, we should be able to switch over to a standby machine
• If a disk fails - use RAID or a mirror disk
Able to recover from major failures:
• Regular backups
• Logging
• Mirror database at a different site
Capability:
• Increase capacity without restarting the whole system (Pure Scale)
• More computing power should equal faster processing
Result consistency:
• Answers should be consistent (independent of something failing) and returned in a
reasonable amount of time
21. Consider the example of Facebook.
Facebook's data had grown to 100 TB/day by 2013 and in the future will be of a much
higher magnitude.
They have many web servers and huge MySQL servers (profiles, friends, etc.) to hold the user
data.
Hadoop solution framework – A practical example (1/2)
22. Now, to run various reports on this huge data set,
For example: 1) Ratio of men vs. women users for a period.
2) Number of users who commented on a particular day.
Solution:
For this requirement they had scripts written in Python which used ETL processes.
But as the size of the data increased to this extent, these scripts did not work.
Hence their main aim at this point was to handle data warehousing, and their
home-grown solutions were not working.
This is when Hadoop came into the picture.
Hadoop solution framework – A practical example (2/2)
24. HDFS Definition
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware.
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop
framework.
It has many similarities with existing distributed file systems.
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop
applications.
HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a
cluster to enable reliable, extremely rapid computations.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides
high-throughput access to application data and is suitable for applications that have large data sets.
HDFS consists of the following components (daemons):
• Name Node
• Data Node
• Secondary Name Node
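To make the storage layer concrete, here is a minimal sketch (not from the original slides) of reading a file from HDFS through the Java FileSystem API that these daemons serve. The path and file name are hypothetical, and it assumes a cluster whose fs.default.name is configured in core-site.xml on the classpath.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from core-site.xml; the Name Node resolves the
        // path to block locations, and the Data Nodes stream the actual bytes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        InputStream in = null;
        try {
            in = fs.open(new Path("/user/demo/input/sample.txt")); // hypothetical path
            IOUtils.copyBytes(in, System.out, 4096, false);        // print file contents
        } finally {
            IOUtils.closeStream(in);
        }
    }
}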
25. HDFS Components(1/2)
Name node:
Name Node, a master server, manages the file system namespace and regulates access to files by
clients. It has following properties.
Meta-data in Memory
• The entire metadata is in main memory
• Types of Metadata
• List of files
• List of Blocks for each file
• List of Data Nodes for each block
• File attributes, e.g. creation time, replication factor
• A Transaction Log
• Records file creations, file deletions, etc.
Data Node:
A Data Node is a server where the actual data is stored; there is one per node in the cluster, and it has the
following properties:
• A Block Server
• Stores data in the local file system (e.g. ext3)
• Stores meta-data of a block (e.g. CRC)
• Serves data and meta-data to Clients
• Block Report
• Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
• Forwards data to other specified Data Nodes
26. Secondary Name Node
• It is not a hot standby or mirror node; a true failover node is planned for a future release.
• It is used for housekeeping purposes, and in case of Name Node failure we can recover data from this
node.
• It periodically takes a backup (checkpoint) of the Name Node's metadata
• Memory requirements are the same as for the Name Node (large)
• Typically runs on a separate machine in a large cluster (> 10 nodes)
• Its directory layout is the same as the Name Node's, except that it keeps the previous checkpoint version
in addition to the current one.
• It can be used to restore a failed Name Node (just copy the current directory to the new Name
Node)
HDFS Components(2/2)
29. Introduction to MapReduce Framework
“A programming model for parallel data processing. Hadoop can run MapReduce
programs written in multiple languages such as Java, Python, Ruby, and C++.”
Map function:
• Operates on a set of key/value pairs
• Map is applied in parallel to the input data set
• This produces output keys and a list of values for each key, depending on the
functionality
• Mapper output is partitioned per reducer; the number of partitions equals the number of reduce tasks for the job
Reduce function:
• Operates on a set of key/value pairs
• Reduce is then applied in parallel to each group, again producing a collection of
key/value pairs.
• The number of reducers can be set by the user.
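Conceptually, the contract between the two functions can be summarized with the signatures of the old org.apache.hadoop.mapred API that the WordCount example later in this deck uses (a sketch for orientation, not code from the original slides):

// Mapper<K1, V1, K2, V2>: map is called once per input record and may emit
// zero or more intermediate (K2, V2) pairs through the collector:
//     void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
//
// Reducer<K2, V2, K3, V3>: reduce is called once per distinct intermediate key,
// with an iterator over all values grouped under that key:
//     void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
//
// The number of reduce tasks (and hence map-output partitions) is set on the
// job configuration, e.g. conf.setNumReduceTasks(4);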
32. Map Reduce Components
JobTracker :
The Job-Tracker is responsible for accepting jobs from clients, dividing
those jobs into tasks, and assigning those tasks to be executed by worker nodes.
TaskTracker:
A Task-Tracker is a process that manages the execution of the tasks currently
assigned to its node. Each TaskTracker has a fixed number of slots for executing
tasks (two map slots and two reduce slots by default).
33. MapReduce co-located with HDFS
(Diagram: a client submits a MapReduce job to the JobTracker; the JobTracker coordinates TaskTrackers
running on slave nodes A, B, and C, each of which also hosts a DataNode; the NameNode manages HDFS metadata.)
The JobTracker and NameNode need not be on the same node.
TaskTrackers (compute nodes) and DataNodes are co-located, which gives high aggregate bandwidth
across the cluster.
34. Understanding processing in an M/R framework (1/2)
A user runs a program on the client computer.
The program submits a job to the cluster. The job contains:
• Input data
• The Map/Reduce program
• Configuration information
Two types of daemons control job execution:
• Job Tracker (master node)
• Task Trackers (slave nodes)
The job is sent to the Job Tracker; the Job Tracker then communicates with the Name Node and assigns parts of
the job to Task Trackers (a Task Tracker runs on each Data Node).
A task is a single MAP or REDUCE operation over a piece of data.
Hadoop divides the input to a MAP/REDUCE job into equal-sized splits.
The Job Tracker knows (from the Name Node) which node contains the data, and which other
machines are nearby.
Task processes send heartbeats to the Task Tracker, and the Task Tracker sends heartbeats to the Job
Tracker.
35. Any task that does not report within a certain time (default is 10 minutes) is assumed to have
failed; its JVM will be killed by the Task Tracker and the failure reported to the Job
Tracker.
The JobTracker will reschedule any failed tasks (on a different Task Tracker).
If the same task fails 4 times, the whole job fails.
Any Task Tracker reporting a high number of failed tasks on a particular node will cause
that node to be blacklisted.
The JobTracker maintains and manages the status of each job. Results from failed
tasks will be ignored.
Understanding processing in an M/R framework (2/2)
36. Computing parallelism meets data locality
All map tasks are equivalent, so they can run in parallel.
All reduce tasks can also run in parallel.
Input data on HDFS can be processed independently.
Therefore, run each map task on whatever data is local (or closest) to a particular
node in HDFS, and performance will be good.
• For map task assignment, the JobTracker has an affinity for a particular
node which has a replica of the input data
• If lots of data does happen to pile up on the same node, nearby
nodes will map instead
This also improves recovery from partial failure of servers or storage during the
operation: if one map or reduce task fails, the work can be rescheduled.
37. Programming using MapReduce
WordCount is a simple application that counts the number of occurrences of each word
in a given input file.
Here we divide the entire code into 3 files:
1) Mapper.java
2) Reducer.java
3) Basic.java
38. Mapper.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// The Mapper interface is fully qualified because this class is itself named Mapper.
public class Mapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input line; emits a (word, 1) pair for every token in the line.
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
39. For the following standard input, the mapper does the following:
Input: I am working for TCS
TCS is a great company
The Mapper implementation, via the map method, processes one line at a time, as provided
by the specified TextInputFormat. It then splits the line into tokens separated by whitespace,
via the StringTokenizer, and emits a key-value pair of <<word>, 1>.
Output: <I,1>
<am,1>
<working,1>
<for,1>
<TCS,1>
<TCS,1>
<is,1>
<a,1>
<great,1>
<company,1>
40. Sorted mapper output to reducer
Hence, the output of each map is passed through a sorting algorithm, which sorts the map output
according to the keys.
Output: <a,1>
<am,1>
<company,1>
<for,1>
<great,1>
<I,1>
<is,1>
<TCS,1>
<TCS,1>
<working,1>
41. Reducer.java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// The Reducer interface is fully qualified because this class is itself named Reducer.
public class Reducer extends MapReduceBase
    implements org.apache.hadoop.mapred.Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per distinct word; sums the 1s emitted by the mapper.
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
42. Reducer output
The output of the Mapper is given to the Reducer, which sums up the values; these are the
occurrence counts for each key (i.e. the words in this example).
Output: <a,1>
<am,1>
<company,1>
<for,1>
<great,1>
<I,1>
<is,1>
<TCS,2>
<working,1>
43. Basic.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Driver class: configures the WordCount job and submits it to the cluster.
public class Basic {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Basic.class);
    conf.setJobName("wordcount");

    // Types of the final output (and, by default, of the map output as well).
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // The Mapper and Reducer classes defined in the other two files.
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reducer.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output paths are taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
44. Executing the MapReduce program
1) Compile all 3 Java files, which will create 3 .class files
2) Package all 3 .class files into a single jar file with this command:
jar -cvf file_name.jar *.class
3) Now you just need to execute the jar file with this command:
bin/hadoop jar file_name.jar Basic input_file_name output_file_name
47. Clustering in Hadoop
Clustering in Hadoop can be achieved in the following modes:
Local (Standalone) Mode - used for debugging:
• By default, Hadoop is configured to run in a non-distributed mode, as a
single Java process. This mode is useful for debugging.
Pseudo-Distributed Mode - used for development:
• Hadoop can also be run on a single node in a pseudo-distributed mode,
where each Hadoop daemon runs in a separate Java process.
Fully-Distributed Mode - used for production:
• In this mode all Hadoop daemons run on separate nodes, and
this mode is used for production.
49. Executing a pseudo cluster
Format a new distributed file system:
$ bin/hadoop namenode -format
Start the hadoop daemons:
$ bin/start-all.sh
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -copyFromLocal input1 input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples.jar wordcount input output
Examine the output files:
Copy the output files from the distributed file system to the local file system and examine
them:
$ bin/hadoop fs -copyToLocal output output
$ cat output/part-00000
When you're done, stop the daemons with:
$ bin/stop-all.sh
52. Hadoop Ecosystems
Apache Hive - SQL-like language and metadata repository
Apache Pig - High-level language for expressing data analysis programs
Apache HBase - The Hadoop database; random, real-time read/write access
Sqoop - Integrating Hadoop with RDBMS
Oozie - Server-based workflow engine for Hadoop activities
Hue - Browser-based desktop interface for interacting with Hadoop
Flume - Distributed service for collecting and aggregating log and event data
Apache Whirr - Library for running Hadoop in the cloud
Apache ZooKeeper - Highly reliable distributed coordination service
53. Pig Concepts
What is Pig ?
It is an open-source, high-level dataflow system
introduced by Yahoo.
Provides a simple language for queries and data
manipulation, Pig Latin, that is compiled into map-
reduce jobs that are run on Hadoop
Pig Latin combines the high-level data manipulation
constructs of SQL with the procedural programming of
map-reduce
Why is it important?
Companies and organizations like Yahoo, Google and
Microsoft are collecting enormous data sets in the form
of click streams, search logs, and web crawls
Some form of ad-hoc processing and analysis of all of this
information is required
55. Hive Concepts
What is Hive ?
It is an open-source data warehousing solution built on top of Hadoop,
introduced by Facebook.
It supports a SQL-like declarative language called HiveQL, which
is compiled into map-reduce jobs executed on Hadoop.
It also supports custom map-reduce scripts to be plugged into
queries.
Includes a system catalog, Hive Metastore for query
optimizations and data exploration
Why is it important?
It is very easy to learn because it behaves similarly to
SQL.
Built-in user defined functions (UDFs) to manipulate dates,
strings, and other data-mining tools. Hive supports
extending the UDF set to handle use-cases not supported by
built-in functions.
56. Hive execution plan
Clients (via the CLI, JDBC, or ODBC) submit HiveQL to the Driver, which invokes the Compiler to
produce a DAG of MapReduce jobs; the Execution Engine then runs those jobs on Hadoop.
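As a concrete illustration of the JDBC client path above, here is a minimal, hypothetical sketch of submitting a HiveQL query from Java. The driver class name, connection URL, port, and table name are assumptions that depend on the Hive version and deployment; this sketch targets the classic HiveServer interface (HiveServer2 uses org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL instead).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Driver class and URL for the classic HiveServer (assumed deployment).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

        Statement stmt = con.createStatement();
        // The HiveQL below is compiled by the Driver/Compiler into MapReduce jobs;
        // the 'words' table is hypothetical.
        ResultSet rs = stmt.executeQuery("SELECT word, COUNT(*) FROM words GROUP BY word");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}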
57. Difference between Pig & Hive
Apache Pig and Hive are two projects that layer on top of
Hadoop, and provide a higher-level language for using
Hadoop's MapReduce library
Pig provides a scripting language for describing operations like
reading, filtering, transforming, joining, and writing data.
If Pig is "scripting for Hadoop", then Hive is "SQL queries for
Hadoop".
Apache Hive offers an even more specific, higher-level
language for querying data by running Hadoop jobs, rather
than directly scripting, step by step, the operation of several
MapReduce jobs on Hadoop.
Hive is an excellent tool for analysts and business development
types who are accustomed to SQL-like queries and Business
Intelligence systems.
Pig lets users express these operations in a language not unlike a bash or
Perl script.
58. HBase
Apache HBase in a few words:
“HBase is an open-source, distributed, versioned, column-oriented
store modeled after Google's Bigtable”
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning
that the database isn't an RDBMS which supports SQL as its primary
access language, but there are many types of NoSQL databases.
HBase is very much a distributed database. Technically speaking, HBase
is really more a "Data Store" than "Data Base" because it lacks many of
the features you find in an RDBMS, such as typed columns, secondary
indexes, triggers, and advanced query languages, etc.
However, HBase has many features which support both linear and
modular scaling.
HBase supports an easy to use Java API for programmatic access.
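Since HBase exposes a Java API for programmatic access, a minimal sketch of a put and a get against the classic HTable client is shown below; the table name ('users'), column family ('info'), row key, and values are hypothetical, and the exact client classes vary with the HBase version (this sketch assumes the pre-1.0 HTable API).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Assumes a table 'users' with a column family 'info' already exists.
        HTable table = new HTable(conf, "users");

        // Write one cell: row key "row1", column info:name, value "alice".
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);

        // Random, real-time read of the same row.
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println("name = " + Bytes.toString(value));

        table.close();
    }
}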
59. Why is it important?
HBase is a Bigtable clone.
It is open source
It has a good community and promise for the future
It is developed on top of and has good integration for the Hadoop
platform, if you are using Hadoop already.
It has a Cascading connector.
No real indexes
Automatic partitioning
Scale linearly and automatically with new nodes
Commodity hardware
Fault tolerance
Batch processing
60. Difference between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited for the storage of
large files. Its documentation states that it is not, however, a general
purpose file system, and does not provide fast individual record lookups
in files.
HBase, on the other hand, is built on top of HDFS and provides fast
record lookups (and updates) for large tables. This can sometimes be a
point of conceptual confusion. HBase internally puts your data in
indexed "StoreFiles" that exist on HDFS for high-speed lookups.