Big Data and Hadoop Overview
Saurabh Khanna
Mob: +91-8147644946
Agenda
 Introduction to Big Data
 Current market trends and challenges of Big Data
 Approach to solve Big Data Problems
 Introduction to Hadoop
 HDFS & Map Reduce
 Hadoop Cluster Introduction & Creation
 Hadoop Ecosystems
Introduction to Big Data
“Big data is a collection of data sets so large and complex that they become difficult to
process using on-hand database management tools. The challenges include capture,
storage, search, sharing, analysis, and visualization”
Or
Big data is the realization of greater business intelligence by storing, processing, and
analyzing data that was previously ignored due to the limitations of traditional data
management technologies. And it has 3 V’s (Volume, Velocity, Variety)
Some make it 4 V’s
Big Data Source (1/2)
Big Data Source (2/2)
The Model of Generating/Consuming Data has Changed
Old Model: a few companies generate data; all others consume it
New Model: all of us generate data, and all of us consume it
Big Data Growth
Expectation from Big Data
Current market trends & challenges of Big Data (1/2)
We’re generating more data than ever
• Financial transactions
• Sensor networks
• Server logs
• Analytics
• e-mail and text messages
• Social media
And we’re generating data faster than ever
• Automation
• Ubiquitous internet connectivity
• User-generated content
For example, every day
• Twitter processes 340 million messages
• Amazon S3 storage adds more than one billion objects
• Facebook users generate 17.7 billion comments and “Likes”
 Data is value, and we must process it to extract that value
This data has many valuable applications
• Marketing analysis
• Demand forecasting
• Fraud detection
• And many, many more…
 Data Access is the Bottleneck
• Although we can process data quickly, access is very slow, and this is true
for both reads and writes. For example:
• Reading a single 3 TB disk takes almost four hours (3,000,000 MB ÷ 210 MB/s ≈ 14,300 s)
• We cannot process the data until we have read it
• We’re limited by the speed of a single disk
• We’ll see Hadoop’s solution in a few moments
 Disk performance has also increased in the last 15 years, but unfortunately transfer
rates haven’t kept pace with capacity:

Year   Capacity   Cost per GB (USD)   Transfer Rate (MB/s)   Disk Read Time
1997   2.1 GB     $157                16.6                   126 seconds
2004   200 GB     $1.05               56.5                   59 minutes
2012   3 TB       $0.05               210                    3 hours, 58 minutes
Current market trends & challenges of Big Data (2/2)
Approach to solve Big Data problem
 Previously explained pain areas lead to below problems
• Large-scale data storage
• Large-scale data analysis
There are three options we have to solve Big Data problems:
• Option 1 - Distributed Computing
• Option 2 - NoSQL
• Option 3 - Hadoop
Distributed Computing – Option 1
Typical processing pattern
Step 1: Copy input data from storage to compute node
Step 2: Perform necessary processing
Step 3: Copy output data back to storage
This works fine with relatively small amounts of data, that is, where step 2 dominates
the overall runtime. At scale we have problems with this approach:
• More time is spent copying data than actually processing it
• Getting data to the processors is the bottleneck
• It grows worse as more compute nodes are added
• They’re competing for the same bandwidth
• Compute nodes become starved for data
• It is not fault tolerant
NoSQL - Option 2
 NoSQL (commonly referred to as "Not Only SQL") represents a completely different
framework of databases that allows for high-performance, agile processing of information
at massive scale. In other words, it is a database infrastructure that has been very well-
adapted to the heavy demands of big data.
 NoSQL refers to non-relational, or at least non-SQL, database solutions such
as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB,
and many others.
 NoSQL centers around the concept of distributed databases, where unstructured data
may be stored across multiple processing nodes, and often across multiple servers.
 This distributed architecture allows NoSQL databases to be horizontally scalable; as data
continues to explode, just add more hardware to keep up, with no slowdown in
performance.
 The NoSQL distributed database infrastructure has been the solution to handling some
of the biggest data warehouses on the planet, for the likes of Google, Amazon, and the
CIA.
Hadoop - Option 3
 Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of nodes
 Hadoop is an open-source implementation of Google’s MapReduce
 Hadoop is based on a simple programming model called MapReduce
 Hadoop is based on a simple data model: any data will fit
 Hadoop was started to improve the scalability of Apache Nutch
• Nutch is an open-source Web search engine.
Main Big Data Technology
Hadoop
• Low-cost, reliable scale-out architecture
• Distributed computing
• Proven success in Fortune 500 companies
• Exploding interest
NoSQL Databases
• Huge horizontal scaling and high availability
• Highly optimized for retrieval and appending
• Types
• Document stores
• Key-Value stores
• Graph databases
Analytic RDBMS
• Optimized for bulk-load and fast aggregate query workloads
• Types
• Column-oriented
• MPP
• OLTP
• In-memory
Hadoop?
“Apache Hadoop is an open-source software framework for storage and large-
scale processing of data-sets on clusters of commodity hardware. Hadoop is an
Apache top-level project being built and used by a global community of contributors
and users.”
Two Google whitepapers had a major influence on this effort
• The Google File System (storage)
• MapReduce (processing)
Design principles of Hadoop
 Created by Doug Cutting at Yahoo
• Process internet-scale data (search the web, store the web)
• Save costs - a distributed workload on a massively parallel system built with large numbers
of inexpensive computers
 New way of storing and processing the data:
• Let the system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Relatively inexpensive hardware ($2–4K)
• Reliability provided through replication
• Large files preferred over small
• Large files preferred over small
 Bring processing to Data!
Hadoop = HDFS + Map / Reduce infrastructure
What is Hadoop used for?
 Search
Yahoo, Amazon, Zvents
 Log processing
Facebook, Yahoo, ContextWeb, Joost, Last.fm
 Recommendation Systems
Facebook
 Data Warehouse
Facebook, AOL
 Video and Image Analysis
New York Times, Eyealike
Hadoop Users
 Banking and financial
• JPMorgan Chase
• Bank of America
• Commonwealth Bank of Australia
 Telecom
• China Mobile Corporation
 Retail
• eBay
• Amazon
 Manufacturing
• IBM
• Adobe
 Web & Digital Media
• Facebook
• Twitter
• LinkedIn
• New York Times
Why Hadoop?
 Handle partial hardware failures without going down:
• If a machine fails, we should switch over to a standby machine
• If a disk fails, use RAID or a mirror disk
 Ability to recover from major failures:
• Regular backups
• Logging
• Mirror database at a different site
 Scalability:
• Increase capacity without restarting the whole system (Pure Scale)
• More computing power should mean faster processing
 Result consistency:
• Answers should be consistent (independent of something failing) and returned in a
reasonable amount of time
Consider the example of Facebook.
Facebook’s data had grown to 100 TB/day by 2013 and will in future reach a much
higher magnitude.
They have many web servers and huge MySQL servers (profiles, friends, etc.) to hold the user
data.
Hadoop solution framework – A practical example (1/2)
Now, to run various reports on this huge data, for example:
1) Ratio of men vs. women users for a period
2) Number of users who commented on a particular day
Solution:
For this requirement they had scripts written in Python that used ETL processes.
But as the size of the data increased to this extent, these scripts did not work.
Hence their main aim at this point was to handle data warehousing, and their
home-grown solutions were not working.
This is when Hadoop came into the picture.
Hadoop solution framework – A practical example (2/2)
Hadoop Distributed File System (HDFS)
Agenda
HDFS Definition
Architecture
HDFS Components
HDFS Definition
 The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware.
 HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop
framework.
 It has many similarities with existing distributed file systems.
 Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop
applications.
 HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a
cluster to enable reliable, extremely rapid computations.
 HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides
high-throughput access to application data and is suitable for applications that have large data sets.
 HDFS consists of the following components (daemons):
• Name Node
• Data Node
• Secondary Name Node
HDFS Components(1/2)
 Name Node:
The Name Node, a master server, manages the file system namespace and regulates access to files by
clients. It has the following properties.
 Metadata in Memory
• The entire metadata is in main memory
• Types of metadata:
• List of files
• List of blocks for each file
• List of Data Nodes for each block
• File attributes, e.g. creation time, replication factor
• A Transaction Log
• Records file creations, file deletions, etc.
 Data Node:
A Data Node is a server that stores the actual data; there should be one per node. It has the following
properties:
• A Block Server
• Stores data in the local file system (e.g. ext3)
• Stores meta-data of a block (e.g. CRC)
• Serves data and meta-data to Clients
• Block Report
• Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
• Forwards data to other specified Data Nodes
 Secondary Name Node
• It is not a hot stand-by or mirror node; a true failover node may come in a future release.
• It is used for housekeeping purposes, and in case of Name Node failure we can recover data from this
node.
• It periodically takes a backup (checkpoint) of the Name Node metadata
• Memory requirements are the same as the Name Node (big)
• Typically on a separate machine in a large cluster (> 10 nodes)
• Its directory layout is the same as the Name Node’s, except it also keeps the previous checkpoint
version in addition to the current one
• It can be used to restore a failed Name Node (just copy the current directory to the new Name
Node)
HDFS Components(2/2)
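To make these component roles concrete, here is a minimal sketch (not from the original slides) of writing and reading a file through the HDFS Java API of that era; the fs.default.name address is illustrative and would normally come from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:9000"); // illustrative address
    FileSystem fs = FileSystem.get(conf);
    // Write: the client asks the Name Node for target Data Nodes, then streams
    // block data to them directly (pipelined replication)
    Path file = new Path("/user/demo/hello.txt");
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("Hello HDFS\n");
    out.close();
    // Read: the Name Node returns block locations; data is served by Data Nodes
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());
    in.close();
  }
}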
MapReduce Framework
Agenda
Introduction
Application
Components
Understanding the Processing Logic
Introduction to MapReduce Framework
“A programming model for parallel data processing. Hadoop can run map reduce
programs in multiple languages like Java, Python, Ruby and C++.“
 Map function:
• Operates on a set of key-value pairs
• Map is applied in parallel on the input data set
• This produces output keys and a list of values for each key, depending upon the
functionality
• Mapper output is partitioned per reducer; the number of partitions equals the number of
reduce tasks for that job
 Reduce function:
• Operates on a set of key-value pairs
• Reduce is then applied in parallel to each group, again producing a collection of
key-value pairs
• The number of reducers can be set by the user
How does a map-reduce algorithm work (1/2)
How does a map-reduce algorithm work (2/2)
Map Reduce Components
 JobTracker:
The JobTracker is responsible for accepting jobs from clients, dividing
those jobs into tasks, and assigning those tasks to be executed by worker nodes.
 TaskTracker:
A TaskTracker is a process that manages the execution of the tasks currently
assigned to its node. Each TaskTracker has a fixed number of slots for executing
tasks (two maps and two reduces by default).
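The slot counts are configurable per TaskTracker; a hedged sketch of the relevant Hadoop 1.x properties in mapred-site.xml (the values shown are the defaults):

<configuration>
  <property>
    <!-- Maximum number of map tasks run simultaneously by a TaskTracker -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <!-- Maximum number of reduce tasks run simultaneously by a TaskTracker -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>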
MapReduce co-located with HDFS
(Diagram: a client submits a MapReduce job to the JobTracker; slave nodes A, B, and C each run
a TaskTracker alongside a DataNode, with the NameNode elsewhere. The JobTracker and NameNode
need not be on the same node. Co-locating TaskTrackers (compute nodes) with DataNodes gives
high aggregate bandwidth across the cluster.)
Understanding processing in a M/R framework (1/2)
 User runs a program on the client computer
 The program submits a job to the cluster. A job contains:
• Input data (on HDFS)
• The Map / Reduce program
• Configuration information
 Two types of daemons that control job execution:
• Job Tracker (master node)
• Task Trackers (slave nodes)
 Job sent to Job Tracker then Job Tracker communicates with Name Node and assigns parts of
job to Task Trackers (Task Tracker is run on each Data Node)
 Task is a single MAP or REDUCE operation over piece of data
 Hadoop divides the input to MAP / REDUCE job into equal splits
 The Job Tracker knows (from Name Node) which node contains the data, and which other
machines are nearby.
 Task processes send heartbeats to Task Tracker and Task Tracker sends heartbeats to the Job
Tracker.
 Any task that does not report within a certain time (default is 10 min) is assumed
to have failed, and its JVM will be killed by the Task Tracker and reported to the Job
Tracker.
 The Job Tracker will reschedule any failed tasks (with a different Task Tracker)
 If the same task fails 4 times, the whole job fails
 Any Task Tracker reporting a high number of failed tasks on a particular node will
cause that node to be blacklisted
 The Job Tracker maintains and manages the status of each job. Results from failed
tasks will be ignored
Understanding processing in a M/R framework (2/2)
Computing parallelism meets data locality
 All map tasks are equivalent, so they can run in parallel
 All reduce tasks can also run in parallel
 Input data on HDFS can be processed independently
 Therefore, run each map task on whatever data is local (or closest) to a particular
node in HDFS, and performance will be good
• For map task assignment, the Job Tracker has an affinity for a particular
node which has a replica of the input data
• If lots of data does happen to pile up on the same node, nearby
nodes will map instead
 This also improves recovery from partial failure of servers or storage during the
operation: if one map or reduce task fails, the work can be rescheduled
Programming using MapReduce
WordCount is a simple application that counts the number of occurrences of each word
in a given input file.
Here we divide the entire code into 3 files:
1) Mapper.java
2) Reducer.java
3) Basic.java
Mapper.java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class Mapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  // The interface is fully qualified because this class shadows the imported Mapper
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    // Split the input line into tokens and emit a <word, 1> pair for each token
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
For the following standard input, the mapper does the following.
Input: I am working for TCS
TCS is a great company
The Mapper implementation, via the map method, processes one line at a time, as provided
by the specified TextInputFormat. It then splits the line into tokens separated by whitespace,
via the StringTokenizer, and emits a key-value pair of <<word>, 1>.
Output: <I,1>
<am,1>
<working,1>
<for,1>
<TCS,1>
<TCS,1>
<is,1>
<a,1>
<great,1>
<company,1>
Sorted mapper output to reducer
Hence, the output of each map is passed through a sorting algorithm which sorts the output of the Map
according to the keys.
Output: <a,1>
<am,1>
<company,1>
<for,1>
<great,1>
<I,1>
<is,1>
<TCS,1>
<TCS,1>
<working,1>
Reducer.java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class Reducer extends MapReduceBase
    implements org.apache.hadoop.mapred.Reducer<Text, IntWritable, Text, IntWritable> {
  // The interface is fully qualified because this class shadows the imported Reducer
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Sum all the counts emitted for this key and emit <word, total>
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Reducer output
 The output of the Mapper is given to the Reducer, which sums up the values, i.e. the
occurrence counts for each key (the words, in this example).
Output: <a,1>
<am,1>
<company,1>
<for,1>
<great,1>
<I,1>
<is,1>
<TCS,2>
<working,1>
Basic.java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
// Driver: configures and submits the job (it does not extend MapReduceBase or implement Reducer)
public class Basic {
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Basic.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Mapper.class);
conf.setReducerClass(Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Executing the MapReduce program
1) Compile all 3 Java files, which will create 3 .class files (a typical compile command is sketched below)
2) Add all 3 .class files into 1 single jar file by writing this command:
jar -cvf file_name.jar *.class
3) Now you just need to execute the single jar file by writing this command:
bin/hadoop jar file_name.jar Basic input_file_name output_file_name
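For step 1, a typical compile line on a Hadoop 1.x installation looks like the following; the JAR name and HADOOP_HOME path are illustrative and depend on your version:

$ javac -classpath $HADOOP_HOME/hadoop-core-*.jar *.java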
Hadoop Clusters
Agenda
Cluster Concepts
Installing Hadoop
Creating a pseudo cluster
Clustering in Hadoop
 Clustering in Hadoop can be achieved in the following modes
 Local (Standalone) Mode - used for debugging:
• By default, Hadoop is configured to run in a non-distributed mode, as a
single Java process. This mode is useful for debugging.
 Pseudo-Distributed Mode - used for development:
• Hadoop can also be run on a single node in a pseudo-distributed mode
where each Hadoop daemon runs in a separate Java process
 Fully-Distributed Mode - used for debugging, development, and
production:
• In this mode all Hadoop daemons run on separate nodes, and
it is used for production.
Pseudo-distributed mode configuration
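The slide showed the standard pseudo-distributed settings; a hedged reconstruction for Hadoop 1.x follows (localhost:9000 and localhost:9001 are the conventional quick-start values, adjust to your setup):

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <!-- A single replica is enough on a one-node cluster -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>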
Executing a pseudo cluster
 Format a new distributed file system:
$ bin/hadoop namenode -format
 Start the Hadoop daemons:
$ bin/start-all.sh
 Copy the input files into the distributed file system:
$ bin/hadoop fs -copyFromLocal input1 input
 Run some of the examples provided:
$ bin/hadoop jar hadoop-examples.jar wordcount input output
Examine the output files:
 Copy the output files from the distributed file system to the local file system and examine
them:
$ bin/hadoop fs -copyToLocal output output
$ cat output/part-00000
 When you're done, stop the daemons with:
$ bin/stop-all.sh
Questions?
Hadoop Ecosystems
Agenda
 Pig Concepts
 Hive Concepts
 HBase Concepts
Hadoop Ecosystems
 Apache Hive - SQL-like language and metadata repository
 Apache Pig - High-level language for expressing data analysis programs
 Apache HBase - The Hadoop database: random, real-time read/write access
 Apache Zookeeper - Highly reliable distributed coordination service
 Apache Whirr - Library for running Hadoop in the cloud
 Flume - Distributed service for collecting and aggregating log and event data
 Hue - Browser-based desktop interface for interacting with Hadoop
 Oozie - Server-based workflow engine for Hadoop activities
 Sqoop - Integrating Hadoop with RDBMS
Pig Concepts
What is Pig ?
 It is an open-source, high-level dataflow system
introduced by Yahoo
 Provides a simple language for queries and data
manipulation, Pig Latin, that is compiled into map-reduce
jobs that are run on Hadoop (a sketch follows the
execution-plan outline below)
 Pig Latin combines the high-level data manipulation
constructs of SQL with the procedural programming of
map-reduce
Why is it important?
 Companies and organizations like Yahoo, Google and
Microsoft are collecting enormous data sets in the form
of click streams, search logs, and web crawls
 Some form of ad-hoc processing and analysis of all of this
information is required
Pig execution plan
(Diagram: a Pig Latin script compiles into a DAG of operators; here, two LOADs, with one
branch passed through FILTER, feed a JOIN, followed by GROUP, FOREACH, and finally STORE.)
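A hedged sketch of a Pig Latin script that would produce such a plan (file names and fields are made up for illustration):

users = LOAD 'users.txt' AS (user:chararray, age:int);
pages = LOAD 'pages.txt' AS (user:chararray, url:chararray);
adults = FILTER users BY age >= 18;
joined = JOIN adults BY user, pages BY user;
grouped = GROUP joined BY url;
counts = FOREACH grouped GENERATE group AS url, COUNT(joined) AS hits;
STORE counts INTO 'url_hits';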
Hive Concepts
What is Hive?
 It is an open-source data warehouse solution built on top of Hadoop,
introduced by Facebook
 Supports a SQL-like declarative language called HiveQL, which is
compiled into map-reduce jobs executed on Hadoop (see the sketch
after this list)
 Also supports custom map-reduce scripts to be plugged into
queries
 Includes a system catalog, the Hive Metastore, for query
optimization and data exploration
Why is it important?
 It is very easy to learn because of its similar behavior to
SQL.
 Built-in user-defined functions (UDFs) manipulate dates,
strings, and other data-mining tools. Hive supports
extending the UDF set to handle use cases not supported by
built-in functions.
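A hedged HiveQL sketch (the table and column names are illustrative):

-- Create a table over tab-delimited files, load data, and run an aggregate
-- query; Hive compiles the SELECT into one or more map-reduce jobs
CREATE TABLE page_views (user STRING, url STRING, view_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/demo/page_views.tsv' INTO TABLE page_views;

SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;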
Hive execution plan
(Diagram: clients submit HiveQL via the CLI, JDBC, or ODBC; the Driver invokes the Compiler,
which produces a DAG of map-reduce jobs that the execution engine runs on Hadoop.)
Difference between Pig & Hive
 Apache Pig and Hive are two projects that layer on top of
Hadoop, and provide a higher-level language for using
Hadoop's MapReduce library
 Pig provides a scripting language for describing operations like
reading, filtering, transforming, joining, and writing data.
 If Pig is "scripting for Hadoop", then Hive is "SQL queries for
Hadoop".
 Apache Hive offers an even more specific and higher-level
language, for querying data by running Hadoop jobs, rather
than directly scripting step-by-step the operation of several
Map Reduce jobs on Hadoop.
 Hive is an excellent tool for analysts and business development
types who are accustomed to SQL-like queries and Business
Intelligence systems.
 Pig lets users express these operations in a language not unlike a bash or
Perl script.
HBase
 Apache HBase in a few words:
“HBase is an open-source, distributed, versioned, column-oriented
store modeled after Google's Bigtable”
 HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning
that the database isn't an RDBMS which supports SQL as its primary
access language, but there are many types of NoSQL databases.
 HBase is very much a distributed database. Technically speaking, HBase
is really more a "Data Store" than "Data Base" because it lacks many of
the features you find in an RDBMS, such as typed columns, secondary
indexes, triggers, and advanced query languages, etc.
 However, HBase has many features which supports both linear and
modular scaling.
 HBase supports an easy-to-use Java API for programmatic access (a sketch follows).
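A hedged sketch of that API using the classic (pre-1.0) HTable client; the 'users' table and 'info' column family are illustrative and assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users"); // assumes table 'users' exists
    // Random, real-time write: one cell under row key "row1"
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Saurabh"));
    table.put(put);
    // Random, real-time read of the same cell
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}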
Why is it important?
 HBase is a Bigtable clone.
 It is open source
 It has a good community and promise for the future
 It is developed on top of, and has good integration with, the Hadoop
platform, if you are using Hadoop already.
 It has a Cascading connector.
 No real indexes
 Automatic partitioning
 Scale linearly and automatically with new nodes
 Commodity hardware
 Fault tolerance
 Batch processing
Difference between HBase and Hadoop/HDFS?
 HDFS is a distributed file system that is well suited for the storage of
large files. Its documentation states that it is not, however, a general
purpose file system, and does not provide fast individual record lookups
in files.
 HBase, on the other hand, is built on top of HDFS and provides fast
record lookups (and updates) for large tables. This can sometimes be a
point of conceptual confusion. HBase internally puts your data in
indexed "StoreFiles" that exist on HDFS for high-speed lookups.
Questions?
Thank You

Contenu connexe

Tendances

Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology LandscapeShivanandaVSeeri
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Daniel Abadi
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataWANdisco Plc
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesDaniel Abadi
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with ExamplesJoe McTee
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop IntroductionDzung Nguyen
 

Tendances (20)

Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and Opportunities
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with Examples
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 

En vedette

Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Machu Picchu
Machu PicchuMachu Picchu
Machu PicchuNickie
 
1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan
1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan
1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzanleirexilban
 
Estructura Curricular1
Estructura Curricular1Estructura Curricular1
Estructura Curricular1guestcef6a3e
 
Présentation 15min
Présentation 15minPrésentation 15min
Présentation 15ming31412
 
Unidade3 recursonovaescola mariagenilde
Unidade3 recursonovaescola mariagenildeUnidade3 recursonovaescola mariagenilde
Unidade3 recursonovaescola mariagenildemgenilde
 
Xerrada guifi.net AVV Barri Segle XX de Terrassa
Xerrada guifi.net AVV Barri Segle XX de TerrassaXerrada guifi.net AVV Barri Segle XX de Terrassa
Xerrada guifi.net AVV Barri Segle XX de Terrassalliurealbir
 
Voor de curator valt veel te winnen bij een overeengekomen verpandingsverbod
Voor de curator valt veel te winnen bij een overeengekomen verpandingsverbodVoor de curator valt veel te winnen bij een overeengekomen verpandingsverbod
Voor de curator valt veel te winnen bij een overeengekomen verpandingsverbodEvert Baart
 
tema de Seminario: lacontaminacion de la tierra causas y consecuencias
tema de Seminario: lacontaminacion de la tierra causas y consecuenciastema de Seminario: lacontaminacion de la tierra causas y consecuencias
tema de Seminario: lacontaminacion de la tierra causas y consecuenciaselizabeth fuentes
 
Clínica Internacional | Hemorroides: síntomas, causas y tratamiento
Clínica Internacional | Hemorroides: síntomas, causas y tratamientoClínica Internacional | Hemorroides: síntomas, causas y tratamiento
Clínica Internacional | Hemorroides: síntomas, causas y tratamientoClínica Internacional
 
Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...
Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...
Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...InsideScientific
 
IBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBM
IBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBMIBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBM
IBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBMInternet World
 

En vedette (20)

Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
9 2
9 29 2
9 2
 
Machu Picchu
Machu PicchuMachu Picchu
Machu Picchu
 
1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan
1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan
1. zertarako edo nola erabil daitezke ik tak lehen hezkuntzan
 
Estructura Curricular1
Estructura Curricular1Estructura Curricular1
Estructura Curricular1
 
Présentation 15min
Présentation 15minPrésentation 15min
Présentation 15min
 
Unidade3 recursonovaescola mariagenilde
Unidade3 recursonovaescola mariagenildeUnidade3 recursonovaescola mariagenilde
Unidade3 recursonovaescola mariagenilde
 
Star construction
Star constructionStar construction
Star construction
 
Xerrada guifi.net AVV Barri Segle XX de Terrassa
Xerrada guifi.net AVV Barri Segle XX de TerrassaXerrada guifi.net AVV Barri Segle XX de Terrassa
Xerrada guifi.net AVV Barri Segle XX de Terrassa
 
Voor de curator valt veel te winnen bij een overeengekomen verpandingsverbod
Voor de curator valt veel te winnen bij een overeengekomen verpandingsverbodVoor de curator valt veel te winnen bij een overeengekomen verpandingsverbod
Voor de curator valt veel te winnen bij een overeengekomen verpandingsverbod
 
tema de Seminario: lacontaminacion de la tierra causas y consecuencias
tema de Seminario: lacontaminacion de la tierra causas y consecuenciastema de Seminario: lacontaminacion de la tierra causas y consecuencias
tema de Seminario: lacontaminacion de la tierra causas y consecuencias
 
Clínica Internacional | Hemorroides: síntomas, causas y tratamiento
Clínica Internacional | Hemorroides: síntomas, causas y tratamientoClínica Internacional | Hemorroides: síntomas, causas y tratamiento
Clínica Internacional | Hemorroides: síntomas, causas y tratamiento
 
Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...
Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...
Best Practices to Achieve Quality Pressure-Volume Loop Data in Large Animal M...
 
IBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBM
IBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBMIBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBM
IBM's big data seminar programme -moving beyond Hadoop - Ian Radmore, IBM
 
Anju
AnjuAnju
Anju
 

Similaire à Big data and hadoop overvew

Similaire à Big data and hadoop overvew (20)

Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Chapter2.pdf
Chapter2.pdfChapter2.pdf
Chapter2.pdf
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 

Dernier

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Dernier (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Big data and hadoop overvew

  • 1. Big Data and Hadoop Overview Saurabh Khanna Mob: +91-8147644946
  • 2. Agenda  Introduction to Big Data  Current market trends and challenges of Big Data  Approach to solve Big Data Problems  Introduction to Hadoop  HDFS & Map Reduce  Hadoop Cluster Introduction & Creation  Hadoop Ecosystems
  • 3. Introduction to Big Data “Big data is a collection of large and complex data sets that it becomes difficult to process using on-hand database management tools.The challenges include capture, storage, search, sharing, analysis, and visualization” Or Big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.And it has 3V’s
  • 4. Some Make it 4V’s
  • 6. Big Data Source (2/2) The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data
  • 9. Current market trends & challenges of Big Data (1/2) We’re generating more data than ever • Financial transactions • Sensor networks • Server logs • Analytic • e-mail and text messages • Social media And we’re generating data faster than ever • Automation • Ubiquitous internet connectivity • User-generated content For example, every day • Twitter processes 340 million messages • Amazon S3 storage adds more than one billion objects • Facebook users generate 17.7 billion comments and “Likes”
  • 10.  Data isValue and we must process it to extract that value This data has many valuable applications • Marketing analysis • Demand forecasting • Fraud detection • And many, many more…  Data Access is the Bottleneck • Although we can process data more quickly but accessing is very slow and this is true for both reads and writes. For example • Reading a single 3TB disk takes almost four hours • We cannot process the data till we’ve read the data • We’re limited by the speed of a single disk • We’ll see Hadoop’s solution in a few moments  Disk performance has also increased in the last 15 years but unfortunately, transfer rates haven’t kept pace with capacity Year Capacity (GB) Cost per GB (USD) Transfer Rate (MB/S) Disk Read Time 1997 2.1 $157 16.6 126 seconds 2004 2000 $1.05 56.5 59 minutes 2012 3,000 $0.5 210 3 hours,58 minutes Current market trends & challenges of Big Data (2/2)
  • 11. Approach to solve Big Data problem  Previously explained pain areas lead to below problems • Large-scale data storage • Large-scale data analysis There are following approach we have to solve Big Data problems • Option 1 - Distributed Computing • Option 2 - NoSQL • Option 3 - HDFS
  • 12. Distributed Computing – Option 1 Typical processing pattern Step 1: Copy input data from storage to compute node Step 2: Perform necessary processing Step 3: Copy output data back to storage This works fine with relatively amounts of data but we have few problems with this approach • That is, where step 2 dominates overall runtime • More time spent copying data than actually processing it • Getting data to the processors is the bottleneck • Grows worse as more compute nodes are added • They’re competing for the same bandwidth • Compute nodes become starved for data • It is not fault tolerance
  • 13. NoSQL - Option 2  NoSQL (commonly referred to as "Not Only SQL") represents a completely different framework of databases that allows for high-performance, agile processing of information at massive scale. In other words, it is a database infrastructure that has been very well- adapted to the heavy demands of big data.  NoSQL is referring to non-relational or at least non-SQL database solutions such as HBase (also a part of the Hadoop ecosystem like Casandra, Mongo DB, Riak, Couch DB, and many others.  NoSQL centers around the concept of distributed databases, where unstructured data may be stored across multiple processing nodes, and often across multiple servers.  This distributed architecture allows NoSQL databases to be horizontally scalable; as data continues to explode, just add more hardware to keep up, with no slowdown in performance.  The NoSQL distributed database infrastructure has been the solution to handling some of the biggest data warehouses on the planet – i.e. the likes of Google,Amazon, and the CIA.
  • 14. Hadoop - Option 3  Hadoop is a software framework for distributed processing of large datasets across large clusters of computers Large datasets  Terabytes or petabytes of data Large clusters  hundreds or thousands of nodes  Hadoop is open-source implementation for Google MapReduce  Hadoop is based on a simple programming model called MapReduce  Hadoop is based on a simple data model, any data will fit  Hadoop was started to improve scalability of Apache Nutch • Nutch is an open source Web search engine.
  • 15. Main Big dataTechnology Hadoop NoSQL Databases Analytic Databases Hadoop • Low cost, reliable scale-out architecture • Distributed computing Proven success in Fortune 500 companies • Exploding interest NoSQL Databases • Huge horizontal scaling and high availability • Highly optimized for retrieval and appending • Types • Document stores • Key Value stores • Graph databases Analytic RDBMS • Optimized for bulk-load and fast aggregate query workloads • Types • Column-oriented • MPP • OLTP • In-memory
  • 16. Hadoop ? “Apache Hadoop is an open-source software framework for storage and large- scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users.” Two Google whitepapers had a major influence on this effort • The Google File System (storage) • Map Reduce (processing)
  • 17. Design principles of Hadoop  Invented byYahoo (Doug Cutting) • Process internet scale data (search the web, store the web) • Save costs - distributed workload on massively parallel system build with large numbers of inexpensive computers  New way of storing and processing the data: • Let system handle most of the issues automatically: • Failures • Scalability • Reduce communications • Distribute data and processing power to where the data is • Make parallelism part of operating system • Relatively inexpensive hardware ($2 – 4K) • Reliability provided though replication • Large files preferred over small  Bring processing to Data! Hadoop = HDFS + Map / Reduce infrastructure
  • 18.  Search Yahoo,Amazon, Zvents  Log processing Facebook,Yahoo, ContextWeb. Joost, Last.fm  Recommendation Systems Facebook  DataWarehouse Facebook,AOL  Video and Image Analysis NewYorkTimes, Eyealike What is Hadoop used for?
  • 19. Hadoop Users  Banking and financial • JP Morgan and Chase • Bank of America • Commonwealth bank of Australia  Telecom • China Mobile Corporation  Retail • E-bay • Amazon  Manufacturing • IBM • ADOBE  Web & Digital Media • Facebook • Twitter • LinkedIn • NewYorkTimes
  • 20. Why Hadoop?  Handle partial hardware failures without going down: • If machine fails, we should be switch over to stand by machine • If disk fails – use RAID or mirror disk  Able to recover on major failures: • Regular backups • Logging • Mirror database at different site  Capability: • Increase capacity without restarting the whole system (Pure Scale) • More computing power should equal to faster processing  Result consistency: • Answer should be consistent (independent of something failing) and returned in reasonable amount of time
  • 21. Consider the example of facebook, Facebook data has grown upto 100TB/day by 2013 and in future shall produce data of a much higher magnitude. They have many web servers and huge MySql (profile,friends etc.) servers to hold the user data. Hadoop solution framework – A practical example (1/2)
  • 22. Now to run various reports on these huge data For eg: 1) Ratio of men vs. women users for a period. 2) No of users who commented on a particular day. Soln: For this requirement they had scripts written in python which uses ETL processes. But as the size of data increased to this extent these scripts did not work. Hence their main aim at this point of time was to handle data warehousing and their home ground solutions were not working. This is when Hadoop came into the picture.. Hadoop solution framework – A practical example (2/2)
  • 23. Hadoop Distributed File System (HDFS) Agenda HDFS Definition Architecture HDFS Components
  • 24. HDFS Definition  The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.  HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.  It has many similarities with existing distributed file systems.  Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications.  HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.  HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high  throughput access to application data and is suitable for applications that have large data sets  HDFS consists of following components (daemons) • Name Node • Data Node • Secondary Name Node
• 25. HDFS Components (1/2)
 Name Node: the Name Node, a master server, manages the file system namespace and regulates access to files by clients. It has the following properties:
• Metadata in memory: the entire metadata is held in main memory
• Types of metadata: list of files; list of blocks for each file; list of Data Nodes for each block; file attributes, e.g. creation time, replication factor
• A transaction log records file creations, file deletions, etc.
 Data Node: a Data Node is a server that stores the actual data; there should be one per node. It has the following properties:
• A block server: stores data in the local file system (e.g. ext3); stores metadata for each block (e.g. CRC); serves data and metadata to clients
• Block report: periodically sends a report of all existing blocks to the Name Node
• Facilitates pipelining of data: forwards data to other specified Data Nodes
• 26. HDFS Components (2/2)
 Secondary Name Node
• It is not a hot standby or mirror node; a true failover node is planned for a future release.
• It is used for housekeeping, and in case of Name Node failure we can recover data from this node.
• It periodically takes a checkpoint (backup) of the Name Node metadata.
• Memory requirements are the same as for the Name Node (big).
• Typically runs on a separate machine in a large cluster (> 10 nodes).
• Its directory layout is the same as the Name Node's, except that it keeps the previous checkpoint version in addition to the current one.
• It can be used to restore a failed Name Node (just copy the current directory to the new Name Node). A minimal client-side sketch of using HDFS follows.
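To make the file system concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The fs.default.name URI and the /user/demo path are illustrative assumptions, not part of this deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed Name Node address; adjust to your cluster.
    conf.set("fs.default.name", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits it into blocks and the
    // Name Node decides which Data Nodes hold each replica.
    Path file = new Path("/user/demo/hello.txt");
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("Hello HDFS");
    out.close();

    // Read it back; the client fetches blocks directly from Data Nodes.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}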
• 29. Introduction to the MapReduce Framework
"A programming model for parallel data processing. Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, Ruby and C++."
 Map function:
• Operates on a set of key/value pairs
• Map is applied in parallel to the input data set
• This produces output keys and a list of values for each key, depending on the functionality
• Mapper output is partitioned per reducer; the number of partitions equals the number of reduce tasks for the job
 Reduce function:
• Operates on a set of key/value pairs
• Reduce is then applied in parallel to each group, again producing a collection of key/value pairs
• The number of reducers can be set by the user, as in the sketch below
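A minimal sketch of setting the reducer count, using the old org.apache.hadoop.mapred API that the WordCount example below also uses (the class and job names here are illustrative):

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountDemo {
  public static JobConf configure() {
    JobConf conf = new JobConf(ReducerCountDemo.class);
    conf.setJobName("demo");
    // Four reduce tasks: mapper output is hash-partitioned into four
    // partitions, one per reduce task.
    conf.setNumReduceTasks(4);
    return conf;
  }
}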
  • 30. How does a map-reduce algorithm work (1/2)
  • 31. How does a map-reduce algorithm work (2/2)
• 32. MapReduce Components
 JobTracker: the JobTracker is responsible for accepting jobs from clients, dividing those jobs into tasks, and assigning those tasks to be executed by worker nodes.
 TaskTracker: a TaskTracker is a process that manages the execution of the tasks currently assigned to its node. Each TaskTracker has a fixed number of slots for executing tasks (two map slots and two reduce slots by default).
• 33. MapReduce co-located with HDFS
 A client submits a MapReduce job to the JobTracker.
 Each slave node runs a TaskTracker and a DataNode; the JobTracker and NameNode need not be on the same node.
 Co-locating TaskTrackers (compute nodes) with DataNodes gives high aggregate bandwidth across the cluster.
• 34. Understanding processing in a M/R framework (1/2)
 The user runs a program on the client computer.
 The program submits a job. A job contains:
• Input data
• The Map/Reduce program
• Configuration information
 Two types of daemons control job execution:
• JobTracker (master node)
• TaskTrackers (slave nodes)
 The job is sent to the JobTracker, which communicates with the NameNode and assigns parts of the job to TaskTrackers (a TaskTracker runs on each DataNode).
 A task is a single MAP or REDUCE operation over a piece of data.
 Hadoop divides the input to a MAP/REDUCE job into equal-sized splits.
 The JobTracker knows (from the NameNode) which node contains the data, and which other machines are nearby.
 Task processes send heartbeats to their TaskTracker, and TaskTrackers send heartbeats to the JobTracker.
• 35. Understanding processing in a M/R framework (2/2)
 Any task that does not report within a certain time (default 10 min) is assumed to have failed; its JVM will be killed by the TaskTracker and the failure reported to the JobTracker.
 The JobTracker will reschedule any failed task (on a different TaskTracker).
 If the same task fails 4 times, the whole job fails.
 Any TaskTracker reporting a high number of failed tasks on a particular node causes that node to be blacklisted.
 The JobTracker maintains and manages the status of each job. Results from failed tasks are ignored.
• 36. Computing parallelism meets data locality
 All map tasks are equivalent, so they can run in parallel.
 All reduce tasks can also run in parallel.
 Input data on HDFS can be processed independently.
 Therefore, run each map task on whatever data is local (or closest) to a particular node in HDFS, which gives good performance.
• For map task assignment, the JobTracker has an affinity for a node that holds a replica of the input data.
• If lots of data does happen to pile up on the same node, nearby nodes will map instead.
 This also improves recovery from partial failure of servers or storage during the operation: if one map or reduce task fails, the work can be rescheduled.
• 37. Programming using MapReduce
WordCount is a simple application that counts the number of occurrences of each word in a given input file. Here we divide the code into 3 files:
1) WordCountMapper.java (named this way rather than Mapper.java so the class does not clash with Hadoop's Mapper interface)
2) WordCountReducer.java
3) Basic.java
• 38. WordCountMapper.java

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Renamed from Mapper to WordCountMapper so the class does not clash
// with the org.apache.hadoop.mapred.Mapper interface it implements.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Split the line into whitespace-separated tokens and emit <word, 1>.
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
• 39. For the following standard input, the mapper does the following.
Input:
I am working for TCS
TCS is a great company

The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key/value pair <word, 1>.

Output:
<I,1> <am,1> <working,1> <for,1> <TCS,1>
<TCS,1> <is,1> <a,1> <great,1> <company,1>
• 40. Sorted mapper output to reducer
The output of each map is passed through a sort which orders the map output by key. (Text keys compare in byte order, so uppercase words sort before lowercase ones.)
Output:
<I,1> <TCS,1> <TCS,1> <a,1> <am,1> <company,1> <for,1> <great,1> <is,1> <working,1>
• 41. WordCountReducer.java

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Renamed from Reducer to WordCountReducer so the class does not clash
// with the org.apache.hadoop.mapred.Reducer interface it implements.
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the 1s emitted for this word and emit <word, total>.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
• 42. Reducer output
 The output of the mapper is given to the reducer, which sums up the values, i.e. the occurrence counts for each key (the words in this example).
Output:
<I,1> <TCS,2> <a,1> <am,1> <company,1> <for,1> <great,1> <is,1> <working,1>
• 43. Basic.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Driver class: configures and submits the job.
// (It does not itself implement Mapper or Reducer.)
public class Basic {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Basic.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
• 44. Executing the MapReduce program
1) Compile all three Java files (with the Hadoop jars on the classpath), which creates the .class files.
2) Package the .class files into a single jar file:
jar -cvf file_name.jar *.class
3) Execute the jar:
bin/hadoop jar file_name.jar Basic input_file_name output_file_name
• 46. Hadoop Clusters
Agenda
 Cluster concepts
 Installing Hadoop
 Creating a pseudo cluster
• 47. Clustering in Hadoop
 Clustering in Hadoop can be achieved in the following modes:
 Local (standalone) mode - used for debugging:
• By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This mode is useful for debugging.
 Pseudo-distributed mode - used for development:
• Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process (see the configuration sketch below).
 Fully-distributed mode - used for production:
• In this mode all Hadoop daemons run on separate nodes; it is used for production.
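A minimal sketch of the pseudo-distributed configuration, using the classic property names of this Hadoop generation; the localhost ports shown are the conventional defaults, not something this deck specifies. Each file lives in Hadoop's conf/ directory.

conf/core-site.xml:
<configuration>
  <property>
    <!-- All daemons talk to one NameNode on localhost -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <!-- Single node, so keep one replica per block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <!-- The JobTracker also runs locally -->
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>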
• 49. Executing a pseudo cluster
 Format a new distributed file system:
$ bin/hadoop namenode -format
 Start the Hadoop daemons:
$ bin/start-all.sh
 Copy the input files into the distributed file system:
$ bin/hadoop fs -copyFromLocal input1 input
 Run one of the provided examples:
$ bin/hadoop jar hadoop-examples.jar wordcount input output
 Examine the output files: copy them from the distributed file system to the local file system and inspect them:
$ bin/hadoop fs -copyToLocal output output
$ cat output/part-00000
 When you're done, stop the daemons with:
$ bin/stop-all.sh
• 51. Hadoop Ecosystems
Agenda
 Pig concepts
 Hive concepts
 HBase concepts
• 52. Hadoop Ecosystems
 Apache Hive - SQL-like language and metadata repository
 Apache Pig - high-level language for expressing data analysis programs
 Apache HBase - the Hadoop database: random, real-time read/write access
 Sqoop - integrates Hadoop with RDBMSs
 Oozie - server-based workflow engine for Hadoop activities
 Hue - browser-based desktop interface for interacting with Hadoop
 Flume - distributed service for collecting and aggregating log and event data
 Apache Whirr - library for running Hadoop in the cloud
 Apache Zookeeper - highly reliable distributed coordination service
• 53. Pig Concepts
What is Pig?
 An open-source, high-level dataflow system introduced by Yahoo
 Provides a simple language for queries and data manipulation, Pig Latin, that is compiled into map-reduce jobs run on Hadoop (a small embedded example follows)
 Pig Latin combines the high-level data manipulation constructs of SQL with the procedural programming style of map-reduce
Why is it important?
 Companies and organizations like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click streams, search logs, and web crawls
 Some form of ad-hoc processing and analysis of all of this information is required
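As a concrete illustration, here is a minimal sketch of running a Pig Latin word count from Java through the PigServer API; the input and output paths are illustrative assumptions, and in practice the same script is usually typed directly into the Grunt shell.

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws IOException {
    // LOCAL runs against the local file system; use MAPREDUCE for a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Each registerQuery call adds one Pig Latin statement to the plan.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    // Triggers compilation into map-reduce jobs and writes the result.
    pig.store("counts", "wordcount_out");
  }
}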
• 55. Hive Concepts
What is Hive?
 An open-source data warehouse solution built on top of Hadoop, introduced by Facebook
 Supports an SQL-like declarative language called HiveQL; queries are compiled into map-reduce jobs executed on Hadoop (see the JDBC sketch below)
 Also supports custom map-reduce scripts plugged into queries
 Includes a system catalog, the Hive Metastore, for query optimization and data exploration
Why is it important?
 It is very easy to learn because it behaves much like SQL
 Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools; Hive supports extending the UDF set to handle use cases not covered by the built-in functions
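A minimal sketch of issuing HiveQL from Java over JDBC; the users table, its gender column, and the HiveServer2 URL (jdbc:hive2://localhost:10000) are illustrative assumptions. The query echoes the men-vs-women report from the Facebook example earlier.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; hive-jdbc must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // HiveQL looks like SQL but is compiled into map-reduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT gender, COUNT(*) FROM users GROUP BY gender");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}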
• 56. Hive execution plan
 Clients submit HiveQL via the CLI, JDBC, or ODBC.
 The Driver invokes the Compiler, which turns the query into a DAG of map-reduce jobs.
 The Execution Engine runs that DAG on Hadoop.
• 57. Difference between Pig & Hive
 Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library.
 Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data.
 If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop".
 Apache Hive offers an even more specific, higher-level language for querying data by running Hadoop jobs, rather than directly scripting the step-by-step operation of several MapReduce jobs.
 Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and business intelligence systems.
 Pig lets users express data flows in a language not unlike a bash or Perl script.
• 58. HBase
 Apache HBase in a few words: "HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable."
 HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS with SQL as its primary access language; there are many types of NoSQL databases.
 HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
 However, HBase has many features which support both linear and modular scaling.
 HBase provides an easy-to-use Java API for programmatic access, as sketched below.
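A minimal sketch of that Java API using the classic client classes of this era (HTable, Put, Get); the "users" table and "profile" column family are illustrative assumptions, and the table must already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath for the ZooKeeper quorum.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");

    // Write one cell: row "row1", family "profile", qualifier "name".
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    table.put(put);

    // Random read of the same row: the fast lookup that HDFS alone lacks.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}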
• 59. Why is it important?
 HBase is a Bigtable clone.
 It is open source.
 It has a good community and promise for the future.
 It is developed on top of, and has good integration with, the Hadoop platform - useful if you are using Hadoop already.
 It has a Cascading connector.
 No real indexes
 Automatic partitioning
 Scales linearly and automatically with new nodes
 Commodity hardware
 Fault tolerance
 Batch processing
• 60. Difference between HBase and Hadoop/HDFS
 HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states that it is not, however, a general-purpose file system, and it does not provide fast individual record lookups in files.
 HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion: HBase internally puts your data in indexed "StoreFiles" that live on HDFS, enabling high-speed lookups.