BIG DATA
BY,
SHASHANK SHETTY
ASSISTANT PROFESSOR, DEPT OF CSE
NMAM INSTITUTE OF TECHNOLOGY, Nitte
CONTENTS (1/2)
 Big Data Definition
 Areas Of Challenges
 Big data Attributes.
 Big data Source.
 Sample Events generating data
 New tools for big data
 Big data applications.
 Getting value from big data.
 Big data security
 Comparing Hadoop With RDBMS
 Hadoop
CONTENTS(2/2)
HDFS (Hadoop Distributed File System)
 MapReduce
HBASE
 YARN
 Hadoop Ecosystem
 SQOOP
 HIVE
 PIG
 HUE
 FLUME
 Conclusion
 Reference
WHAT IS BIG DATA???
BIG DATA is data so large that it becomes difficult to process using traditional systems.
SOURCE: PLANNING FOR BIG DATA, EDD DUMBILL, PP.1-4
WHAT IS BIG DATA???
DIFFICULT TO PROCESS BY TRADITIONAL SYSTEMS
A 200 MB document you are unable to send, a 150 GB image you are unable to view, a 200 TB video you are unable to edit: what counts as difficult depends on the capabilities of the system.
ORGANIZATION SPECIFIC
500 TB of text, audio and video data per day may be Big Data for one company and not Big Data for another: it depends on the capabilities of the organization.
AREAS OF CHALLENGE
CAPTURE
 SEARCH
 SHARING
 STORAGE
TRANSFER
ANALYSIS
VISUALIZATION
BIG DATA
BIG DATA ATTRIBUTES ???
 Large and Growing files
At High Speed
 In Various Format
V Attributes
VOLUME: the data results in large files.
VELOCITY: the data comes at high speed.
VARIETY: the files come in various formats.
Roughly 90% of the data is unstructured and mostly wasted; roughly 10% is structured data used in decision making. The challenge, and the opportunity, is to analyze the data and extract meaningful information.
BIG DATA SOURCES
Users, applications, systems and sensors are creating large and growing files (big data files).
DATA GENERATION POINT
EXAMPLES
 MOBILE DEVICES
 MICROPHONES
 READERS/SCANNERS
 CAMERAS
 MACHINE SENSORS
 SOCIAL MEDIA
 PROGRAMS/SOFTWARE
 SCIENCE FACILITIES
SAMPLE DATA TYPES
 VIDEOS
 AUDIOS
 IMAGES
 PHOTOS
 LOGS
 CLICK TRAILS
 TEXT MESSAGES
 EMAILS
 DOCUMENTS
 BOOKS
 TRANSACTIONS
 PUBLIC RECORDS
SAMPLE EVENTS GENERATING DATA
1) Airbus:
 Airbus generates 10 TB every 30 minutes
 About 640 TB is generated in one flight
2) Smart Meters:
 A smart meter reads the usage every 15 minutes
 Records 350 billion transactions every year.
 In 2009, there were 76 million smart meters.
 By 2014, there will be 200 million smart meters
SOURCE: HADOOP THE DEFINITIVE GUIDE, 3rd EDITION, PP.1-4
3) Camera Phones:
 5 million camera phones exist worldwide.
 Most of them have location awareness (GPS).
 22% of them are smartphones.
 By the end of 2013 the number of smartphones
will exceed the number of PCs.
4) Internet Users:
 2+ billion people use the internet.
 By 2014, Cisco estimates internet traffic of 4.8
zettabytes per year
SOURCE: HADOOP THE DEFINITIVE GUIDE, 3rd EDITION, PP.1-4
5) Blogs:
 There are 200 billion blog entries in the world.
6) Emails:
 300 million emails are sent every day.
7) RFID:
 In 2005, there were around 1.5 million RFIDs.
 In 2012, there are 30 million RFIDs.
Walmart has played
the major role
SOURCE: HADOOP THE DEFINITIVE GUIDE, 3rd EDITION, PP.1-4
8) Facebook:
 Facebook generates 25 TB of data daily.
9) Twitter:
 Twitter generates 12 TB of data daily.
 200 million users generate 230 million tweets daily.
 97,000 tweets are sent every second.
10) Trading:
 NYSE produces 1 TB per trading day.
11) Experiment:
 The CERN atomic facility generates 40 TB per second.
SOURCE: HADOOP THE DEFINITIVE GUIDE, 3rd EDITION, PP.1-4
SAMPLE EVENTS GENERATING DATA
Big Data:
 In 2009, the total data was estimated to be 1 ZB
 In 2020, it is estimated to be 35ZB
SOURCE: HADOOP THE DEFINITIVE GUIDE, 3rd EDITION, PP.1-4
New Tools For Big Data
Over time, TRADITIONAL SYSTEMS (e.g., RDBMS), which are not able to handle Big Data, have been complemented by BIG DATA TOOLS (e.g., Hadoop), which were created to handle Big Data.
Big Data Applications
 Companies gaining edge by collecting,
analyzing and understanding information.
 Government forecasting events and taking
proactive actions.
Getting value from Big Data
Collect Analyze Understand
EXTRACT
HERE
Big Data Security Issues
 Security and privacy issues are magnified by the V
attributes.
 Velocity
 Volume
 Variety
 Traditional security mechanisms, which are tailored
to securing small-scale static data, are inadequate.
SOURCE: CLOUD SECURITY ALLIANCES
Top Five Security Challenges
1) Secure Computation in Distributed
Programming Frameworks:
 Distributed programming frameworks utilize parallelism in
computation and storage to process massive amounts of data.
Example: the MapReduce framework:
 Splits input files into multiple chunks.
 These chunks are read by mappers, which output key/value pairs.
 The reducers combine the values belonging to each distinct key and output the
result.
OPPORTUNITY 1: Two major prevention measures arise:
1) Securing the mappers
2) Securing the data in the presence of untrusted mappers
SOURCE: CLOUD SECURITY ALLIANCES
2) Input Validation/Filtering
 Input Validation:
 What kind of data is untrusted?
 What are the untrusted data sources?
 Data Filtering:
 Filter Rogue or malicious data.
 Challenges/ opportunity
 GBs or TBs of continuous data
 Signature-based data filtering has limitations
SOURCE: CLOUD SECURITY ALLIANCES
3) Secure Data storage
 With data at various nodes, authentication, authorization and
encryption are challenging.
 Auto-tiering moves cold data onto a less secure medium
o What if the cold data is sensitive?
 Auto-tiering does not keep track of where the data is stored.
(new challenge)
 Encryption of real-time data can have a performance
impact.
 Challenges/opportunity:
 24/7 availability of data
 unauthorized access
SOURCE: CLOUD SECURITY ALLIANCES
4) Privacy concern in data mining.
 Sharing of results involves multiple challenges:
o Invasion of privacy.
o Invasive marketing.
o Unintentional disclosure of information.
 Example: data held by companies and government agencies is
constantly mined and analyzed by inside analysts
and also potentially by outside contractors.
 Challenges/Opportunity: robust and scalable
privacy-preserving mining algorithms
SOURCE: CLOUD SECURITY ALLIANCES
5) Cryptography enforced access
control and secure communication
 To ensure end-to-end security of private data.
 Accessible only to authorized entities.
 Hence cryptographically enforced access control
has to be implemented.
 Challenges/opportunities: the main
problem with encrypting data, especially large data
sets, is the all-or-nothing retrieval policy,
which prevents users from easily searching or sharing the data.
SOURCE: CLOUD SECURITY ALLIANCES
COMPARING HADOOP
WITH RDBMS
Comparing Hadoop with RDBMS
 Until recently, many applications utilized relational
database systems (RDBMS) for batch processing.
- Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
- Hadoop doesn't fully replace relational products; many
architectures would benefit from both Hadoop and
relational product(s).
 Scale-out vs. Scale-up
- RDBMS products scale up
 Expensive to scale for large installations.
 Hits a ceiling when storage reaches 100s of terabytes.
- Hadoop clusters can scale out to 100s of machines and to
petabytes of storage.
 Structured relational vs. semi-structured vs. unstructured
- RDBMS works well for structured data: tables that conform to
a predefined schema.
- Hadoop works best on semi-structured and unstructured data.
 Semi-structured data may have a schema that is loosely followed.
 Unstructured data has no structure whatsoever and is usually
blocks of text (or, for example, images).
 At processing time, the types for keys and values are chosen by
the implementer.
- Certain types of input data will not easily fit into a relational
schema, such as JSON, XML, etc.
Comparing Hadoop with RDBMS (contd..)
 Offline batch vs. online transactions
- Hadoop was not designed for real-time, low-latency
queries.
- Products that do provide low-latency queries, such as HBase,
have limited query functionality.
- Hadoop performs best for offline batch processing on large
amounts of data.
- RDBMS is best for online transactions and low-latency
queries.
- Hadoop is designed to stream large files and large amounts of
data.
- RDBMS works best with small records.
Comparing Hadoop with RDBMS (contd..)
HADOOP
 Framework for running applications on large clusters of
commodity hardware
 Scale: petabytes of data on thousands of nodes
 Includes
 Storage: HDFS
 Processing: MapReduce
 Supports the Map/Reduce programming model
 Requirements
 Economy: use clusters of commodity computers
 Easy to use
 Users: no need to deal with the complexity of
distributed computing
 Reliable: can handle node failures automatically
What's Hadoop? (contd.)
 Hadoop is a software platform that lets one easily write
and run applications that process vast amounts of data.
 Here's what makes Hadoop especially useful:
 Scalable
 Economical
 Efficient
 Reliable
Hadoop, Why?
 Need to process Multi Petabyte Datasets
 Expensive to build reliability in each application.
 Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
 Need common infrastructure
– Efficient, reliable, Open Source Apache License
 The above goals are the same as Condor's, but
workloads are I/O bound, not CPU bound
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
HDFS
(Hadoop Distributed File
System)
HDFS
 Hadoop implements MapReduce, using the Hadoop
Distributed File System (HDFS).
 MapReduce divides applications into many small blocks of
work. HDFS creates multiple replicas of data blocks for
reliability, placing them on compute nodes around the
cluster. MapReduce can then process the data where it is
located.
 Hadoop has been demonstrated on clusters with 2000
nodes. The current design target is 10,000 node clusters.
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to
where data resides
– Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
Hadoop at Facebook
• Production cluster
– 4800 cores, 600 machines, 16GB per machine – April 2009
– 8000 cores, 1000 machines, 32 GB per machine – July
2009
– 4 SATA disks of 1 TB each per machine
– 2 level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q3
2009
• Test cluster
• 800 cores, 16GB each
Hadoop Architecture
[Figure: input data is split into DFS blocks that are replicated across the nodes of the Hadoop cluster; MAP tasks process the blocks where they are stored, and a Reduce step combines their outputs into the results.]
MapReduce
Map/Reduce Processes
 Launching Application
– User application code
– Submits a specific kind of Map/Reduce job
 JobTracker
– Handles all jobs
– Makes all scheduling decisions
 TaskTracker
– Manager for all tasks on a given node
 Task
– Runs an individual map or reduce fragment for a
given job
– Forks from the TaskTracker
Map/Reduce Processes (cont'd)
 Hadoop Map/Reduce – Goals:
• Process large data sets
• Cope with hardware failure
• High throughput
Hadoop Map-Reduce Architecture
 Master-Slave architecture
 Map-Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and tasktracker status, re-executes tasks upon
failure
 Map-Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction from the Jobtracker
– Manage storage and transmission of intermediate output
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the
NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
-- One replica on local node
-- Second replica on a remote rack
-- Third replica on same remote rack
-- Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from DataNode
– If Validation fails, Client tries other replicas
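As a rough illustration of the per-chunk checksum idea (a sketch of the concept, not the actual HDFS implementation; the 512-byte chunk size matches the default mentioned above), the following Java snippet computes one CRC32 value per 512-byte chunk of a buffer:

import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512; // checksum granularity, as on the slide

    // Returns one CRC32 value per 512-byte chunk of the given data.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] result = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            result[i] = crc.getValue(); // stored alongside the block, re-verified on read
        }
        return result;
    }
}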
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Data Pipelining
• Client retrieves a list of DataNodes on which to place
replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode
in the Pipeline
• When all replicas are written, the Client moves on to write
the next block in file
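From the client's point of view, the block placement and pipelining described above are hidden behind the HDFS FileSystem API. A minimal Java sketch of writing and then reading a file through that API is shown below; the NameNode address and paths are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write: the client streams bytes; HDFS splits them into blocks and
        // pipelines each block through the chosen DataNodes behind the scenes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: each block is served from the nearest available replica.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}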
Data Flow
[Figure: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster, alongside Oracle RAC and MySQL.]
Example for MapReduce
 Page 1: the weather is good
 Page 2: today is good
 Page 3: good weather is good.
Reduce Input
 Worker 1:
 (the 1)
 Worker 2:
 (is 1), (is 1), (is 1)
 Worker 3:
 (weather 1), (weather 1)
 Worker 4:
 (today 1)
 Worker 5:
 (good 1), (good 1), (good 1), (good 1)
Reduce Output
 Worker 1:
 (the 1)
 Worker 2:
 (is 3)
 Worker 3:
 (weather 2)
 Worker 4:
 (today 1)
 Worker 5:
 (good 4)
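The same word-count flow can be written against the Hadoop MapReduce Java API. The sketch below uses the org.apache.hadoop.mapreduce ("new") API; the input and output paths are assumptions for the example. The mapper emits (word, 1) pairs and the reducer sums them, exactly as in the worker tables above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {       // emit (word, 1) for every token
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;                        // e.g. (good 1)(good 1)(good 1)(good 1) -> (good 4)
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/pages"));    // assumed input dir
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/counts")); // assumed output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}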
Parallel Execution
Hadoop Web Interface
• MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of
the Hadoop cluster, running/completed/failed jobs and a job history log file.
It also gives access to the local machine's Hadoop log files (the machine on
which the web UI is running).
By default, it's available at http://localhost:50030/
• Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also
gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50060/
• HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information
about total/remaining capacity, live and dead nodes. Additionally, it allows
you to browse the HDFS namespace and view the contents of its files in the
web browser. It also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/
HBASE
HBase is a database: the Hadoop database. It is indexed by row key,
column key, and timestamp.
 HBase stores structured and semistructured data naturally so you can
load it with tweets and parsed log files and a catalog of all your products
right along with their customer reviews.
It can store unstructured data too, as long as it's not too large
HBase is designed to run on a cluster of computers instead of a single
computer. The cluster can be built using commodity hardware; HBase
scales horizontally as you add more machines to the cluster.
HBASE (Contd…)
Each node in the cluster provides a bit of storage, a bit of cache,
and a bit of computation as well.
This makes HBase incredibly flexible and forgiving. No node is
unique, so if one of those machines breaks down, you simply
replace it with another.
 This adds up to a powerful, scalable approach to data that, until
now, hasn't been commonly available to mere mortals.
HBASE DATA MODEL:
HBase data model – these six concepts form the foundation of HBase.
Table:
HBase organizes data into tables. Table names are Strings and composed of characters
that are safe for use in a file system path.
Row:
 Within a table, data is stored according to its row. Rows are identified uniquely by
their row key. Row keys don't have a data type and are always treated as a byte[].
Column family:
 Data within a row is grouped by column family. Column families also impact the
physical arrangement of data stored in HBase.
 For this reason, they must be defined up front and aren't easily modified. Every row
in a table has the same column families, although a row need not store data in all its
families. Column family names are Strings and composed of characters that are safe for
use in a file system path.
 Column qualifier:
 Data within a column family is addressed via its column qualifier, or column.
Column qualifiers need not be specified in advance. Column qualifiers need not be
consistent between rows.
 Like row keys, column qualifiers don't have a data type and are always treated as a
byte[].
 Cell:
A combination of row key, column family, and column qualifier uniquely identifies
a cell. The data stored in a cell is referred to as that cell's value. Values also don't
have a data type and are always treated as a byte[].
 Version:
 Values within a cell are versioned. Versions are identified by their timestamp, a
long. When a version isn't specified, the current timestamp is used as the basis for the
operation. The number of cell value versions retained by HBase is configured via the
column family. The default number of cell versions is three.
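To make the table / row key / column family / qualifier / version model concrete, here is a minimal Java sketch using the HBase client API (the Connection/Table style of HBase 1.x and later). The table name "users", the column family "info" and the qualifier "email" are assumptions for the example, and the column family must already exist in the table's schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseModelSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Row key, column family, qualifier and value are all just byte[].
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user@example.com"));
            table.put(put); // writes a new, timestamped version of the cell

            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email)); // latest version of the cell's value
        }
    }
}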
HBase Architecture
HBase Tables and Regions
Table is made up of any number of regions.
Region is specified by its startKey and endKey.
 Empty table: (Table, NULL, NULL)
 Two-region table: (Table, NULL, “com.ABC.www”) and
(Table, “com.ABC.www”, NULL)
Each region may live on a different node and is made up of
several HDFS files and blocks, each of which is replicated by
Hadoop
YARN
Why Next Generation MR
 Reliability
 Availability
 Scalability - Clusters of 10,000 machines and 200,000
cores, and beyond.
 Backward (and Forward) Compatibility
 Ensure customers’ MapReduce applications run
unchanged in the next version of the framework.
 Evolution – Ability for customers to control upgrades to
the Hadoop software stack.
 Predictable Latency – A major customer concern.
 Cluster utilization
Why Next Generation MR
 Secondary Requirements
–Support for alternate programming
paradigms to MapReduce.
–Support for short-lived services
Re-Architecture
• Need
– Separate the tasks of Job Tracker
• Resource management
• Job Scheduling / Management
So, what did we come up with?
• Resource Manager
• Node Manager
• Application
Master
• Container
Resource Manager (RM)
Manages the global
assignment of compute
resources to applications.
Resource Manager (RM)
• A pure Scheduler
• No monitoring, tracking
status of application
• No guarantee on restarting
failed tasks.
Resource Manager (RM)
• Each client/application may
request multiple resources
– Memory
– Network
– CPU
– Disk ...
• This is a significant change
from static Mapper /
Reducer model
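As a rough Java sketch of what such a request looks like through the YARN client library (an illustration only, not a complete ApplicationMaster; it would have to run inside a launched AM container, and the 1 GB / 1 vcore figures are assumptions), an ApplicationMaster asks the ResourceManager for containers roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
    public static void main(String[] args) throws Exception {
        // Client an ApplicationMaster uses to talk to the ResourceManager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new Configuration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, ""); // register this AM with the RM

        // Describe what one container should get: memory (MB) and virtual cores.
        Resource capability = Resource.newInstance(1024, 1); // assumed 1 GB, 1 vcore
        Priority priority = Priority.newInstance(0);

        // No fixed mapper/reducer notion: just "give me a container of this size",
        // optionally constrained to particular nodes or racks (null = anywhere).
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // Allocated containers come back on subsequent allocate() heartbeats.
        rmClient.allocate(0.0f);
        rmClient.stop();
    }
}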
Application Master
• A per-application
ApplicationMaster (AM) that
manages the application's life
cycle (scheduling and
coordination).
• An application is either a single
job in the classic MapReduce
jobs or a DAG of such jobs.
Application Master
A per-application
ApplicationMaster (AM) that
manages the application's life
cycle.
Application Master
• Application Master has the
responsibility of
– negotiating appropriate resource
containers from the Scheduler
– launching tasks
– tracking their status
– monitoring for progress
– handling task-failures.
Node Manager
• The NodeManager is the per-machine
framework agent
– responsible for launching the
applications' containers,
monitoring their resource usage
(cpu, memory, disk, network) and
reporting the same to the
Scheduler.
Gain with New Architecture
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
• Support for programming paradigms other than MapReduce
Gain with New Architecture
• RM and Job manager segregated
• The Hadoop MapReduce JobTracker
spends a very significant portion of
time and effort managing the life
cycle of applications
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
Gain with New Architecture
• ResourceManager
– Uses ZooKeeper for fail-over.
– When primary fails, secondary can
quickly start using the state stored
in ZK
• Application Master
– MapReduce NextGen supports
application specific checkpoint
capabilities for the
ApplicationMaster.
– MapReduce ApplicationMaster can
recover from failures by restoring
itself from state saved in HDFS.
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
Gain with New Architecture
• MapReduce NextGen uses wire-
compatible protocols to allow
different versions of servers and
clients to communicate with
each other.
• Rolling upgrades for the cluster
in future.
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
Gain with New Architecture
• New framework is generic.
– Can come up with non-MR parallel
computing techniques
– Different versions of MR running in
parallel
– End users can upgrade to MR versions
on their own schedule
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
Gain with New
Architecture
• MRv2 uses a general concept of a
resource for scheduling and allocating to
individual applications.
• Container , can be a mapper or a reducer
or … ?
• Rigid notion of Mapper / Reducer
abolished
• Better cluster utilization
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
The Hadoop
Ecosystem
 When Hadoop 1.0.0 was released by Apache in 2011, comprising
mainly HDFS and MapReduce, it soon became clear that Hadoop
was not simply another application or service, but a platform
around which an entire ecosystem of capabilities could be built.
Since then, dozens of self-standing software projects have sprung
into being around Hadoop, each addressing a variety of problem
spaces and meeting different needs.
 Many of these projects were begun by the same people or
companies who were the major developers and early users of
Hadoop; others were initiated by commercial Hadoop distributors.
The majority of these projects now share a home with Hadoop at
the Apache Software Foundation, which supports open-source
software development and encourages the development of the
communities surrounding these projects.
SQOOP
SQOOP
 Data Import/ Export.
 Sqoop is a tool designed to help
users of large data import existing
relational databases into their Hadoop
clusters.
 Automatic data import.
 Easily imports data from many
databases into Hadoop.
 Generates code for use in MapReduce
applications.
Source: Big Data Analytics with Hadoop
 Sqoop is a tool designed to transfer data between Hadoop
and relational databases.
 You can use Sqoop to import data from a relational database
management system (RDBMS) such as MySQL or Oracle into
the Hadoop Distributed File System (HDFS), transform the
data in Hadoop MapReduce, and then export the data back
into an RDBMS.
What is Sqoop?
SQOOP
[Figure: RDBMS → Sqoop → HDFS]
HIVE
HIVE
 Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.
 While initially developed by Facebook, Apache Hive is now
used and developed by other companies such as Netflix.
 Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance
Data Units: Databases, Tables, Partitions, Buckets (or Clusters).
Hive, Why?
• Need a Multi Petabyte Warehouse
• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMS have a closed data format
– flexible schema
• Hive is a Hadoop subproject!
Hadoop & Hive History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
Hive architecture
•Hive structures data into well-understood database concepts
such as: tables, rows, cols, partitions
•It supports primitive types: integers, floats, doubles, and
strings
•Hive also supports:
–associative arrays: map<key-type, value-type>
–Lists: list<element type>
–Structs: struct<field name: field type, ...>
•SerDe: the serialize/deserialize API used to move data
in and out of tables
Data model
Query Language (HiveQL)
• Subset of SQL
• Meta-data queries
• Limited equality and join predicates
• No inserts on existing tables (to preserve the
WORM, write-once-read-many, property)
– Can overwrite an entire table
Hive - DDL
 Create table
hive> CREATE TABLE customer (age INT, address STRING);
 Partitions
hive> CREATE TABLE customer (age INT, address STRING)
PARTITIONED BY ( sdate STRING) ;
 Show table
hive> SHOW TABLES ;
 Describe table
hive> DESCRIBE customer;
Hive - DDL
 Alter table
hive> ALTER TABLE customer ADD COLUMNS (name STRING);
 Drop table
hive> DROP TABLE customer;
HiveQL Examples
HiveQL, an SQL-like language
hive> SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15';
selects all data from the table for a partition but doesn't store it
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file'
SELECT a.* FROM customer a WHERE a.sdate = '2008-08-15';
writes all of the customer table to an HDFS directory
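The same queries can also be issued programmatically. A minimal Java sketch using the HiveServer2 JDBC driver is shown below; the host/port, database name and the absence of authentication are assumptions for the example, and hive-jdbc must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", ""); // assumed host/port, no auth
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15'")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1)); // one age per matching row
            }
        }
    }
}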
Wordcount in Hive
FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
Hive Usage in Facebook
• Hive and Hadoop are extensively used in Facebook for
different kinds of operations.
• 700 TB = 2.1 petabytes after replication!
• Think of other application model that can leverage
Hadoop MR.
Hive – Related Projects
 Apache Flume – move large data sets to Hadoop
 Apache Sqoop – cmd line, move rdbms data to Hadoop
 Apache HBase – non-relational database
 Apache Pig – analyse large data sets
 Apache Oozie – work flow scheduler
 Apache Mahout – machine learning and data mining
 Apache Hue – Hadoop user interface
 Apache ZooKeeper – coordination / configuration
PIG
Introduction
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data
manipulation, Pig Latin, that is compiled into map-reduce
jobs that are run on Hadoop
– Pig Latin combines the high-level data manipulation
constructs of SQL with the procedural programming of
map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and
Microsoft are collecting enormous data sets in the form of
click streams, search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
Existing Solutions
• Parallel database products (ex: Teradata)
– Expensive at web scale
– Data analysis programmers find the declarative SQL
queries to be unnatural and restrictive
• Raw map-reduce
– Complex n-stage dataflows are not supported; joins
and related tasks require workarounds or custom
implementations
– Resulting code is difficult to reuse and maintain; shifts
focus and attention away from data analysis
Language Features
• Several options for user-interaction
– Interactive mode (console)
– Batch mode (prepared script files containing Pig Latin commands)
– Embedded mode (execute Pig Latin commands within a Java program)
• Built primarily for scan-centric workloads and read-only data
analysis
– Easily operates on both structured and schema-less, unstructured data
– Transactional consistency and index-based lookups not required
– Data curation and schema management can be overkill
• Flexible, fully nested data model
• Extensive UDF support
– Currently must be written in Java
– Can be written for filtering, grouping, per-tuple processing, loading and
storing
Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of
SQL
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
– No need to rely on the system to choose the desired plan via optimizer hints
• Pipeline splits are supported
– SQL requires the join to be run twice or materialized as an intermediate result
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: 'Peter'
– Tuple: a sequence of fields that can be any of the data types
• ex: ('Peter', 14)
– Bag: a collection of tuples of potentially varying structures,
can contain duplicates
• ex: {('Peter'), ('Bob', (14, 21))}
– Map: an associative array, the key must be a chararray but
the value can be any type
Data Model (continued)
• By default Pig treats undeclared fields as bytearrays
(collection of uninterpreted bytes)
• Can infer a field's type based on:
– Use of operators that expect a certain type of field
– UDFs with a known or explicitly set return type
– Schema information provided by a LOAD function
or explicitly declared using an AS clause
• Type conversion is lazy
Pig problem
• Fragment-replicate; skewed; merge join
• User has to know when to use which join
• Because... Pig is a
domestic animal;
it does whatever
you tell it to do.
- Alan Gates
Images from http://wiki.apache.org/pig/PigTalksPapers
HUE
Hue – What is it ?
 Hue = Hadoop User Experience
 Hue is an open-source Web interface that supports Apache
Hadoop and its ecosystem, licensed under the Apache v2 license.
 Its main goal is to have the users "just use" Hadoop without
worrying about the underlying complexity or using a command
line
 An open source Hadoop GUI
 Developed by Cloudera
 Web based
 Many functions
Hue – Why ???
 It is widely used
 It ships with Hadoop
 It integrates with Hadoop tools i.e.
 Hive
 Oozie
 HDFS
 It has an API for app creation
Hue Features
 HDFS file browser
 Job browser / designer
 Hive / Pig query editor
 Oozie app for work flows
 Has Hadoop API
 Access to shell
 User Admin
 App for Solr searches
Hue File Browser
Hue Job Browser
Hue Job Designer
Hue Query Editor
Hue Work Flow
FLUME
What is Apache Flume?
● It is a distributed data collection service that gets
flows of data (like logs) from their source and
aggregates them to where they have to be processed.
● Goals: reliability, scalability, extensibility,
manageability.
Exactly what I needed!
The Flume Model: Flows and
Nodes
● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are comprised of nodes chained together.
The Flume Model: Flows and Nodes
● In a Node, data come in through a source, are optionally processed by one or
more decorators, and are then transmitted out via a sink.
● Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
● Sink examples: Console, local files, HDFS, S3, other nodes...
● Decorator examples: wire batching, compression, sampling, projection, extraction...
The Flume Model: Agent, Processor and Collector Nodes
● Agent: receives data from an application.
● Processor (optional): intermediate processing.
● Collector: writes data to permanent storage.
The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.
The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters are coordinated via ZooKeeper (ZK).
● Specify sources, sinks and control data flows.
Flume Goals: Reliability
Tunable Failure Recovery Modes
● Best Effort
● Store on Failure and Retry
● End to End Reliability
Flume Goals: Scalability
Horizontally Scalable Data Path
Load Balancing
Flume Goals: Scalability
Horizontally Scalable Control Path
Flume Goals: Extensibility
 Simple Source and Sink API
 Event streaming and composition of simple
operation
 Plug in Architecture
 Add your own sources, sinks, decorators
Conclusion
Flume is
● Distributed data collection service
● Suitable for enterprise setting
● Large amount of log data to process
Conclusion
 Big data is here to stay. It is impossible to imagine
the next generation without it consuming data,
producing new forms of data and containing data-
driven algorithms.
 As compute environments become cheaper,
application environments become networked over the
cloud. So security, access control, compression and
encryption introduce challenges that have to be
addressed in a systematic manner.
References
[1] Chris Eaton, Dirk deRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, pp. 3-49.
[2] Mike Barlow, Real-Time Big Data Analytics: Emerging Architecture, February 2013, first edition,
pp. 1-21.
[3] Sachidanand Singh, Nirmala Singh, Big Data Analytics,2012 International Conference on
Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai,
India
[4] Big Data Introduction, www.youtube.com/watch?v=e6kovHZ6FVc
[5] Hadoop Video, www.youtube.com/watch?v=OoEpfbbyga8
[6] Cloud Security Alliance, Big Data Security and privacy issues, November 2012.
[7] http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
[8] http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html
[9] http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related
[10] http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related
[11] http://labs.google.com/papers/gfs-sosp2003.pdf
[12] http://hadoop.apache.org/core/docs/current/hdfs_design.html
[13] http://hadoop.apache.org/core/docs/current/api/
[14] http://hadoop.apache.org/hive/
[15]http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_h
sieh_hadoop_log_processing
[16] http://www.slideshare.net/cloudera/inside-flume
[17] http://www.slideshare.net/cloudera/flume-intro100715
[18] http://www.slideshare.net/cloudera/flume-austin-hug-21711
THANK YOU!!!
Contenu connexe

Tendances

Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
Ajay Ohri
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
Zubair Nabi
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
Edureka!
 

Tendances (20)

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data simplified
Big Data simplifiedBig Data simplified
Big Data simplified
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 

Similaire à BIG DATA

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
Evert Lammerts
 

Similaire à BIG DATA (20)

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Big data
Big dataBig data
Big data
 
Lesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxLesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptx
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articles
 
IRJET- Secured Hadoop Environment
IRJET- Secured Hadoop EnvironmentIRJET- Secured Hadoop Environment
IRJET- Secured Hadoop Environment
 
Big Data
Big DataBig Data
Big Data
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 

Dernier

Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 

Dernier (20)

DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 

BIG DATA

  • 1. BIG DATA BY, SHASHANK SHETTY ASSISTANT PROFESSOR, DEPT OF CSE NMAM INSTITUTE OF TECHNOLOGY, Nitte
  • 2. CONTENTS (1/2)  Big Data Definition  Areas Of Challenges  Big data Attributes.  Big data Source.  Sample Events generating data  New tools for generating data  Big data applications.  Getting value from big data.  Big data security  Comparing Hadoop With RDBMS  Hadoop
  • 3. CONTENTS(2/2) HDFS (Hadoop distributed File sytem)  MapReduce HBASE  YARN  Hadoop Ecosystem  SQOOP  HIVE  PIG  HUE  FLUME  Conclusion  Reference
  • 4. WHAT IS BIG DATA??? BIG DATA So Large Data That It Becomes Difficult To Process It using Traditional Systems Is SOURCE: PLANNING FOR BIG DATA, EDD DUMBILL, PP.1-4
  • 5. WHAT IS BIG DATA???
  • 6. DIFFICULT TO PROCESS BY TRADITIONAL SYSTEM 200 MB DOCUMENT 150 GB IMAGE 200 TB VIDEO Unable To Send Unable To View Unable To Edit Depends On the Capabilities of the System
  • 7. ORGANIZATION SPECIFIC 500TB Text, Audio, Video Data Per Day BIG DATA NOT A BIG DATA COMPANY 1 COMPANY 2 Depends On the Capabilities of the Organization
  • 8. AREAS OF CHALLENGE CAPTURE  SEARCH  SHARING  STORAGE TRANSFER ANALYSIS VISUALIZATION
  • 9. BIG DATA BIG DATAATTRIBUTES ???  Large and Growing files At High Speed  In Various Format
  • 10. V Attributes VOLUME VELOCITY VARIETY The Data Comes At High Speed The Data Results Into Large Files The Files Come Under Various Formats
  • 12. BIG DATA SOURCES USERS APPLICATION SYSTEMS SENSORS LARGE AND GROWING FILES (BIG DATA FILES) Are Creating
  • 13. DATA GENERATION POINT EXAMPLES  MOBILE DEVICES  MICROPHONES  READERS/SCANNERS  CAMERAS  MACHINE SENSORS  SOCIAL MEDIA  PROGRAMS/ SOFTWARES SCIENCE FACILITIES
  • 14. SAMPLE DATA TYPES  VIDEOS  AUDIOS  IMAGES  PHOTOS  LOGS  CLICK TRAILS  TEXT MESSAGES  EMAILS  DOCUMENTS  BOOKS  TRANSACTIONS  PUBLIC RECORDS
  • 15. SAMPLE EVENTS GENERATING DATA 1)Air Bus:  Airbus generates 10TB every 30 minutes  About 640TB is generated in one flight 2) Smart Meters:  Smart Meter reads the usage every 15 minutes  Records 350 Billion Transaction every year.  In 2009, there were 76 million smart meters.  By 2014, there will be 200 million smart meters SOURCE: HADOOP THE DEFINETIVE GUIDE, 3rd EDITION, PP.1-4
  • 16. 3) Camera Phones:  5 million camera phones are there world wide.  Most of them have location Awareness ( G.P.S)  22% of them are Smartphone's.  By the End of 2013 the number of Smartphone's will exceed the number of PC‟s 4) Internet Users:  2+ billion people use internet.  By 2014 CISCO estimates internet traffic 4.8 Zettabytes per year SOURCE: HADOOP THE DEFINETIVE GUIDE, 3rd EDITION, PP.1-4
  • 17. 5) Blogs:  There are 200 billion blog entries in the world. 6) Emails:  300 million Emails are sent every day. 7) RFID:  In 2005, there were around 1.5 million RFID‟s  In 2012, there are 30 million RFID‟s WalMart as played the major role SOURCE: HADOOP THE DEFINETIVE GUIDE, 3rd EDITION, PP.1-4
  • 18. 8) Facebook:  Facebook generates 25TB of data daily. 9) Twitter:  Twitter generates 12TB of data daily.  200 million users generating 230 million tweets daily.  97,000 tweets are sent every seconds. 10) Trading:  NYSE produces 1TB per trading day. 11) Experiment:  CERN atomic facility generates 40TB per second. SOURCE: HADOOP THE DEFINETIVE GUIDE, 3rd EDITION, PP.1-4
  • 19. SAMPLE EVENTS GENERATING DATA Big Data:  In 2009, the total data was estimated to be 1 ZB  In 2020, it is estimated to be 35ZB SOURCE: HADOOP THE DEFINETIVE GUIDE, 3rd EDITION, PP.1-4
  • 20. New Tools For Big Data TRADITIONAL SYSTEMS (E.g.,RDBMS) BIG DATA TOOLS (E.g., HADOOP) TIMEE Not able to handle Big Data Created to handle Big Data
  • 21. Big Data Applications  Companies gaining edge by collecting, analyzing and understanding information.  Government forecasting events and taking proactive actions.
  • 22. Getting value from Big Data Collect Analyze Understand EXTRACT HERE
  • 23. Big Data Security Issues  Security and privacy issues are magnified by the V attributes.  Velocity  Volume  Variety  Traditional Security mechanisms which are tailoured to securing small scale static data are inadequate. SOURCE: CLOUD SECURITY ALLIANCES
  • 24. Top Five Security Challenges 1) Secure Computation in Distributed Programming framework:  Distributed programming framework utilizes parallism in computation and storage to process massive amount of data. Example: MAPREDUCE Framework:  Splits input files into multiple chuncks.  These chunks are read by the mapper and outputs key/value pairs.  The reducer combines the values belonging to distinct key and outputs the result. OPPORTUNITY 1: Two Major prevention measure arises 1) Securing Mapper 2) Securing the data in the presence of untrusted mapper SOURCE: CLOUD SECURITY ALLIANCES
  • 25. 2) Input Validation/Filtering  Input Validation:  What kind of data is untrusted?  What are the untrusted data sources?  Data Filtering:  Filter Rogue or malicious data.  Challenges/ opportunity  GB‟s or TB‟s Continuous data  Signature based data filtering has limitations SOURCE: CLOUD SECURITY ALLIANCES
  • 26. 3) Secure Data storage  Data at various nodes, authentication, authorization and encryption is challenging.  Autotiering moves cold data into lesser secure medium o What if the cold data is sensitive?  Autotier doesnot keep track of where the data is stored. (new challenge)  Encryption of real time data can have performance impact.  Challenges/opportunity:  24/7 availability of data  unauthorized access SOURCE: CLOUD SECURITY ALLIANCES
  • 27. 4) Privacy concern in data mining.  Sharing of results involve multiple challenges. o Invasion of privacy. o Invasive Marketing. o Unintentional disclosure of Information.  Example: Companies and government agencies they constantly mined and analyzed by the inside analysts and also potentially outside contractors  Challenges/Opportunity: Robust and scalable privacy preserving mining algorithms SOURCE: CLOUD SECURITY ALLIANCES
  • 28. 5) Cryptography enforced access control and secure communication  To ensure end to end secure private data.  Accessible to only authorized entity.  Hence Cryptography enforced access control has to be implemented.  Challenges/ opportunities: The main problem to encrypt data especially large data sets, is all-or-nothing retrieval policy, disallowing user to easily search or share data. SOURCE: CLOUD SECURITY ALLIANCES
  • 30. Comparing Hadoop with RDBMS  Until recently many applications utilized Relational Database systems (RDBMS) for batch processing. -Oracle, Sybase, MySQL, Microsoft SQL, Server etc. -Hadoop doesn‟t fully replace relational products; many architectures would benefit from both hadoop and Relational product(s).  Scale-Out vs Scale-up -RDBMS products scale up  Expensive to scale for large installation.  Hits a ceiling when storage reaches 100s of terabytes. - Hadoop clusters can scale-out to 100s of machines and to petabytes of storage.
  • 31.  Structured Relational vs Semi-structured vs unstructured -RDBMS works well for structured data-tables that conform to a predefined schema. -Hadoop works best on semi structured and unstructured data.  Semi-structured may have schema that is loosely followed.  Unstructured data has no structure whatsoever and Is usually blocks of text (or for example images)  At processing time types for key and values are choosen by the implementer. -Certain types of input data will not easily fit into relational schema such as JSON, XML etc. Comparing Hadoop with RDBMS (contd..)
  • 32.  Offline batch vs Online Transactions - Hadoop was not designed for real time and low latency queries. - Products that do provide low latency queries such as Hbase have limited query functionality. - Hadoop performs best for offline batch processing on large amounts of data. - RDBMS is best for online transactions and low latency queries. - Hadoop is designed to stream large files and large amounts of data. - RDBMS works best with small records. Comparing Hadoop with RDBMS (contd..)
  • 34.  Framework for running applications on large clusters of commodity hardware  Scale: petabytes of data on thousands of nodes  Include  Storage: HDFS  Processing: MapReduce Support the Map/Reduce programming model  Requirements  Economy: use cluster of comodity computers  Easy to use Users: no need to deal with the complexity of distributed computing  Reliable: can handle node failures automatically
  • 35. What's Hadoop ..Contd??  Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.  Here's what makes Hadoop especially useful:  Scalable  Economical  Efficient  Reliable
  • 36. Hadoop, Why?  Need to process Multi Petabyte Datasets  Expensive to build reliability in each application.  Nodes fail every day – Failure is expected, rather than exceptional. – The number of nodes in a cluster is not constant.  Need common infrastructure – Efficient, reliable, Open Source Apache License  The above goals are same as Condor, but Workloads are IO bound and not CPU bound
  • 37. Who uses Hadoop? • Amazon/A9 • Facebook • Google • IBM • Joost • Last.fm • New York Times • PowerSet • Veoh • Yahoo!
  • 39. HDFS  Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.)  MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.  Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.
  • 40. Goals of HDFS • Very Large Distributed File System – 10K nodes, 100 million files, 10 PB • Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recovers from them • Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth • User Space, runs on heterogeneous OS
  • 41. Hadoop at Facebook • Production cluster – 4800 cores, 600 machines, 16GB per machine – April 2009 – 8000 cores, 1000 machines, 32 GB per machine – July 2009 – 4 SATA disks of 1 TB each per machine – 2 level network hierarchy, 40 machines per rack – Total cluster size is 2 PB, projected to be 12 PB in Q3 2009 • Test cluster • 800 cores, 16GB each
  • 42. Hadoop Architecture [Diagram: input data is split into DFS blocks (Block 1, Block 2, Block 3), each block replicated across the Hadoop cluster; MAP tasks process the blocks where they are stored and a Reduce step combines their outputs into the results.]
  • 44. Map/Reduce Processes  Launching Application – User application code – Submits a specific kind of Map/Reduce job  JobTracker – Handles all jobs – Makes all scheduling decisions  TaskTracker – Manager for all tasks on a given node  Task – Runs an individual map or reduce fragment for a given job – Forks from the TaskTracker
  • 45. -cont’d  Hadoop Map/Reduce – Goals: • Process large data sets • Cope with hardware failure • High throughput
  • 46. Hadoop Map-Reduce Architecture  Master-Slave architecture  Map-Reduce Master “Jobtracker” – Accepts MR jobs submitted by users – Assigns Map and Reduce tasks to Tasktrackers – Monitors task and tasktracker status, re-executes tasks upon failure  Map-Reduce Slaves “Tasktrackers” – Run Map and Reduce tasks upon instruction from the Jobtracker – Manage storage and transmission of intermediate output
  • 48. NameNode Metadata • Metadata in memory – The entire metadata is held in main memory – No demand paging of metadata • Types of metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor • A transaction log – Records file creations, file deletions, etc.
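To make the metadata categories above concrete, here is a purely illustrative Python sketch (not how the NameNode is actually implemented); all names are hypothetical.

import time

# Conceptual sketch: NameNode-style metadata held entirely in memory,
# plus an append-only transaction log of namespace changes.
class NameNodeMetadata:
    def __init__(self):
        self.blocks_of_file = {}      # file path -> list of block IDs
        self.datanodes_of_block = {}  # block ID -> list of DataNode hostnames
        self.file_attrs = {}          # file path -> {"ctime": ..., "replication": ...}
        self.edit_log = []            # records file creations, deletions, etc.

    def create_file(self, path, replication=3):
        self.blocks_of_file[path] = []
        self.file_attrs[path] = {"ctime": time.time(), "replication": replication}
        self.edit_log.append(("CREATE", path))

    def add_block(self, path, block_id, datanodes):
        self.blocks_of_file[path].append(block_id)
        self.datanodes_of_block[block_id] = list(datanodes)

    def delete_file(self, path):
        for block_id in self.blocks_of_file.pop(path, []):
            self.datanodes_of_block.pop(block_id, None)
        self.file_attrs.pop(path, None)
        self.edit_log.append(("DELETE", path))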
  • 49. DataNode • A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to Clients • Block Report – Periodically sends a report of all existing blocks to the NameNode • Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 50. Block Placement • Current Strategy -- One replica on local node -- Second replica on a remote rack -- Third replica on same remote rack -- Additional replicas are randomly placed • Clients read from nearest replica • Would like to make this policy pluggable
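A minimal sketch of the placement strategy listed above (writer's node, then a remote rack, then the same remote rack, then random), assuming a simple rack-to-nodes map; this is an illustration, not HDFS's actual placement code.

import random

def place_replicas(local_node, local_rack, racks, replication=3):
    """racks: dict of rack name -> list of node names."""
    replicas = [local_node]                               # 1st replica on the local node
    remote_racks = [r for r in racks if r != local_rack] or [local_rack]
    remote_rack = random.choice(remote_racks)
    candidates = [n for n in racks[remote_rack] if n not in replicas]
    if candidates:
        replicas.append(random.choice(candidates))        # 2nd replica on a remote rack
    candidates = [n for n in racks[remote_rack] if n not in replicas]
    if replication > 2 and candidates:
        replicas.append(random.choice(candidates))        # 3rd replica on the same remote rack
    leftovers = [n for nodes in racks.values() for n in nodes if n not in replicas]
    while len(replicas) < replication and leftovers:
        replicas.append(leftovers.pop(random.randrange(len(leftovers))))  # extras placed randomly
    return replicas

# e.g. place_replicas("node1", "rackA", {"rackA": ["node1", "node2"], "rackB": ["node3", "node4"]})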
  • 51. Data Correctness • Use checksums to validate data – Uses CRC32 • File creation – Client computes a checksum per 512 bytes – DataNode stores the checksums • File access – Client retrieves the data and checksums from the DataNode – If validation fails, the client tries other replicas
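A minimal sketch of the checksum scheme described above: one CRC32 per 512-byte chunk computed at write time and re-verified at read time. The function names are illustrative, not HDFS's API.

import zlib

CHUNK = 512  # each checksum covers 512 bytes of data

def compute_checksums(data):
    # computed by the client on write and stored by the DataNode
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, checksums):
    # on read the client recomputes and compares; a mismatch means this
    # replica is corrupt, so the client falls back to another replica
    return compute_checksums(data) == checksums

block = b"some block contents " * 100
stored = compute_checksums(block)
assert verify(block, stored)                       # clean read
assert not verify(block[:-1] + b"X", stored)       # corruption detected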
  • 52. NameNode Failure • A single point of failure • Transaction Log stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) • Need to develop a real HA solution
  • 53. Data Pipelining • Client retrieves a list of DataNodes on which to place replicas of a block • Client writes block to the first DataNode • The first DataNode forwards the data to the next DataNode in the Pipeline • When all replicas are written, the Client moves on to write the next block in file
  • 54. Data Flow [Diagram: the data flow spans Web Servers, Scribe Servers, Network Storage, the Hadoop Cluster, Oracle RAC, and MySQL.]
  • 55. Example for MapReduce  Page 1: the weather is good  Page 2: today is good  Page 3: good weather is good.
  • 56. Reduce Input  Worker 1:  (the 1)  Worker 2:  (is 1), (is 1), (is 1)  Worker 3:  (weather 1), (weather 1)  Worker 4:  (today 1)  Worker 5:  (good 1), (good 1), (good 1), (good 1)
  • 57. Reduce Output  Worker 1:  (the 1)  Worker 2:  (is 3)  Worker 3:  (weather 2)  Worker 4:  (today 1)  Worker 5:  (good 4)
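The three pages above are the classic word-count example. Below is a minimal Hadoop Streaming style sketch in Python showing what the map and reduce sides might look like; the function names and two-script layout are illustrative assumptions, and each function would normally live in its own script invoked by the streaming jar.

import sys
from itertools import groupby

def wc_mapper():
    # emits one "word<TAB>1" line per word, e.g. "good\t1"
    for line in sys.stdin:
        for token in line.lower().split():
            word = token.strip(".,;:!?")
            if word:
                sys.stdout.write(f"{word}\t1\n")

def wc_reducer():
    # input arrives sorted by word, so counts for the same word are adjacent
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        sys.stdout.write(f"{word}\t{total}\n")

Run against the three pages above, the reducer's output matches slide 57: (good 4), (is 3), (the 1), (today 1), (weather 2).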
  • 59. Hadoop Web Interface • MapReduce Job Tracker Web Interface The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running). By default, it's available at http://localhost:50030/ • Task Tracker Web Interface The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files. By default, it's available at http://localhost:50060/ • HDFS Name Node Web Interface The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files. By default, it's available at http://localhost:50070/
  • 61. HBASE HBase is a database: the Hadoop database. It is indexed by row key, column key, and timestamp.  HBase stores structured and semistructured data naturally so you can load it with tweets and parsed log files and a catalog of all your products right along with their customer reviews. It can store unstructured data too, as long as it‟s not too large HBase is designed to run on a cluster of computers instead of a single computer. The cluster can be built using commodity hardware; HBase scales horizontally as you add more machines to the cluster.
  • 62. HBASE (Contd…) Each node in the cluster provides a bit of storage, a bit of cache, and a bit of computation as well. This makes HBase incredibly flexible and forgiving. No node is unique, so if one of those machines breaks down, you simply replace it with another.  This adds up to a powerful, scalable approach to data that, until now, hasn't been commonly available to mere mortals.
  • 63. HBASE DATA MODEL: Six concepts form the foundation of HBase. Table: HBase organizes data into tables. Table names are Strings composed of characters that are safe for use in a file system path. Row:  Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys don't have a data type and are always treated as a byte[]. Column family:  Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase.  For this reason, they must be defined up front and aren't easily modified. Every row in a table has the same column families, although a row need not store data in all of its families. Column family names are Strings composed of characters that are safe for use in a file system path.
  • 64.  Column qualifier:  Data within a column family is addressed via its column qualifier, or column. Column qualifiers need not be specified in advance. Column qualifiers need not be consistent between rows.  Like row keys, column qualifiers don't have a data type and are always treated as a byte[].  Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value. Values also don't have a data type and are always treated as a byte[].  Version:  Values within a cell are versioned. Versions are identified by their timestamp, a long. When a version isn't specified, the current timestamp is used as the basis for the operation. The number of cell value versions retained by HBase is configured via the column family. The default number of cell versions is three.
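As an illustration of the table/row/family/qualifier/version model above, here is a sketch using the third-party Python client happybase; the deck itself does not prescribe a client (the Java API or the HBase shell are equally valid), and the table and column names are hypothetical.

import happybase

connection = happybase.Connection('localhost')   # assumes an HBase Thrift gateway is running
table = connection.table('reviews')              # assumes table 'reviews' with family 'info' exists

# row key + "family:qualifier" -> value; keys and values are plain bytes
table.put(b'product-001', {b'info:title': b'Widget', b'info:rating': b'4'})

latest = table.row(b'product-001')               # most recent version of each cell
print(latest[b'info:rating'])

# each cell keeps timestamped versions (three by default, configured per column family)
history = table.cells(b'product-001', b'info:rating', versions=3)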
  • 66. HBase Tables and Regions Table is made up of any number of regions. Region is specified by its startKey and endKey.  Empty table: (Table, NULL, NULL)  Two-region table: (Table, NULL, “com.ABC.www”) and (Table, “com.ABC.www”, NULL) Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
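To make the startKey/endKey idea concrete, here is a small illustrative sketch (not HBase code) that finds the region responsible for a row key, using the two-region example above.

import bisect

# each region covers [startKey, endKey); None stands for the NULL (unbounded) key
regions = [(None, "com.ABC.www"), ("com.ABC.www", None)]

def find_region(row_key, regions):
    start_keys = ["" if start is None else start for start, _ in regions]
    index = bisect.bisect_right(start_keys, row_key) - 1   # last region whose startKey <= row_key
    return regions[index]

print(find_region("com.ABB.mail", regions))   # -> (None, 'com.ABC.www')
print(find_region("com.XYZ.www", regions))    # -> ('com.ABC.www', None)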
  • 67. YARN
  • 68. Why Next Generation MR  Reliability  Availability  Scalability - Clusters of 10,000 machines and 200,000 cores, and beyond.  Backward (and Forward) Compatibility  Ensure customers’ MapReduce applications run unchanged in the next version of the framework.  Evolution – Ability for customers to control upgrades to the Hadoop software stack.  Predictable Latency – A major customer concern.  Cluster utilization
  • 69. Why Next Generation MR  Secondary Requirements –Support for alternate programming paradigms to MapReduce. –Support for short-lived services
  • 70. Re-Architecture • Need – Separate the tasks of the JobTracker • Resource management • Job scheduling / management
  • 71. So, What did we come up with • Resource Manager • Node Manager • Application Master • Container
  • 72. Resource Manager (RM) Manages the global assignment of compute resources to applications.
  • 73. Resource Manager (RM) • A pure Scheduler • No monitoring, tracking status of application • No guarantee on restarting failed tasks.
  • 74. Resource Manager (RM) • Each client/application may request multiple resources – Memory – Network – Cpu – Disk .. • This is a significant change from static Mapper / Reducer model
  • 75. Application Master • A per-application ApplicationMaster (AM) that manages the application's life cycle (scheduling and coordination). • An application is either a single job in the classic MapReduce sense or a DAG of such jobs.
  • 76. Application Master A per-application ApplicationMaster (AM) that manages the application's life cycle.
  • 77. Application Master • Application Master has the responsibility of – negotiating appropriate resource containers from the Scheduler – launching tasks – tracking their status – monitoring for progress – handling task-failures.
  • 78. Node Manager • The NodeManager is the per-machine framework agent – responsible for launching the applications‟ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the Scheduler.
  • 79. Gain with New Architecture • Scalability • Availability • Wire-compatibility • Innovation & Agility • Cluster Utilization • Support for programming paradigms other than MapReduce
  • 80. Gain with New Architecture • RM and Job manager segregated • The Hadoop MapReduce JobTracker spends a very significant portion of time and effort managing the life cycle of applications • Scalability • Availability • Wire-compatibility • Innovation & Agility • Cluster Utilization
  • 81. Gain with New Architecture • ResourceManager – Uses ZooKeeper for fail-over. – When the primary fails, the secondary can quickly start using the state stored in ZK • Application Master – MapReduce NextGen supports application-specific checkpoint capabilities for the ApplicationMaster. – The MapReduce ApplicationMaster can recover from failures by restoring itself from state saved in HDFS. • Scalability • Availability • Wire-compatibility • Innovation & Agility • Cluster Utilization
  • 82. Gain with New Architecture • MapReduce NextGen uses wire- compatible protocols to allow different versions of servers and clients to communicate with each other. • Rolling upgrades for the cluster in future. • Scalability • Availability • Wire-compatibility • Innovation & Agility • Cluster Utilization
  • 83. Gain with New Architecture • New framework is generic. – Can support non-MapReduce parallel computing techniques – Different versions of MR can run in parallel – End users can upgrade to new MR versions on their own schedule • Scalability • Availability • Wire-compatibility • Innovation & Agility • Cluster Utilization
  • 84. Gain with New Architecture • MRv2 uses a general concept of a resource for scheduling and allocation to individual applications. • A container can run a mapper, a reducer, or anything else. • The rigid notion of fixed Mapper/Reducer slots is abolished • Better cluster utilization • Scalability • Availability • Wire-compatibility • Innovation & Agility • Cluster Utilization
  • 87.  When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.  Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.
  • 88. SQOOP
  • 89. SQOOP  Data import/export.  Sqoop is a tool designed to help users import data from existing relational databases into their Hadoop clusters.  Automatic data import.  Easy import of data from many databases into Hadoop.  Generates code for use in MapReduce applications. Source: Big Data Analytics with Hadoop
  • 90.  Sqoop is a tool designed to transfer data between Hadoop and relational databases.  You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. What is Sqoop?
  • 91. SQOOP [Diagram: data moves between an RDBMS and HDFS via Sqoop.]
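A sketch of a typical Sqoop import, wrapped in Python only to stay consistent with the other examples in this deck; the JDBC URL, table name, and target directory are hypothetical, while the flags shown (--connect, --username, -P, --table, --target-dir, -m) are standard Sqoop options.

import subprocess

# import the 'customers' table from MySQL into HDFS using 4 parallel map tasks
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",   # hypothetical source database
    "--username", "etl_user", "-P",            # -P prompts for the password
    "--table", "customers",
    "--target-dir", "/data/customers",         # destination directory in HDFS
    "-m", "4",
], check=True)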
  • 92. HIVE
  • 93. HIVE  Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. – ETL. – Structure. – Access to different storage. – Query execution via MapReduce.  While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.  Key Building Principles: – SQL is a familiar language – Extensibility – Types, Functions, Formats, Scripts – Performance
  • 95. Hive, Why? • Need a Multi Petabyte Warehouse • Files are insufficient data abstractions – Need tables, schemas, partitions, indices • SQL is highly popular • Need for an open data format – RDBMS have a closed data format – flexible schema • Hive is a Hadoop subproject!
  • 96. Hadoop & Hive History • Dec 2004 – Google GFS paper published • July 2005 – Nutch uses MapReduce • Feb 2006 – Becomes Lucene subproject • Apr 2007 – Yahoo! on 1000-node cluster • Jan 2008 – An Apache Top Level Project • Jul 2008 – A 4000 node test cluster • Sept 2008 – Hive becomes a Hadoop subproject
  • 98. Data model •Hive structures data into well-understood database concepts such as tables, rows, columns, and partitions •It supports primitive types: integers, floats, doubles, and strings •Hive also supports: –associative arrays: map<key-type, value-type> –lists: list<element-type> –structs: struct<field name: field type, …> •SerDe: the serialize/deserialize API used to move data in and out of tables
  • 99. Query Language (HiveQL) • Subset of SQL • Metadata queries • Limited equality and join predicates • No inserts into existing tables (to preserve the write-once, read-many (WORM) property) – Can overwrite an entire table
  • 100. Hive - DDL  Create table hive> CREATE TABLE customer (age INT, address STRING);  Partitions hive> CREATE TABLE customer (age INT, address STRING) PARTITIONED BY ( sdate STRING) ;  Show table hive> SHOW TABLES ;  Describe table hive> DESCRIBE customer;
  • 101. Hive - DDL  Alter table hive> ALTER TABLE customer ADD COLUMNS ( age INT) ;  Drop table hive> DROP TABLE customer;
  • 102. HiveQL Examples HiveQL, an SQL-like language hive> SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15'; selects data from the table for a partition but doesn't store it hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file' SELECT a.* FROM customer a WHERE a.sdate = '2008-08-15'; writes all of the customer table to an HDFS directory
  • 103. Wordcount in Hive
FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
  • 104. Hive Usage in Facebook • Hive and Hadoop are used extensively at Facebook for different kinds of operations. • 700 TB = 2.1 petabytes after (3x) replication! • Think of other application models that can leverage Hadoop MR.
  • 105. Hive – Related Projects  Apache Flume – move large data sets into Hadoop  Apache Sqoop – command-line tool, move RDBMS data to Hadoop  Apache HBase – non-relational database  Apache Pig – analyse large data sets  Apache Oozie – workflow scheduler  Apache Mahout – machine learning and data mining  Apache Hue – Hadoop user interface  Apache ZooKeeper – centralized configuration / coordination
  • 106. PIG
  • 107. Introduction • What is Pig? – An open-source high-level dataflow system – Provides a simple language for queries and data manipulation, Pig Latin, that is compiled into map-reduce jobs that are run on Hadoop – Pig Latin combines the high-level data manipulation constructs of SQL with the procedural programming of map-reduce • Why is it important? – Companies and organizations like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click streams, search logs, and web crawls – Some form of ad-hoc processing and analysis of all of this information is required
  • 108. Existing Solutions • Parallel database products (ex: Teradata) – Expensive at web scale – Data analysis programmers find the declarative SQL queries to be unnatural and restrictive • Raw map-reduce – Complex n-stage dataflows are not supported; joins and related tasks require workarounds or custom implementations – Resulting code is difficult to reuse and maintain; shifts focus and attention away from data analysis
  • 109. Language Features • Several options for user-interaction – Interactive mode (console) – Batch mode (prepared script files containing Pig Latin commands) – Embedded mode (execute Pig Latin commands within a Java program) • Built primarily for scan-centric workloads and read-only data analysis – Easily operates on both structured and schema-less, unstructured data – Transactional consistency and index-based lookups not required – Data curation and schema management can be overkill • Flexible, fully nested data model • Extensive UDF support – Currently must be written in Java – Can be written for filtering, grouping, per-tuple processing, loading and storing
  • 110. Pig Latin vs. SQL • Pig Latin is procedural (dataflow programming model) – Step-by-step query style is much cleaner and easier to write and follow than trying to wrap everything into a single block of SQL Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
  • 111. Pig Latin vs. SQL (continued) • Lazy evaluation (data not processed prior to STORE command) • Data can be stored at any point during the pipeline • An execution plan can be explicitly defined – No need to rely on the system to choose the desired plan via optimizer hints • Pipeline splits are supported – SQL requires the join to be run twice or materialized as an intermediate result Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
  • 112. Data Model • Supports four basic types – Atom: a simple atomic value (int, long, double, string) • ex: „Peter‟ – Tuple: a sequence of fields that can be any of the data types • ex: („Peter‟, 14) – Bag: a collection of tuples of potentially varying structures, can contain duplicates • ex: {(„Peter‟), („Bob‟, (14, 21))} – Map: an associative array, the key must be a chararray but the value can be any type
  • 113. Data Model (continued) • By default Pig treats undeclared fields as bytearrays (collections of uninterpreted bytes) • Can infer a field's type based on: – Use of operators that expect a certain type of field – UDFs with a known or explicitly set return type – Schema information provided by a LOAD function or explicitly declared using an AS clause • Type conversion is lazy
  • 114. Pig problem • Fragment-replicate, skewed, and merge joins • The user has to know when to use which join • Because… "Pig is a domestic animal, it does whatever you tell it to do." - Alan Gates Images from http://wiki.apache.org/pig/PigTalksPapers
  • 115. HUE
  • 116. Hue – What is it ?  Hue = Hadoop User Experience  Hue is an open-source Web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license.  Its main goal is to have the users "just use" Hadoop without worrying about the underlying complexity or using a command line  An open source Hadoop GUI  Developed by Cloudera  Web based  Many functions
  • 117. Hue – Why ???  It is widely used  It ships with Hadoop  It integrates with Hadoop tools i.e.  Hive  Oozie  HDFS  It has an API for app creation
  • 118. Hue Features  HDFS file browser  Job browser / designer  Hive / Pig query editor  Oozie app for work flows  Has Hadoop API  Access to shell  User Admin  App for Solr searches
  • 124. FLUME
  • 125. What is Apache Flume? ● It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them to where they have to be processed. ● Goals: reliability, scalability, extensibility, manageability. Exactly what I needed!
  • 126. The Flume Model: Flows and Nodes ● A flow corresponds to a type of data source (server logs, machine monitoring metrics...). ● Flows are comprised of nodes chained together.
  • 127. The Flume Model: Flows and Nodes ● In a node, data comes in through a source... ...is optionally processed by one or more decorators... ...and is then transmitted out via a sink. Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes... Sink examples: Console, local files, HDFS, S3, other nodes... Decorator examples: wire batching, compression, sampling, projection, extraction...
  • 128. ● Agent: receives data from an application. ● Processor (optional): intermediate processing. ● Collector: write data to permanent storage. The Flume Model: Agent, Processor and Collector Nodes
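A conceptual sketch of the source -> decorators -> sink chain described above, modelled with plain Python generators; this illustrates the flow model only, it is not Flume's API, and all names are hypothetical.

def tail_source():
    # stand-in for a real source such as tailing a web-server log
    yield from ["GET /index 200", "GET /admin 403", "GET /index 200"]

def sampling_decorator(events, keep_every=2):
    # an optional decorator, e.g. sampling or compression, applied in flight
    for i, event in enumerate(events):
        if i % keep_every == 0:
            yield event

def console_sink(events):
    # a collector node's sink would write to HDFS or S3 instead of the console
    for event in events:
        print("collected:", event)

# agent (source) -> processor (decorator) -> collector (sink)
console_sink(sampling_decorator(tail_source()))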
  • 129. The Flume Model: Data and Control Path (1/2) Nodes are in the data path.
  • 130. The Flume Model: Data and Control Path (2/2) Masters are in the control path. ● Centralized point of configuration; multiple masters are coordinated via ZooKeeper (ZK). ● Specify sources, sinks and control data flows.
  • 131. Flume Goals: Reliability Tunable Failure Recovery Modes ● Best Effort ● Store on Failure and Retry ● End to End Reliability
  • 132. Flume Goals: Scalability Horizontally Scalable Data Path Load Balancing
  • 133. Flume Goals: Scalability Horizontally Scalable Control Path
  • 134. Flume Goals: Extensibility  Simple Source and Sink API  Event streaming and composition of simple operation  Plug in Architecture  Add your own sources, sinks, decorators
  • 135. Conclusion Flume is ● Distributed data collection service ● Suitable for enterprise setting ● Large amount of log data to process
  • 136. Conclusion  Big data is here to stay; it is impossible to imagine the next generation of applications without them consuming data, producing new forms of data, and containing data-driven algorithms.  As compute environments become cheaper, application environments become networked over the cloud, so security, access control, compression and encryption introduce challenges that have to be addressed in a systematic manner.
  • 137. References
[1] Chris Eaton, Dirk deRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, pp. 3-49.
[2] Mike Barlow, Real-time Data Analytics: Emerging Architecture, February 2013, first edition, pp. 1-21.
[3] Sachidanand Singh, Nirmala Singh, Big Data Analytics, 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai, India.
[4] Big Data Introduction, www.youtube.com/watch?v=e6kovHZ6FVc
[5] Hadoop Video, www.youtube.com/watch?v=OoEpfbbyga8
[6] Cloud Security Alliance, Big Data Security and Privacy Issues, November 2012.
[7] http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
[8] http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html
[9] http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related
[10] http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related
[11] http://labs.google.com/papers/gfs-sosp2003.pdf
[12] http://hadoop.apache.org/core/docs/current/hdfs_design.html
[13] http://hadoop.apache.org/core/docs/current/api/
[14] http://hadoop.apache.org/hive/
[15] http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing
[16] http://www.slideshare.net/cloudera/inside-flume
[17] http://www.slideshare.net/cloudera/flume-intro100715
[18] http://www.slideshare.net/cloudera/flume-austin-hug-21711