2. CONTENTS (1/2)
Big Data Definition
Areas of Challenges
Big Data Attributes
Big Data Sources
Sample Events Generating Data
New Tools for Big Data
Big Data Applications
Getting Value from Big Data
Big Data Security
Comparing Hadoop with RDBMS
Hadoop
4. WHAT IS BIG DATA?
BIG DATA is data so large that it becomes difficult to process
using traditional systems.
SOURCE: PLANNING FOR BIG DATA, EDD DUMBILL, PP. 1-4
6. DIFFICULT TO PROCESS BY TRADITIONAL SYSTEMS
A 200 MB document, a 150 GB image, a 200 TB video: each can become
impossible to send, view, or edit.
It depends on the capabilities of the system.
7. ORGANIZATION SPECIFIC
500 TB of text, audio, and video data per day may be BIG DATA for
Company 1 but NOT big data for Company 2.
It depends on the capabilities of the organization.
13. DATA GENERATION POINT
EXAMPLES
MOBILE DEVICES
MICROPHONES
READERS/SCANNERS
CAMERAS
MACHINE SENSORS
SOCIAL MEDIA
PROGRAMS/SOFTWARE
SCIENCE FACILITIES
14. SAMPLE DATA TYPES
VIDEOS
AUDIO
IMAGES
PHOTOS
LOGS
CLICK TRAILS
TEXT MESSAGES
EMAILS
DOCUMENTS
BOOKS
TRANSACTIONS
PUBLIC RECORDS
15. SAMPLE EVENTS GENERATING DATA
1) Airbus:
An Airbus generates 10 TB every 30 minutes.
About 640 TB is generated in one flight.
2) Smart Meters:
A smart meter reads usage every 15 minutes.
Smart meters record 350 billion transactions every year.
In 2009, there were 76 million smart meters.
By 2014, there will be 200 million smart meters.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
16. 3) Camera Phones:
There are 5 million camera phones worldwide.
Most of them have location awareness (GPS).
22% of them are smartphones.
By the end of 2013, the number of smartphones will exceed the
number of PCs.
4) Internet Users:
2+ billion people use the Internet.
Cisco estimates that by 2014 Internet traffic will reach 4.8
zettabytes per year.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
17. 5) Blogs:
There are 200 billion blog entries in the world.
6) Emails:
300 million emails are sent every day.
7) RFID:
In 2005, there were around 1.5 million RFID tags.
In 2012, there are 30 million RFID tags.
Walmart has played a major role in RFID adoption.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
18. 8) Facebook:
Facebook generates 25 TB of data daily.
9) Twitter:
Twitter generates 12 TB of data daily.
200 million users generate 230 million tweets daily.
97,000 tweets are sent every second.
10) Trading:
The NYSE produces 1 TB per trading day.
11) Experiments:
The CERN particle physics facility generates 40 TB per second.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
19. SAMPLE EVENTS GENERATING DATA
Big Data:
In 2009, the world's total data was estimated to be 1 ZB.
By 2020, it is estimated to reach 35 ZB.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
20. New Tools For Big Data
[Diagram: over time, traditional systems (e.g., RDBMS) became
unable to handle big data; big data tools (e.g., Hadoop) were
created to handle it.]
21. Big Data Applications
Companies gain an edge by collecting, analyzing, and
understanding information.
Governments forecast events and take proactive action.
23. Big Data Security Issues
Security and privacy issues are magnified by the "V" attributes:
Velocity
Volume
Variety
Traditional security mechanisms, which are tailored to securing
small-scale static data, are inadequate.
SOURCE: CLOUD SECURITY ALLIANCE
24. Top Five Security Challenges
1) Secure Computation in Distributed Programming Frameworks:
Distributed programming frameworks use parallelism in computation
and storage to process massive amounts of data.
Example: the MapReduce framework (see the sketch below):
Splits input files into multiple chunks.
Each chunk is read by a mapper, which outputs key/value pairs.
The reducer combines the values belonging to each distinct key and
outputs the result.
OPPORTUNITY 1: Two major prevention measures arise:
1) Securing the mappers
2) Securing the data in the presence of untrusted mappers
SOURCE: CLOUD SECURITY ALLIANCE
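To make the mapper/reducer roles above concrete, here is a minimal
Java sketch of the standard MapReduce word-count example (class
names are illustrative, not from the source):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in its input chunk.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts belonging to each distinct word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}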
25. 2) Input Validation/Filtering
Input validation:
What kind of data is untrusted?
What are the untrusted data sources?
Data filtering:
Filter out rogue or malicious data (see the sketch below).
Challenges/opportunities:
Gigabytes or terabytes of continuous data
Signature-based data filtering has limitations
SOURCE: CLOUD SECURITY ALLIANCE
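As one non-signature-based approach, a mapper can structurally
validate each record before passing it on. A minimal Java sketch;
the three-field CSV rule here is purely hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Drops records that fail a simple structural check.
public class ValidatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        // Hypothetical rule: exactly 3 comma-separated fields, numeric first field.
        if (fields.length == 3 && fields[0].matches("\\d+")) {
            context.write(value, NullWritable.get());
        } else {
            // Count rejected (potentially rogue) records in a job counter.
            context.getCounter("validation", "rejected").increment(1);
        }
    }
}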
26. 3) Secure Data Storage
With data at various nodes, authentication, authorization, and
encryption are challenging.
Auto-tiering moves cold data onto less secure media.
o What if the cold data is sensitive?
Auto-tiering does not keep track of where the data is stored
(a new challenge).
Encryption of real-time data can have a performance impact.
Challenges/opportunities:
24/7 availability of data
Unauthorized access
SOURCE: CLOUD SECURITY ALLIANCE
27. 4) Privacy Concerns in Data Mining
Sharing of results involves multiple challenges:
o Invasion of privacy
o Invasive marketing
o Unintentional disclosure of information
Example: data held by companies and government agencies is
constantly mined and analyzed by inside analysts and, potentially,
by outside contractors.
Challenges/opportunity: robust and scalable privacy-preserving
mining algorithms
SOURCE: CLOUD SECURITY ALLIANCE
28. 5) Cryptographically Enforced Access Control
and Secure Communication
To keep private data secure end to end, it must be accessible
only to authorized entities. Hence, cryptographically enforced
access control has to be implemented.
Challenges/opportunities: the main problem with encrypting data,
especially large data sets, is the all-or-nothing retrieval
policy, which prevents users from easily searching or sharing
data.
SOURCE: CLOUD SECURITY ALLIANCE
30. Comparing Hadoop with RDBMS
Until recently, many applications used relational database
management systems (RDBMS) for batch processing:
- Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
- Hadoop doesn't fully replace relational products; many
architectures would benefit from both Hadoop and relational
product(s).
Scale-out vs. scale-up
- RDBMS products scale up:
Expensive to scale for large installations.
Hits a ceiling when storage reaches hundreds of terabytes.
- Hadoop clusters scale out to hundreds of machines and petabytes
of storage.
31. Comparing Hadoop with RDBMS (contd.)
Structured (relational) vs. semi-structured vs. unstructured
- RDBMS works well for structured data: tables that conform to a
predefined schema.
- Hadoop works best on semi-structured and unstructured data.
Semi-structured data may have a schema that is loosely followed.
Unstructured data has no structure whatsoever and is usually
blocks of text (or, for example, images).
At processing time, the types for keys and values are chosen by
the implementer.
- Certain types of input data, such as JSON and XML, will not
easily fit into a relational schema.
32. Comparing Hadoop with RDBMS (contd.)
Offline batch vs. online transactions
- Hadoop was not designed for real-time, low-latency queries.
- Products that do provide low-latency queries, such as HBase,
have limited query functionality.
- Hadoop performs best for offline batch processing on large
amounts of data.
- RDBMS is best for online transactions and low-latency queries.
- Hadoop is designed to stream large files and large amounts of
data.
- RDBMS works best with small records.
34. What's Hadoop?
A framework for running applications on large clusters of
commodity hardware
Scale: petabytes of data on thousands of nodes
Includes
Storage: HDFS
Processing: MapReduce
Supports the Map/Reduce programming model
Requirements
Economy: use clusters of commodity computers
Easy to use
Users: no need to deal with the complexity of distributed
computing
Reliable: can handle node failures automatically
35. What's Hadoop? (contd.)
Hadoop is a software platform that lets one easily write
and run applications that process vast amounts of data.
Here's what makes Hadoop especially useful:
Scalable
Economical
Efficient
Reliable
36. Hadoop, Why?
Need to process multi-petabyte datasets
Expensive to build reliability into each application.
Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need a common infrastructure
– Efficient, reliable, open source (Apache License)
These goals are the same as Condor's, but
Hadoop workloads are I/O bound, not CPU bound
37. Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
39. HDFS
Hadoop implements MapReduce using the Hadoop Distributed File
System (HDFS). MapReduce divides applications into many small
blocks of work. HDFS creates multiple replicas of data blocks for
reliability, placing them on compute nodes around the cluster.
MapReduce can then process the data where it is located.
Hadoop has been demonstrated on clusters with 2,000 nodes. The
current design target is 10,000-node clusters.
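From a client's point of view, HDFS looks like an ordinary file
system. A minimal Java read sketch using the standard FileSystem
API (the path is hypothetical; the configuration is picked up from
the cluster's site files):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // connects to the configured HDFS
        Path file = new Path("/data/sample.txt");  // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}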
40. Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to
where data resides
– Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
41. Hadoop at Facebook
• Production cluster
– 4800 cores, 600 machines, 16GB per machine – April 2009
– 8000 cores, 1000 machines, 32 GB per machine – July
2009
– 4 SATA disks of 1 TB each per machine
– 2 level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q3
2009
• Test cluster
– 800 cores, 16 GB per machine
42. Hadoop Architecture
[Diagram: input data is split into DFS blocks, each block
replicated three times across the Hadoop cluster; MAP tasks
process the blocks where they reside, and a Reduce task combines
their output into the results.]
44. Map/Reduce Processes
Launching Application
– User application code
– Submits a specific kind of Map/Reduce job
JobTracker
– Handles all jobs
– Makes all scheduling decisions
TaskTracker
– Manager for all tasks on a given node
Task
– Runs an individual map or reduce fragment for a
given job
– Forks from the TaskTracker
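A minimal job-launch sketch using the classic
org.apache.hadoop.mapred API that this JobTracker/TaskTracker
model exposes (mapper/reducer classes are omitted; input/output
paths come from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(...); conf.setReducerClass(...);  // your classes here
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);  // submits to the JobTracker and waits for completion
    }
}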
45. Map/Reduce Processes (cont'd)
Hadoop Map/Reduce – Goals:
• Process large data sets
• Cope with hardware failure
• High throughput
46. Hadoop Map-Reduce Architecture
Master-Slave architecture
Map-Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and tasktracker status, re-executes tasks upon
failure
Map-Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction from the Jobtracker
– Manage storage and transmission of intermediate output
48. NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
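This metadata is visible to clients through the standard
FileSystem API. A small sketch (hypothetical path) that prints a
file's attributes and the DataNode locations of each of its
blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical
        System.out.println("replication = " + st.getReplication());
        System.out.println("mtime       = " + st.getModificationTime());
        // Which DataNodes hold each block of the file:
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(loc);  // offset, length, and host list per block
        }
    }
}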
49. DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the
NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
50. Block Placement
• Current Strategy
-- One replica on local node
-- Second replica on a remote rack
-- Third replica on the same remote rack
-- Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
51. Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File Access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
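A standalone Java sketch of the same per-512-byte CRC32 scheme,
for illustration only (HDFS performs the equivalent internally):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.CRC32;

public class ChunkChecksums {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] chunk = new byte[512];  // checksum every 512 bytes, as in HDFS
            CRC32 crc = new CRC32();
            int n;
            long offset = 0;
            while ((n = in.read(chunk)) > 0) {
                crc.reset();
                crc.update(chunk, 0, n);
                System.out.printf("offset %d: crc32 = %08x%n", offset, crc.getValue());
                offset += n;
            }
        }
    }
}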
52. NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
53. Data Pipelining
• Client retrieves a list of DataNodes on which to place
replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode
in the Pipeline
• When all replicas are written, the client moves on to write
the next block in the file
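From the client's side this pipeline is hidden behind an ordinary
output stream. A minimal write sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the NameNode for target DataNodes; the stream then
        // pushes each block down the replication pipeline described above.
        try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) { // hypothetical
            out.writeUTF("hello hdfs\n");
        }
    }
}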
59. Hadoop Web Interface
• MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of
the Hadoop cluster, running/completed/failed jobs and a job history log file.
It also gives access to the local machine's Hadoop log files (the
machine on which the web UI is running).
By default, it's available at http://localhost:50030/
• Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also
gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50060/
• HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information
about total/remaining capacity, live and dead nodes. Additionally, it allows
you to browse the HDFS namespace and view the contents of its files in the
web browser. It also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/
61. HBASE
HBase is a database: the Hadoop database. It is indexed by row key,
column key, and timestamp.
HBase stores structured and semistructured data naturally so you can
load it with tweets and parsed log files and a catalog of all your products
right along with their customer reviews.
It can store unstructured data too, as long as it's not too large.
HBase is designed to run on a cluster of computers instead of a single
computer. The cluster can be built using commodity hardware; HBase
scales horizontally as you add more machines to the cluster.
62. HBASE (Contd…)
Each node in the cluster provides a bit of storage, a bit of cache,
and a bit of computation as well.
This makes HBase incredibly flexible and forgiving. No node is
unique, so if one of those machines breaks down, you simply
replace it with another.
This adds up to a powerful, scalable approach to data that, until
now, hasn't been commonly available to mere mortals.
63. HBASE DATA MODEL
HBase data model: these six concepts form the foundation of HBase.
Table:
HBase organizes data into tables. Table names are strings composed
of characters that are safe for use in a file system path.
Row:
Within a table, data is stored according to its row. Rows are
identified uniquely by their row key. Row keys don't have a data
type and are always treated as a byte[].
Column family:
Data within a row is grouped by column family. Column families
also impact the physical arrangement of data stored in HBase.
For this reason, they must be defined up front and aren't easily
modified. Every row in a table has the same column families,
although a row need not store data in all its families. Column
family names are strings composed of characters that are safe for
use in a file system path.
64. Column qualifier:
Data within a column family is addressed via its column qualifier,
or column. Column qualifiers need not be specified in advance, and
need not be consistent between rows. Like row keys, column
qualifiers don't have a data type and are always treated as a
byte[].
Cell:
A combination of row key, column family, and column qualifier
uniquely identifies a cell. The data stored in a cell is referred
to as that cell's value. Values also don't have a data type and
are always treated as a byte[].
Version:
Values within a cell are versioned. Versions are identified by
their timestamp, a long. When a version isn't specified, the
current timestamp is used as the basis for the operation. The
number of cell value versions retained by HBase is configured via
the column family. The default number of cell versions is three.
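A minimal sketch of these concepts with the classic (pre-1.0)
HBase Java client; the table "users", family "info", and qualifier
"name" are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");  // hypothetical table
        // Write one cell: (row key, family, qualifier) -> value; all byte[].
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Peter"));
        table.put(put);
        // Read it back; the latest version is returned unless a timestamp is given.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}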
66. HBase Tables and Regions
A table is made up of any number of regions.
A region is specified by its startKey and endKey:
Empty table: (Table, NULL, NULL)
Two-region table: (Table, NULL, "com.ABC.www") and
(Table, "com.ABC.www", NULL)
Each region may live on a different node and is made up of several
HDFS files and blocks, each of which is replicated by Hadoop.
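Because rows are sorted by key, a client can scan exactly the key
range a region holds. A sketch with the classic client API (the
table "webtable" and the keys are illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScan {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "webtable"); // hypothetical
        // A [startRow, stopRow) scan touches only the regions whose key
        // ranges overlap the interval.
        Scan scan = new Scan(Bytes.toBytes("com.ABC.www"), Bytes.toBytes("com.ABD"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();
        table.close();
    }
}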
68. Why Next Generation MR
Reliability
Availability
Scalability - Clusters of 10,000 machines and 200,000
cores, and beyond.
Backward (and Forward) Compatibility
Ensure customers’ MapReduce applications run
unchanged in the next version of the framework.
Evolution – Ability for customers to control upgrades to
the Hadoop software stack.
Predictable Latency – A major customer concern.
Cluster utilization
69. Why Next Generation MR
Secondary Requirements
–Support for alternate programming
paradigms to MapReduce.
–Support for short-lived services
73. Resource Manager (RM)
• A pure Scheduler
• No monitoring or tracking of
application status
• No guarantee on restarting
failed tasks.
74. Resource Manager (RM)
• Each client/application may
request multiple resources
– Memory
– Network
– CPU
– Disk, etc.
• This is a significant change
from static Mapper /
Reducer model
75. Application Master
• A per-application
ApplicationMaster (AM) that
manages the application's life
cycle (scheduling and
coordination).
• An application is either a single
job in the classic MapReduce
sense or a DAG of such jobs.
77. Application Master
• Application Master has the
responsibility of
– negotiating appropriate resource
containers from the Scheduler
– launching tasks
– tracking their status
– monitoring for progress
– handling task failures.
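A highly simplified sketch of that negotiation, assuming the
Hadoop 2.x AMRMClient Java API (host, port, and resource sizes
below are placeholders, and error handling is omitted):

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");  // host, RPC port, tracking URL
        // Negotiate one container from the Scheduler: 1024 MB, 1 vcore.
        Resource capability = Resource.newInstance(1024, 1);
        rm.addContainerRequest(new ContainerRequest(capability, null, null,
                Priority.newInstance(0)));
        rm.allocate(0.1f);  // heartbeat; granted containers arrive in the response
        // ... launch tasks in granted containers, track status, handle failures ...
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rm.stop();
    }
}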
78. Node Manager
• The NodeManager is the per-machine
framework agent
– responsible for launching the
applications' containers,
monitoring their resource usage
(CPU, memory, disk, network), and
reporting the same to the
Scheduler.
79. Gain with New Architecture
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
• Support for programming paradigms other than MapReduce
80. Gain with New Architecture
• ResourceManager and job management segregated.
• The Hadoop MapReduce JobTracker
spends a very significant portion
of its time and effort managing the
life cycle of applications.
81. Gain with New Architecture
• ResourceManager
– Uses ZooKeeper for fail-over.
– When the primary fails, the
secondary can quickly start using
the state stored in ZooKeeper.
• ApplicationMaster
– MapReduce NextGen supports
application-specific checkpoint
capabilities for the
ApplicationMaster.
– The MapReduce ApplicationMaster
can recover from failures by
restoring itself from state saved
in HDFS.
82. Gain with New Architecture
• MapReduce NextGen uses
wire-compatible protocols to allow
different versions of servers and
clients to communicate with each
other.
• Rolling upgrades for the cluster
in the future.
83. Gain with New Architecture
• The new framework is generic:
– Can support non-MapReduce
parallel computing techniques
– Different versions of MR can run
in parallel
– End users can upgrade to new MR
versions on their own schedule
84. Gain with New Architecture
• MRv2 uses a general concept of a
resource for scheduling and
allocating to individual
applications.
• A container can be a mapper, a
reducer, or ... ?
• The rigid notion of dedicated
mapper and reducer slots is
abolished.
• Better cluster utilization.
87. When Hadoop 1.0.0 was released by Apache in 2011, comprising
mainly HDFS and MapReduce, it soon became clear that Hadoop
was not simply another application or service, but a platform
around which an entire ecosystem of capabilities could be built.
Since then, dozens of self-standing software projects have sprung
into being around Hadoop, each addressing a variety of problem
spaces and meeting different needs.
Many of these projects were begun by the same people or
companies who were the major developers and early users of
Hadoop; others were initiated by commercial Hadoop distributors.
The majority of these projects now share a home with Hadoop at
the Apache Software Foundation, which supports open-source
software development and encourages the development of the
communities surrounding these projects.
89. SQOOP
Data import/export.
Sqoop is a tool designed to help
users import data from existing
relational databases into their
Hadoop clusters.
Automatic data import.
Easily imports data from many
databases into Hadoop.
Generates code for use in MapReduce
applications.
Source: Big Data Analytics with Hadoop
90. What is Sqoop?
Sqoop is a tool designed to transfer data between Hadoop and
relational databases.
You can use Sqoop to import data from a relational database
management system (RDBMS) such as MySQL or Oracle into the Hadoop
Distributed File System (HDFS), transform the data in Hadoop
MapReduce, and then export the data back into an RDBMS.
93. HIVE
Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.
While initially developed by Facebook, Apache Hive is now
used and developed by other companies such as Netflix.
Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance
95. Hive, Why?
• Need a Multi Petabyte Warehouse
• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMS have a closed data format
– flexible schema
• Hive is a Hadoop subproject!
96. Hadoop & Hive History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
98. Data Model
• Hive structures data into well-understood database concepts
such as tables, rows, columns, and partitions
• It supports primitive types: integers, floats, doubles, and
strings
• Hive also supports:
– associative arrays: map<key-type, value-type>
– lists: list<element-type>
– structs: struct<field-name: field-type, ...>
• SerDe: the serialize/deserialize API used to move data in and
out of tables
99. Query Language (HiveQL)
• Subset of SQL
• Meta-data queries
• Limited equality and join predicates
• No inserts into existing tables (to preserve the
write-once-read-many (WORM) property)
– Can overwrite an entire table
101. Hive - DDL
Alter table
hive> ALTER TABLE customer ADD COLUMNS ( age INT) ;
Drop table
hive> DROP TABLE customer;
102. HiveQL Examples
HiveQL is an SQL-like language.
hive> SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15';
Selects data from the customer table for one partition but doesn't
store it.
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file'
SELECT a.* FROM customer a WHERE a.sdate = '2008-08-15';
Writes the matching rows of the customer table to an HDFS
directory.
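The same kind of query can also be issued from Java. A sketch
assuming a HiveServer2 endpoint at localhost:10000 (the table and
connection details are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // Hive compiles the query into one or more MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15'");
        while (rs.next()) {
            System.out.println(rs.getInt(1));
        }
        conn.close();
    }
}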
103. Wordcount in Hive
FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
104. Hive Usage in Facebook
• Hive and Hadoop are used extensively at Facebook for many
different kinds of operations.
• 700 TB = 2.1 PB after 3x replication!
• Think of other application models that can leverage Hadoop MR.
105. Hive – Related Projects
Apache Flume – moves large data sets into Hadoop
Apache Sqoop – command-line tool to move RDBMS data into Hadoop
Apache HBase – non-relational database
Apache Pig – analyzes large data sets
Apache Oozie – workflow scheduler
Apache Mahout – machine learning and data mining
Hue – Hadoop user interface
Apache ZooKeeper – distributed coordination
107. Introduction
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data
manipulation, Pig Latin, that is compiled into map-reduce
jobs that are run on Hadoop
– Pig Latin combines the high-level data manipulation
constructs of SQL with the procedural programming of
map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and
Microsoft are collecting enormous data sets in the form of
click streams, search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
108. Existing Solutions
• Parallel database products (ex: Teradata)
– Expensive at web scale
– Data analysis programmers find the declarative SQL
queries to be unnatural and restrictive
• Raw map-reduce
– Complex n-stage dataflows are not supported; joins
and related tasks require workarounds or custom
implementations
– Resulting code is difficult to reuse and maintain; shifts
focus and attention away from data analysis
109. Language Features
• Several options for user-interaction
– Interactive mode (console)
– Batch mode (prepared script files containing Pig Latin commands)
– Embedded mode (execute Pig Latin commands within a Java program)
• Built primarily for scan-centric workloads and read-only data
analysis
– Easily operates on both structured and schema-less, unstructured data
– Transactional consistency and index-based lookups not required
– Data curation and schema management can be overkill
• Flexible, fully nested data model
• Extensive UDF support
– Currently must be written in Java
– Can be written for filtering, grouping, per-tuple processing, loading and
storing
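Embedded mode in practice: a sketch using Pig's PigServer class to
run a small word-count pipeline from Java (file names are
illustrative; ExecType.LOCAL avoids needing a cluster):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs in-process; ExecType.MAPREDUCE runs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Register Pig Latin statements; nothing executes until store/dump.
        pig.registerQuery("raw = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH raw GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcounts");  // triggers compilation and execution
    }
}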
110. Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of
SQL
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
111. Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
– No need to rely on the system to choose the desired plan via optimizer hints
• Pipeline splits are supported
– SQL requires the join to be run twice or materialized as an intermediate result
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
112. Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: 'Peter'
– Tuple: a sequence of fields that can be any of the data types
• ex: ('Peter', 14)
– Bag: a collection of tuples of potentially varying structures;
can contain duplicates
• ex: {('Peter'), ('Bob', (14, 21))}
– Map: an associative array; the key must be a chararray but
the value can be any type
113. Data Model (continued)
• By default Pig treats undeclared fields as bytearrays
(collection of uninterpreted bytes)
• Can infer a field's type based on:
– Use of operators that expect a certain type of field
– UDFs with a known or explicitly set return type
– Schema information provided by a LOAD function
or explicitly declared using an AS clause
• Type conversion is lazy
114. Pig problem
• Fragment-replicate; skewed; merge join
• User has to know when to use which join
• Because... Pig is a domestic
animal; it does whatever
you tell it to do.
- Alan Gates
Images from http://wiki.apache.org/pig/PigTalksPapers
116. Hue – What is it ?
Hue = Hadoop User Experience
Hue is an open-source Web interface that supports Apache
Hadoop and its ecosystem, licensed under the Apache v2 license.
Its main goal is to let users "just use" Hadoop without worrying
about the underlying complexity or using a command line.
An open source Hadoop GUI
Developed by Cloudera
Web based
Many functions
117. Hue – Why ???
It is widely used
It ships with Hadoop
It integrates with Hadoop tools, e.g.:
Hive
Oozie
HDFS
It has an API for app creation
118. Hue Features
HDFS file browser
Job browser / designer
Hive / Pig query editor
Oozie app for work flows
Has Hadoop API
Access to shell
User Admin
App for Solr searches
125. What is Apache Flume?
● It is a distributed data collection service that gets
flows of data (like logs) from their sources and
aggregates them to where they have to be processed.
● Goals: reliability, scalability, extensibility,
manageability.
Exactly what I needed!
126. The Flume Model: Flows and Nodes
● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are comprised of nodes chained together.
127. The Flume Model: Flows and Nodes
● In a node, data comes in through a source...
...is optionally processed by one or more decorators...
...and is then transmitted out via a sink.
Source examples: console, exec, syslog, IRC, Twitter, other
nodes...
Decorator examples: wire batching, compression, sampling,
projection, extraction...
Sink examples: console, local files, HDFS, S3, other nodes...
128. The Flume Model: Agent, Processor and Collector Nodes
● Agent: receives data from an application.
● Processor (optional): intermediate processing.
● Collector: writes data to permanent storage.
129. The Flume Model: Data and Control Path (1/2)
Nodes are in the data path.
130. The Flume Model: Data and Control Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters coordinate
via ZooKeeper.
● Specify sources, sinks, and control data flows.
134. Flume Goals: Extensibility
Simple Source and Sink API
Event streaming and composition of simple
operations
Plug-in architecture
Add your own sources, sinks, and decorators
136. Conclusion
Big data is here to stay. It is impossible to imagine the next
generation without it: consuming data, producing new forms of
data, and relying on data-driven algorithms.
As compute environments become cheaper, application environments
become networked over the cloud, so security, access control,
compression, and encryption introduce challenges that have to be
addressed in a systematic manner.
137. References
[1] Chris Eaton, Dirk Deroos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, pp. 3-49.
[2] Mike Barlow, Real-Time Big Data Analytics: Emerging Architecture, First Edition, February
2013, pp. 1-21.
[3] Sachidanand Singh, Nirmala Singh, "Big Data Analytics," 2012 International Conference on
Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai,
India.
[4] Big Data Introduction, www.youtube.com/watch?v=e6kovHZ6FVc
[5] Hadoop Video, www.youtube.com/watch?v=OoEpfbbyga8
[6] Cloud Security Alliance, Big Data Security and Privacy Issues, November 2012.
[7] http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
[8] http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html
[9] http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related
[10] http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related
[11] http://labs.google.com/papers/gfs-sosp2003.pdf
[12] http://hadoop.apache.org/core/docs/current/hdfs_design.html
[13] http://hadoop.apache.org/core/docs/current/api/
[14] http://hadoop.apache.org/hive/
[15] http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing
[16] http://www.slideshare.net/cloudera/inside-flume
[17] http://www.slideshare.net/cloudera/flume-intro100715
[18] http://www.slideshare.net/cloudera/flume-austin-hug-21711