© Hortonworks Inc. 2013
Chris Harris
Twitter : cj_harris5
E-mail : charris@hortonworks.com
Page 1
Introduction to Big Data with
Hadoop
© Hortonworks Inc. 2013
What is Big Data?
Page 2
© Hortonworks Inc. 2013
Web giants proved the ROI of data products by applying data science to large amounts of data
Page 3
Amazon: 35% of product sales come from product recommendations
Netflix: 75% of streaming video results from recommendations
Prediction of click-through rates
© Hortonworks Inc. 2013
Data science is a natural next step after
business intelligence
Page 4
(Diagram: value increases moving from Refine to Extract to Enrich, and from Business Intelligence to Data Science.)
Business Intelligence: measure & count; simple analytics – dashboards, reports, score-cards
Data Science: discovery & prediction; complex analytics; “data product” – affinity analysis, outlier detection, clustering, recommendation, regression, classification
© Hortonworks Inc. 2013
Key use-cases in Finance/Insurance
• Customer risk profiling:
–How likely is this customer to pay back his mortgage?
–How likely is this customer to get sick?
• Fraud detection:
–Detect illegal credit card activity and alert bank/consumer
–Detect illegal insurance claims
• Internal fraud detection (compliance):
–Is this employee accessing financial information they are not
allowed to access?
Page 5
© Hortonworks Inc. 2013
Key use-cases in Telco/Mobile
• Customer life-time-value prediction
–What is the LTV for customer X?
• Marketing
–Which new mobile phone should we offer to customer X so that they
remain with us?
–Location based advertising
• Failure prediction
–When will equipment X in cell tower Y fail?
• Cell Tower Management
–Predict load and bandwidth on cell towers to optimize network
Page 6
© Hortonworks Inc. 2013
Key use-cases in Healthcare
• Clinical Decision Support:
–What is the ideal treatment for this patient?
• Cost management:
–What is the expected overall cost of treatment for this patient over the life of the disease?
• Diagnostics:
–Given these test results, what is the likelihood of cancer?
• Epidemic management
–Predict size and location of epidemic spread
Page 7
© Hortonworks Inc. 2013
What is Hadoop?
Page 8
© Hortonworks Inc. 2013
A Brief History of Apache Hadoop
Page 9
Timeline, 2004–2013:
2005: Yahoo! creates a team under E14 to work on Hadoop
2006: Apache project established
2008: Yahoo! team extends its focus to operations to support multiple projects & growing clusters – Yahoo! begins to operate at scale (focus on OPERATIONS)
2011: Hortonworks created to focus on “Enterprise Hadoop”; starts with 24 key Hadoop engineers from Yahoo (focus on STABILITY)
2012: Hortonworks Data Platform – Enterprise Hadoop
2013: Focus on INNOVATION
© Hortonworks Inc. 2013
Leadership that Starts at the Core
Page 10
• Driving next generation Hadoop
– YARN, MapReduce2, HDFS2, High
Availability, Disaster Recovery
• 420k+ lines authored since 2006
– More than twice nearest contributor
• Deeply integrating w/ecosystem
– Enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA)
– Creating deeply engineered solutions (ex. Teradata big data appliance)
• All Apache, NO holdbacks
– 100% of code contributed to Apache
© Hortonworks Inc. 2013
Operational Data Refinery
Page 11
Collect data and apply a known algorithm to it in a trusted operational process.
1. Capture – capture all data
2. Process – parse, cleanse, apply structure & transform
3. Exchange – push to the existing data warehouse for use with existing analytic tools
(Diagram: data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media) is refined in Hadoop, exchanged with traditional repos (RDBMS, EDW, MPP), and consumed by business analytics, custom applications and enterprise applications.)
© Hortonworks Inc. 2013
Key Capability in Hadoop: Late binding
Page 12
With traditional ETL, structure must be agreed upon far in advance and is difficult to change: sources such as web logs, click streams, machine-generated data and OLTP flow through an ETL server that stores transformed data in the data mart / EDW for client apps.
With Hadoop, you capture all data and structure it as business needs evolve: the same sources land in Hortonworks HDP (data services, operational services, Hadoop core), and transformations are applied dynamically on the way to the data mart / EDW and client apps.
© Hortonworks Inc. 2013
Big Data Exploration & Visualization
Page 13
Collect data and perform iterative investigation for value.
1. Capture – capture all data
2. Process – parse, cleanse, apply structure & transform
3. Exchange – explore and visualize with analytics tools supporting Hadoop
(Diagram: data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media) flows through Hadoop alongside traditional repos (RDBMS, EDW, MPP) and out to business analytics, custom applications and enterprise applications.)
© Hortonworks Inc. 2013
Visualization Tooling
• Robust visualization and business tooling
• Ensures scalability when working with large datasets
Page 14
Native Excel support
Web browser support
Mobile support
© Hortonworks Inc. 2013
Application Enrichment
Page 15
Collect data, analyze and present salient results for online apps.
1. Capture – capture all data
2. Process – parse, cleanse, apply structure & transform
3. Exchange – incorporate data directly into applications
(Diagram: data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media) flows through Hadoop into traditional repos (RDBMS, EDW, MPP) and NoSQL stores, and on to custom and enterprise applications.)
© Hortonworks Inc. 2013
Web giants proved the ROI of data products by applying data science to large amounts of data
Page 16
Amazon: 35% of product sales come from product recommendations
Netflix: 75% of streaming video results from recommendations
Prediction of click-through rates
© Hortonworks Inc. 2013
Interoperating With Your Tools
Page 17
(Diagram: HDP interoperates with your existing tools across every layer – applications such as Microsoft applications, dev & data tools and operational tools (e.g. Viewpoint); data systems such as traditional repos; and data sources ranging from mobile data and OLTP/POS systems to traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media).)
© Hortonworks Inc. 2013
Deep Dive on Hadoop Components
Page 18
© Hortonworks Inc. 2013
Enhancing the Core of Apache Hadoop
Page 19
(Diagram: the Hadoop Core – HDFS, MapReduce and YARN (in 2.0) – sits on Platform Services for enterprise readiness.)
Deliver high-scale storage & processing with enterprise-ready platform services.
Unique Focus Areas:
• Bigger, faster, more flexible
Continued focus on speed & scale and
enabling near-real-time apps
• Tested & certified at scale
Run ~1300 system tests on large Yahoo
clusters for every release
• Enterprise-ready services
High availability, disaster recovery,
snapshots, security, …
© Hortonworks Inc. 2013
Page 20
Data Services for Full Data Lifecycle
(Diagram: Data Services – WebHDFS, HCatalog, Hive, Pig, HBase, Sqoop, Flume – sit on the Hadoop Core (distributed storage & processing) and Platform Services (enterprise readiness).)
Provide data services to store, process & access data in many ways
Unique Focus Areas:
• Apache HCatalog
Metadata services for consistent table
access to Hadoop data
• Apache Hive
Explore & process Hadoop data via SQL &
ODBC-compliant BI tools
• Apache HBase
NoSQL database for Hadoop
• WebHDFS
Access Hadoop files via a scalable REST API (see the sketch after this list)
• Talend Open Studio for Big Data
Graphical data integration tools
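The WebHDFS bullet above refers to the REST API exposed by the NameNode. A minimal Java sketch, not from the deck, assuming WebHDFS is enabled and reachable on the default NameNode HTTP port of this Hadoop era (50070); the host name and the directory path are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, port and path – adjust to your cluster.
        // LISTSTATUS returns a JSON description of the directory contents.
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/brian?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // raw JSON FileStatuses response
            }
        } finally {
            conn.disconnect();
        }
    }
}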
© Hortonworks Inc. 2013
Operational Services for Ease of Use
Page 23
(Diagram: Operational Services – Oozie, Ambari – sit on Data Services (store, process and access data), the Hadoop Core (distributed storage & processing) and Platform Services (enterprise readiness).)
Include complete operational services for productive operations & management
Unique Focus Area:
• Apache Ambari:
Provision, manage & monitor a cluster;
complete REST APIs to integrate with
existing operational tools; job & task
visualizer to diagnose issues
© Hortonworks Inc. 2013
Getting Started
Page 26
© Hortonworks Inc. 2013
Hortonworks Process for Enterprise Hadoop
Page 27
Upstream Community Projects → Downstream Enterprise Product
• Upstream: design & develop, test & patch, and release across Apache Hadoop, Apache Pig, Apache Hive, Apache HBase, Apache HCatalog, Apache Ambari and other Apache projects (stable project releases, fixed issues)
• Downstream: integrate & test, package & certify, and distribute the Hortonworks Data Platform
No Lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
Virtuous cycle when development & fixed issues are done upstream & stable project releases flow downstream
© Hortonworks Inc. 2013
Deployable Across a Range of Options: OS, Cloud, VM, Appliance
Page 28
(Diagram: the Hortonworks Data Platform (HDP) stack – Operational Services (manage & operate at scale), Data Services (store, process and access data), Hadoop Core (distributed storage & processing) and Platform Services (enterprise readiness) – runs on any of these options.)
Only Hortonworks allows you to deploy seamlessly across any deployment option:
• Linux & Windows
• Azure, Rackspace & other clouds
• Virtual platforms
• Big data appliances
© Hortonworks Inc. 2013
Refine-Explore-Enrich Demo
Page 29
Hands-on tutorials integrated into the Sandbox – an HDP environment for evaluation
The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop – no data center, no cloud and no internet connection needed!
The Hortonworks Sandbox is:
• A free download: http://hortonworks.com/products/hortonworks-sandbox/
• A complete, self-contained virtual machine with Apache Hadoop pre-configured
• A personal, portable and standalone Hadoop environment
• A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
© Hortonworks Inc. 2013
Hortonworks & Microsoft
Page 30
HDInsight
• Big Data Insight for Millions, Massive
expansion of Hadoop
• Simplifies Hadoop, Enterprise Ready
• Hortonworks Data Platform used for
Hadoop on Windows Server and Azure
• An engineered, open source solution
– Hadoop engineered for Windows
– Hadoop powered Microsoft business tools
– Ops integration with MS System Center
– Bidirectional connectors for SQL Server
– Support for Hyper-V, deploy Hadoop on VMs
– Opens the .NET developer community to Hadoop
– Javascript for Hadoop
– Deploy on Azure in 10 minutes
• Excel
• PowerPivot (BI)
• PowerView (visualization)
• SharePoint
© Hortonworks Inc. 2013
Useful Links
• Hortonworks Sandbox:
– http://hortonworks.com/products/hortonworks-sandbox
• HDInsight Service:
– ?
–User/PWD
• Sample Data:
Page 31
© Hortonworks Inc. 2013
Hadoop 1 hour Workshop
Page 32
© Hortonworks Inc. 2013
Useful Links
• Hortonworks Sandbox:
– http://hortonworks.com/products/hortonworks-sandbox
• HDInsight Service:
– ?
–User/PWD
• Sample Data:
Page 33
© Hortonworks Inc. 2013
Working with HDFS
© Hortonworks Inc. 2013
What is HDFS?
• Stands for Hadoop Distributed File System
• Primary storage system for Hadoop
• Fast and reliable
• Deployed only on Linux (as of May 2012)
–Active work underway on Hadoop for Windows
© Hortonworks Inc. 2013
HDFS Characteristics (Cont.)
• Write once and read many times
• Files are append-only
• Data stored in blocks
–Distributed over many nodes
–Block sizes often range from 128MB to 1GB
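As a concrete illustration of these characteristics, the sketch below writes a file through the HDFS Java API with an explicit replication factor and block size. This is a minimal sketch, not from the deck: the path is hypothetical, and it assumes the Hadoop client libraries and a valid cluster configuration are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask for 3 replicas and a 128 MB block size for this particular file
        // (cluster-wide defaults normally come from hdfs-site.xml).
        FSDataOutputStream out = fs.create(
                new Path("/user/brian/big.dat"),   // hypothetical path
                true,                              // overwrite if it exists
                4096,                              // I/O buffer size in bytes
                (short) 3,                         // replication factor
                128L * 1024 * 1024);               // block size in bytes
        out.writeUTF("hello HDFS");                // writes append; no in-place updates
        out.close();
        fs.close();
    }
}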
© Hortonworks Inc. 2013
HDFS Architecture
© Hortonworks Inc. 2013
HDFS Architecture
(Diagram: the NameNode holds the namespace and block map and handles block management; a Secondary NameNode checkpoints the namespace metadata image and edit journal log as a backup; DataNodes each hold a set of replicated blocks, e.g. BL1, BL2, BL6, BL7.)
© Hortonworks Inc. 2013
Data Organization
• Metadata
–organized into files and directories
–Linux-like permissions prevent accidental deletions
• Files
–divided into uniform sized blocks
–default 64 MB
–distributed across clusters
• Rack-aware
• Keeps block checksums
–for corruption detection
© Hortonworks Inc. 2013
HDFS Cluster
• HDFS runs in Hadoop distributed mode – on a cluster
• 3 main components:
– NameNode: manages DataNodes; keeps metadata for all nodes & blocks
– DataNodes: hold data blocks; live on racks
– Client: talks directly to the NameNode, then to DataNodes
© Hortonworks Inc. 2013
NameNode
• Server running the namenode daemon
–Responsible for coordinating datanodes
• The Master of the DataNodes
• Problems
–Lots of overhead in being the Master
–Should be a dedicated server for performance
–Single point of failure
© Hortonworks Inc. 2013
DataNode
• A common node running a datanode daemon
• Slave
•Manages block reads/writes for HDFS
•Manages block replication
• Sends heartbeats to the NameNode and gets instructions back
• If a heartbeat fails
–the NameNode removes the node from the cluster
–replicated blocks on other nodes take over
© Hortonworks Inc. 2013
HDFS Heartbeats
(Diagram: each DataNode daemon sends HDFS heartbeats to the NameNode, which maintains the fsimage and editlog. A heartbeat says: “I’m datanode X, and I’m OK; I do have some new information for you: the new blocks are …”)
© Hortonworks Inc. 2013
HDFS Commands
© Hortonworks Inc. 2012
Basic HDFS File System Commands
$ hadoop fs -command <args>
Here are a few (of the almost 30) HDFS commands:
-cat: just like Unix cat – displays file content (uncompressed)
-text: just like cat – but works on compressed files
-chgrp, -chmod, -chown: just like the Unix commands, change permissions
-put, -get, -copyFromLocal, -copyToLocal: copy files from the local file system to HDFS and vice-versa (two versions)
-ls, -lsr: just like Unix ls, list files/directories
-mv, -moveFromLocal, -moveToLocal: move files
-stat: statistical info for any given file (block size, number of blocks, file type, etc.)
© Hortonworks Inc. 2012 Page 47
Commands Example
$ hadoop fs -ls /user/brian/
$ hadoop fs -lsr
$ hadoop fs -mkdir notes
$ hadoop fs -put ~/training/commands.txt notes
$ hadoop fs -chmod 777 notes/commands.txt
$ hadoop fs -cat notes/commands.txt | more
$ hadoop fs -rm notes/*.txt
© Hortonworks Inc. 2012 Page 48
Uploading Files into HDFS
$ hadoop fs -put filenameSrc filenameDest
$ hadoop fs -put filename dirName/fileName
$ hadoop fs -put foo bar
$ hadoop fs -put foo dirName/fileName
$ hadoop fs -lsr dirName
© Hortonworks Inc. 2012 Page 49
Retrieving Files
Note: Another name for the -get command is -copyToLocal
$ hadoop fs -cat foo
$ hadoop fs -get foo LocalFoo
$ hadoop fs -rmr directory|file
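The shell commands above have direct equivalents in the org.apache.hadoop.fs.FileSystem Java API. A minimal sketch (not from the deck), assuming the Hadoop client jars and a valid cluster configuration on the classpath; the local and HDFS paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -put ~/training/commands.txt notes
        fs.copyFromLocalFile(new Path("/home/brian/training/commands.txt"),
                             new Path("notes/commands.txt"));

        // Equivalent of: hadoop fs -ls notes
        for (FileStatus status : fs.listStatus(new Path("notes"))) {
            System.out.println(status.getPath() + " " + status.getLen());
        }

        // Equivalent of: hadoop fs -get notes/commands.txt LocalFoo
        fs.copyToLocalFile(new Path("notes/commands.txt"), new Path("/tmp/LocalFoo"));

        fs.close();
    }
}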
© Hortonworks Inc. 2013
How MapReduce Works
© Hortonworks Inc. 2012 Page 51
Basic MapReduce Architecture
(Diagram: on each node an InputFormat feeds the Map phase; a Partitioner routes the sorted map output to the Reduce phase, and an OutputFormat writes the results. The whole flow runs on the Distributed File System (HDFS).)
© Hortonworks Inc. 2012 Page 52
Simple MapReduce
(Diagram: input key|value pairs flow into Map, which emits intermediate key|value pairs; Reduce consolidates them into the result – some kind of collection of key|value pairs.)
© Hortonworks Inc. 2012 Page 53
InputFormat
• Determines how the data is split up
• Creates an InputSplit[] array
– Each split is handled by an individual map task
– Associated with a list of destination nodes
• RecordReader
– Produces key,value pairs
– Converts data types
© Hortonworks Inc. 2012 Page 55
Partitioner
• Distributes intermediate key,value pairs
• Decides the target Reducer for each pair
– Uses the key to decide
– Uses a hash function by default
– Can be custom
Reduce Phase
getPartition(K2 key, V2 value, int numPartitions)
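A minimal sketch of a custom partitioner against the classic mapred API used by the signature above; the class name WordPartitioner is hypothetical, and it simply reproduces the default hash-by-key behaviour rather than adding new logic. It would be registered on the job with conf.setPartitionerClass(WordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each intermediate (key, value) pair to a reducer by hashing the key,
// which is essentially what the default HashPartitioner does.
public class WordPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is always in [0, numPartitions)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf job) {
        // No configuration needed for this simple partitioner
    }
}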
© Hortonworks Inc. 2012 Page 56
The Shuffle
(Diagram: each Reduce task pulls its partition of the output of every Map task over HTTP – the shuffle.)
Reduce Phase
© Hortonworks Inc. 2012 Page 57
Sort
• Guarantees sorted inputs to Reducers
• Final step of Shuffle
• Helps to merge Reducer inputs
Reduce Phase
© Hortonworks Inc. 2012 Page 58
Reduce
• Receives output from many Mappers
• Consolidates values for common
intermediate keys
• Groups values by key
reduce(K2 key, Iterator<V2> values,
OutputCollector<K3,V3> output,
Reporter reporter)
© Hortonworks Inc. 2012 Page 59
OutputFormat
• Validates the output specs
• Sets up a RecordWriter
– Which writes out to HDFS
– Organizes output into part-0000x files
© Hortonworks Inc. 2012 Page 60
A Basic MapReduce Job
map() implemented
// Reusable Writable objects: the count "1" and the current word
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// Called once per input line; the key is the byte offset, the value is the line
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
    output, Reporter reporter) throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);   // emit (word, 1) for every token
  }
}
© Hortonworks Inc. 2012 Page 61
A Basic MapReduce Job
reduce() implemented
// Reusable Writable for the summed count
private final IntWritable totalCount = new IntWritable();
// Called once per key with all the counts the mappers emitted for that word
public void reduce(Text key,
    Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
    output, Reporter reporter) throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  totalCount.set(sum);
  output.collect(key, totalCount);   // emit (word, total occurrences)
}
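To run the map() and reduce() above as a job, a driver along the following lines is needed. This is a minimal sketch using the classic mapred API implied by the OutputCollector/Reporter signatures; WordCountMap and WordCountReduce are assumed names for whatever classes hold the two methods, and the input/output paths come from the command line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Hypothetical classes wrapping the map() and reduce() shown above
        conf.setMapperClass(WordCountMap.class);
        conf.setReducerClass(WordCountReduce.class);

        // Intermediate and final key/value types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The InputFormat splits the input and its RecordReader produces (offset, line) pairs;
        // the OutputFormat writes the results into part-0000x files
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // submits the job and waits for completion
    }
}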
© Hortonworks Inc. 2012 Page 62
Another Use Case – Inverted Index
Map input (url | terms):
www.yahoo.com | news sports finance email celebrity
www.amazon.com | shoes books jeans
www.google.com | news finance email search
www.microsoft.com | operating-system productivity search
Reduce output (term | urls):
books | www.amazon.com www.target.com
celebrity | www.yahoo.com
email | www.google.com www.yahoo.com www.facebook.com
finance | www.yahoo.com www.google.com
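One possible shape for the map() and reduce() behind this inverted index, sketched in the same classic API style as the word count example; the pipe-separated parsing and the use of Text for both keys and values are assumptions about the input layout, not part of the original slide.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {

    // For a line like "www.yahoo.com | news sports finance", emit (term, url) pairs
    public static class IndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String[] parts = value.toString().split("\\|");
            if (parts.length != 2) return;                       // skip malformed lines
            Text url = new Text(parts[0].trim());
            for (String term : parts[1].trim().split("\\s+")) {
                output.collect(new Text(term), url);
            }
        }
    }

    // Gather every url seen for a term into one space-separated list
    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            StringBuilder urls = new StringBuilder();
            while (values.hasNext()) {
                if (urls.length() > 0) urls.append(' ');
                urls.append(values.next().toString());
            }
            output.collect(key, new Text(urls.toString()));
        }
    }
}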
© Hortonworks Inc. 2013
Pig
© Hortonworks Inc. 2012 Page 64
What is Pig?
• Pig is an extension of Hadoop that simplifies querying large HDFS datasets
• Pig is made up of two main components:
– A SQL-like data processing language called Pig Latin
– A compiler that compiles and runs Pig Latin scripts
• Pig was created at Yahoo! to make it easier to analyze the data in your HDFS without the complexities of writing a traditional MapReduce program
• With Pig, you can develop MapReduce jobs with a few lines of Pig Latin
© Hortonworks Inc. 2012 Page 65
Pig In The EcoSystem
• Pig runs on Hadoop, utilizing both HDFS and MapReduce
• By default, Pig reads and writes files from HDFS
• Pig stores intermediate data between MapReduce jobs
(Diagram: Pig sits alongside HCatalog and HBase on top of MapReduce and HDFS.)
© Hortonworks Inc. 2012 Page 66
Running Pig
A Pig Latin script executes in three modes:
1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (the default mode)
2. Local: the code executes locally in a single JVM using a local text file (for development purposes)
3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell
$ pig myscript.pig
$ pig -x local myscript.pig
$ pig
grunt>
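Beyond these three modes, Pig Latin can also be embedded in a Java program through the PigServer class. A minimal sketch, not from the deck, assuming the Pig jars are on the classpath and that the employee file used later in this deck ('pig/input/File1') exists; openIterator() plays the role of DUMP.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // LOCAL runs against the local file system; MAPREDUCE would submit to the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        pig.registerQuery("employees = LOAD 'pig/input/File1' USING PigStorage(',') "
                + "AS (name:chararray, age:int, zip:int, salary:double);");
        pig.registerQuery("newhires = FILTER employees BY age <= 21;");

        // openIterator() triggers execution of the plan, like DUMP in the Grunt shell
        Iterator<Tuple> it = pig.openIterator("newhires");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}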
© Hortonworks Inc. 2012 Page 67
Understanding Pig Execution
• Pig Latin is a data flow language
• During execution each statement is processed by the
Pig interpreter
• If a statement is valid, it gets added to a logical plan
built by the interpreter
• The steps in the logical plan do not actually execute
until a DUMP or STORE command
© Hortonworks Inc. 2012 Page 68
A Pig Example
• The first three commands are built into a logical plan
• The STORE command triggers the logical plan to be
built into a physical plan
• The physical plan will be executed as one or more
MapReduce jobs
logevents = LOAD 'input/my.log' AS (date, level, code, message);
severe = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped = GROUP severe BY code;
STORE grouped INTO 'output/severeevents';
© Hortonworks Inc. 2012 Page 69
An Interactive Pig Session
• Command line history and editing
• Tab will complete commands (but not
filenames)
• To exit enter quit
$ pig
grunt> A = LOAD 'myfile';
grunt> DUMP A;
(output appears here)
© Hortonworks Inc. 2012 Page 70
Pig Command Options
• To see a full listing enter:
pig -h or help
• Execute
-e or -execute
-f scriptname or -filename scriptname
• Specify a parameter setting
-p or -parameter
example: -p key1=value1 -p key2=value2
• List the properties that Pig will use if they are set by the user
-h properties
• Display the version
-version
© Hortonworks Inc. 2012 Page 71
Grunt – HDFS Commands
• Grunt acts as a shell to access HDFS
• Commands include:
fs -ls
fs -cat filename
fs -copyFromLocal localfile hdfsfile
fs -copyToLocal hdfsfile localfile
fs -rm filename
fs -mkdir dirname
fs -mv fromLocation/filename toLocation/filename
© Hortonworks Inc. 2012 Page 72
Pig’s Data Model
• 6 Scalar Types
– int, long, float, double, chararray, bytearray
• 3 Complex types
– Tuple: ordered set of values
• (F, 66, 41000, 95103)
– Bag: unordered collection of tuples
• { (F, 66, 41000, 95103), (M, 40, 14000, 95102) }
– Map: collection of key value pairs
• [name#Bob, age#34]
© Hortonworks Inc. 2012 Page 73
Relations
• Pig Latin statements work with relations
• A bag of tuples
• Similar to a table in a relational database, where the
tuples in the bag correspond to rows in a table
• Unlike rows in a table
– the tuples in a Pig relation do not have to contain the same
number of fields
– nor do the fields have to be the same data type
• Relation schemas are optional
© Hortonworks Inc. 2012 Page 74
Defining Relations
• The LOAD command loads data from a file into a relation. The syntax looks like:
alias = LOAD 'data';
where 'data' is either a filename or directory
• Use the AS option to define a schema for the relation:
alias = LOAD 'data' AS (name1:type,name2:type,...);
• TIP: Use the DESCRIBE command to view the schema of a relation:
DESCRIBE alias;
© Hortonworks Inc. 2012 Page 75
A Relation with a Schema
• Suppose we have the following data in HDFS:
Tom,21,94085,62000
John,45,95014,25000
Joe,21,94085,50000
Larry,45,95014,36000
Hans,21,94085,80000
• The data above represents the name, age, ZIP code
and salary of employees
© Hortonworks Inc. 2012 Page 76
A Relation with a Schema
• The following LOAD command defines a relation
named employees with a schema:
• The DESCRIBE command for employees outputs the
following:
employees = LOAD 'pig/input/File1'
USING PigStorage(',')
AS (name:chararray, age:int,
zip:int, salary:double);
describe employees;
employees: {name: chararray,age: int,zip:
int,salary: double}
© Hortonworks Inc. 2012 Page 77
Default Schema Datatype
• If not specified, the data type of a field in a relation
defaults to bytearray
• What will the data type be for each field in the
following relation?
employees = LOAD 'pig/input/File1'
USING PigStorage(',')
AS (name:chararray,
age,
zip:int,
salary);
© Hortonworks Inc. 2012 Page 78
Using a Schema
• Defining a schema allows you to refer to the values
of a relation by the name of the field in the schema
• Because we defined a schema for the employees
relation, the FILTER command can refer to the
second field in the relation by the name “age”
employees = LOAD 'pig/input/File1'
USING PigStorage(',')
AS (name:chararray,age:int,
zip:int,salary:double);
newhires = FILTER employees BY age <= 21;
© Hortonworks Inc. 2012 Page 79
Relations without a Schema
• If a relation does not define a schema, then Pig will
simply load the data anyway (because “pigs eat
anything”)
• The output of the above DESCRIBE command is:
employees = LOAD 'pig/input/File1'
USING PigStorage(',');
DESCRIBE employees;
Schema for employees unknown.
© Hortonworks Inc. 2012 Page 80
Relations without a Schema
• Without a schema, a field is referenced by its
position within the relation
• $0 is the first field, $1 is the second field, and so on
• The output of the above commands is:
employees = LOAD 'pig/input/File1'
USING PigStorage(',');
newhires = FILTER employees BY $1 <= 21;
DUMP newhires;
(Tom,21,94085,5000.0)
(Joe,21,94085,50000.0)
(Hans,21,94085,80000.0)
© Hortonworks Inc. 2012 Page 81
Nulls in Pig
• Data elements may be null
– Null means that the data element value is “undefined”
• In a LOAD command, null is automatically inserted for missing or invalid fields
• Example:
LOAD 'a.txt' USING PigStorage(',') AS (a1:int, a2:int);
This data:    Is loaded as:
1,2,3         (1,2)
5             (5, null)
6,bye         (6, null)
© Hortonworks Inc. 2012 Page 82
The GROUP Operator
• The GROUP operator groups together tuples based
on a specified key
• The usage for GROUP is:
x = GROUP alias BY expression
– alias = the name of the existing relation that you want to
group together
– expression = a tuple expression that is the key you want to
group by
• The result of the GROUP command is a new relation
© Hortonworks Inc. 2012 Page 83
A GROUP Example
• The output of DESCRIBE is:
employees = LOAD 'pig/input/File1'
USING PigStorage(',')
AS (name:chararray,age:int,
zip:int,salary:double);
a = GROUP employees BY salary;
DESCRIBE a;
a: {group: double, employees:
{(name: chararray,
age:int,
zip:int,
salary: double)
}
}
© Hortonworks Inc. 2012 Page 84
More GROUP Examples
daily = LOAD 'NYSE_daily' AS (exchange, stock, date, dividends);
grpd = GROUP daily BY (exchange, stock);
DESCRIBE grpd;
grpd: {group: (exchange:bytearray, stock:bytearray), daily: {(exchange:bytearray, stock:bytearray, date:bytearray, dividends:bytearray)}}
daily = LOAD 'NYSE_daily' AS (exchange, stock);
grpd = GROUP daily BY stock;
cnt = FOREACH grpd GENERATE group, COUNT(daily);
© Hortonworks Inc. 2012 Page 85
The JOIN Operator
• The JOIN operator performs an inner join on two or
more relations based on common field values
• The syntax for JOIN is:
x = JOIN alias BY expression, alias BY expression,…
– alias = an existing relation
– expression = a field of the relation
• The result of JOIN is a flat set of tuples
© Hortonworks Inc. 2012 Page 86
A JOIN Example
• Suppose we add another set of data that contains
the employee’s name and a phone number:
Tom,4085551211
Tom,6505550123
John,4085554332
Joe,4085559898
Joe,4085557777
© Hortonworks Inc. 2012 Page 87
A JOIN Example
• The output of the DESCRIBE above is:
e1 = LOAD 'pig/input/File1' USING PigStorage(',')
AS (name:chararray,age:int,
zip:int,salary:double);
e2 = LOAD 'pig/input/File2' USING PigStorage(',')
AS (name:chararray,phone:chararray);
e3 = JOIN e1 BY name, e2 BY name;
DESCRIBE e3;
e3: {e1::name:chararray, e1::age:int,
e1::zip:int,e1::salary:double,
e2::name:chararray,e2::phone:chararray}
© Hortonworks Inc. 2012 Page 88
A JOIN Example
• The JOIN output looks like:
grunt> DUMP e3;
(Joe,21,94085,50000.0,Joe,4085559898)
(Joe,21,94085,50000.0,Joe,4085557777)
(Tom,21,94085,5000.0,Tom,4085551211)
(Tom,21,94085,5000.0,Tom,6505550123)
(John,45,95014,25000.0,John,4085554332)
© Hortonworks Inc. 2012 Page 89
The FOREACH Operator
• The FOREACH operator transforms data into a new
relation based on the columns of the data
• The syntax looks like:
x = FOREACH alias GENERATE expression
– alias = an existing relation
– expression = an expression that determines the output
© Hortonworks Inc. 2012 Page 90
A FOREACH Example
• The output of this example is a bag:
e1 = LOAD 'pig/input/File1' USING PigStorage(',')
AS (name:chararray,age:int,
zip:int,salary:double);
f = FOREACH e1 GENERATE age,salary;
DESCRIBE f;
DUMP f;
f: {age:int, salary:double}
(21,5000.0)
(45,25000.0)
(21,50000.0)
(45,36000.0)
(21,80000.0)
© Hortonworks Inc. 2012 Page 91
Using FOREACH on Groups
e1 = LOAD 'pig/input/File1' USING PigStorage(',')
AS (name:chararray,age:int,
zip:int,salary:double);
g = GROUP e1 BY age;
DESCRIBE g;
g: {group: int,e1: {(name:chararray,
age:int, zip:int, salary:double)}}
f = FOREACH g GENERATE group, SUM(e1.salary);
DESCRIBE f;
f: {group: int,double}
DUMP f;
(21,135000.0)
(45,61000.0)
© Hortonworks Inc. 2012 Page 92
Pig Latin Structured Processing Flow
• Pig Latin script describes a directed acyclic graph (DAG)
– The edges are data flows and the nodes are operators that
process the data
(Diagram: two LOAD operators each feed a FOREACH, and the two branches meet in a JOIN.)
© Hortonworks Inc. 2013
Thank You!
Questions & Answers
Page 96

Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Dernier (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 


  • 11. © Hortonworks Inc. 2013 Operational Data Refinery Page 11 DATA SOURCES / DATA SYSTEMS / APPLICATIONS Refine, Explore, Enrich 1 Capture: capture all data 2 Process: parse, cleanse, apply structure & transform 3 Exchange: push to existing data warehouse for use with existing analytic tools Collect data and apply a known algorithm to it in a trusted operational process TRADITIONAL REPOS: RDBMS, EDW, MPP Business Analytics Custom Applications Enterprise Applications Traditional Sources (RDBMS, OLTP, OLAP) New Sources (web logs, email, sensor data, social media)
  • 12. © Hortonworks Inc. 2013 Key Capability in Hadoop: Late binding Page 12 DATA SERVICES OPERATIONAL SERVICES HORTONWORKS DATA PLATFORM HADOOP CORE WEB LOGS, CLICK STREAMS MACHINE GENERATED OLTP Data Mart / EDW Client Apps Dynamically Apply Transformations Hortonworks HDP With traditional ETL, structure must be agreed upon far in advance and is difficult to change. With Hadoop, capture all data, structure data as business needs evolve. WEB LOGS, CLICK STREAMS MACHINE GENERATED OLTP ETL Server Data Mart / EDW Client Apps Store Transformed Data
  • 13. © Hortonworks Inc. 2013 Big Data Exploration & Visualization Page 13 DATA SOURCES / DATA SYSTEMS / APPLICATIONS Refine, Explore, Enrich 1 Capture: capture all data 2 Process: parse, cleanse, apply structure & transform 3 Exchange: explore and visualize with analytics tools supporting Hadoop Collect data and perform iterative investigation for value TRADITIONAL REPOS: RDBMS, EDW, MPP Business Analytics Traditional Sources (RDBMS, OLTP, OLAP) New Sources (web logs, email, sensor data, social media) Custom Applications Enterprise Applications
  • 14. © Hortonworks Inc. 2013 Visualization Tooling • Robust visualization and business tooling • Ensures scalability when working with large datasets Page 14 Native Excel support Web browser support Mobile support
  • 15. © Hortonworks Inc. 2013 Application Enrichment Page 15 DATA SOURCES / DATA SYSTEMS / APPLICATIONS Refine, Explore, Enrich 1 Capture: capture all data 2 Process: parse, cleanse, apply structure & transform 3 Exchange: incorporate data directly into applications Collect data, analyze and present salient results for online apps TRADITIONAL REPOS: RDBMS, EDW, MPP Traditional Sources (RDBMS, OLTP, OLAP) New Sources (web logs, email, sensor data, social media) Custom Applications Enterprise Applications NOSQL
  • 16. © Hortonworks Inc. 2013 Web giants proved the ROI in data products applying data science to large amounts of data Page 16 Amazon: 35% of product sales come from product recommendations Netflix: 75% of streaming video results from recommendations Prediction of click through rates
  • 17. © Hortonworks Inc. 2013 Interoperating With Your Tools Page 17 APPLICATIONS / DATA SYSTEMS / DATA SOURCES TRADITIONAL REPOS DEV & DATA TOOLS OPERATIONAL TOOLS Viewpoint Microsoft Applications MOBILE DATA OLTP, POS SYSTEMS Traditional Sources (RDBMS, OLTP, OLAP) New Sources (web logs, email, sensor data, social media)
  • 18. © Hortonworks Inc. 2013 Deep Drive on Hadoop Components Page 18
  • 18. © Hortonworks Inc. 2013 Deep Dive on Hadoop Components Page 18
  • 20. © Hortonworks Inc. 2013 Page 20 HADOOP CORE DATA SERVICES Distributed Storage & Processing PLATFORM SERVICES Enterprise Readiness Data Services for Full Data Lifecycle WEBHDFS HCATALOG HIVEPIG HBASE SQOOP FLUME Provide data services to store, process & access data in many ways Unique Focus Areas: • Apache HCatalog Metadata services for consistent table access to Hadoop data • Apache Hive Explore & process Hadoop data via SQL & ODBC-compliant BI tools • Apache HBase NoSQL database for Hadoop • WebHDFS Access Hadoop files via scalable REST API • Talend Open Studio for Big Data Graphical data integration tools
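Because WebHDFS exposes HDFS over plain HTTP, any REST client can browse the file system. A minimal Java sketch, not taken from the deck, that lists a directory via the standard LISTSTATUS operation; the host name, port and path are illustrative:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // LISTSTATUS returns a JSON description of the directory contents
            URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/brian?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);   // raw FileStatuses JSON
            }
            in.close();
            conn.disconnect();
        }
    }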
  • 21. © Hortonworks Inc. 2013 Operational Services for Ease of Use Page 23 OPERATIONAL SERVICES DATA SERVICES Store, Process and Access Data HADOOP CORE Distributed Storage & Processing PLATFORM SERVICES Enterprise Readiness OOZIE AMBARI Include complete operational services for productive operations & management Unique Focus Area: • Apache Ambari: Provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues
  • 22. © Hortonworks Inc. 2013 Getting Started Page 26
  • 23. © Hortonworks Inc. 2013 Hortonworks Process for Enterprise Hadoop Page 27 Upstream Community Projects Downstream Enterprise Product Hortonworks Data Platform Design & Develop Distribute Integrate & Test Package & Certify Apache HCatalog Apache Pig Apache HBase Other Apache Projects Apache Hive Apache Ambari Apache Hadoop Test & Patch Design & Develop Release No Lock-in: Integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects Virtuous cycle when development & fixed issues done upstream & stable project releases flow downstream Stable Project Releases Fixed Issues
  • 24. © Hortonworks Inc. 2013 OS Cloud VM Appliance Page 28 PLATFORM SERVICES HADOOP CORE DATA SERVICES OPERATIONAL SERVICES Manage & Operate at Scale Store, Process and Access Data Enterprise Readiness Only Hortonworks allows you to deploy seamlessly across any deployment option • Linux & Windows • Azure, Rackspace & other clouds • Virtual platforms • Big data appliances HORTONWORKS DATA PLATFORM (HDP) Distributed Storage & Processing Deployable Across a Range of Options
  • 25. © Hortonworks Inc. 2013 Refine-Explore-Enrich Demo Page 29 Hands-on tutorials integrated into Sandbox HDP environment for evaluation The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop – no data center, no cloud and no internet connection needed! The Hortonworks Sandbox is: • A free download: http://hortonworks.com/products/hortonworks-sandbox/ • A complete, self-contained virtual machine with Apache Hadoop pre-configured • A personal, portable and standalone Hadoop environment • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
  • 26. © Hortonworks Inc. 2013 Hortonworks & Microsoft Page 30 HDInsight • Big Data Insight for Millions, Massive expansion of Hadoop • Simplifies Hadoop, Enterprise Ready • Hortonworks Data Platform used for Hadoop on Windows Server and Azure • An engineered, open source solution – Hadoop engineered for Windows – Hadoop powered Microsoft business tools – Ops integration with MS System Center – Bidirectional connectors for SQL Server – Support for Hyper-V, deploy Hadoop on VMs – Opens the .NET developer community to Hadoop – Javascript for Hadoop – Deploy on Azure in 10 minutes • Excel • PowerPivot (BI) • PowerView (visualization) • SharePoint +
  • 27. © Hortonworks Inc. 2013 Useful Links • Hortonworks Sandbox: – http://hortonworks.com/products/hortonworks-sandbox • HDInsight Service: – ? –User/PWD • Sample Data: Page 31
  • 28. © Hortonworks Inc. 2013 Hadoop 1 hour Workshop Page 32
  • 29. © Hortonworks Inc. 2013 Useful Links • Hortonworks Sandbox: – http://hortonworks.com/products/hortonworks-sandbox • HDInsight Service: – ? –User/PWD • Sample Data: Page 33
  • 30. © Hortonworks Inc. 2013 Working with HDFS
  • 31. © Hortonworks Inc. 2013 What is HDFS? • Stands for Hadoop Distributed File System • Primary storage system for Hadoop • Fast and reliable • Deployed only on Linux (as of May 2012) –Active work around Hadoop on Windows
  • 32. © Hortonworks Inc. 2013 HDFS Characteristics (Cont.) • Write once and read many times • Files only append • Data stored in blocks –Distributed over many nodes –Block sizes often range from 128MB to 1GB
  • 33. © Hortonworks Inc. 2013 HDFS Architecture
  • 34. © Hortonworks Inc. 2013 HDFS Architecture [Diagram] NameNode: namespace, block map and block management; namespace metadata image (checkpoint) and edit journal log. Secondary NameNode: checkpoints the image and edit journal log (backup). DataNodes: each holds a subset of the data blocks (BL1, BL2, BL3, BL6, BL7, BL8, BL9), replicated across nodes.
  • 35. © Hortonworks Inc. 2013 Data Organization • Metadata –organized into files and directories –linux-like permissions prevent accidental deletions • Files –divided into uniform sized blocks –default 64 MB –distributed across clusters • Rack-aware • Keeps sizing checksums –for corruption detection
  • 36. © Hortonworks Inc. 2013 HDFS Cluster •HDFS runs in Hadoop distributed mode –on a cluster •3 main components: –NameNode –Manages DataNodes –Keeps metadata for all nodes & blocks –DataNodes –Hold data blocks –Live on racks –Client –Talks directly to NameNode then DataNodes
  • 37. © Hortonworks Inc. 2013 NameNode • Server running the namenode daemon –Responsible for coordinating datanodes • The Master of the DataNodes • Problems –Lots of overhead to being the Master –Should be special server for performance –Single point of failure
  • 38. © Hortonworks Inc. 2013 DataNode • A common node running a datanode daemon • Slave •Manages block reads/writes for HDFS •Manages block replication • Pings NameNode and gets instructions back •If heartbeat fails –NameNode removes from cluster –Replicated blocks take over
  • 39. © Hortonworks Inc. 2013 HDFS Heartbeats HDFS heartbeats Data Node daemon Data Node daemon Data Node daemon Data Node daemon “I’m datanode X, and I’m OK; I do have some new information for you: the new blocks are …” NameNode fsimage editlog
  • 40. © Hortonworks Inc. 2013 HDFS Commands
  • 41. © Hortonworks Inc. 2012 Here are a few (of the almost 30) HDFS commands: -cat: just like Unix cat – display file content (uncompressed) -text: just like cat – but works on compressed files -chgrp,-chmod,-chown: just like the Unix command, changes permissions -put,-get,-copyFromLocal,-copyToLocal: copies files from the local file system to the HDFS and vice-versa. Two versions. -ls, -lsr: just like Unix ls, list files/directories -mv,-moveFromLocal,-moveToLocal: moves files -stat: statistical info for any given file (block size, number of blocks, file type, etc.) Basic HDFS File System Commands $ hadoop fs -command <args>
  • 42. © Hortonworks Inc. 2012 Page 47 Commands Example $ hadoop fs -ls /user/brian/ $ hadoop fs -lsr $ hadoop fs -mkdir notes $ hadoop fs -put ~/training/commands.txt notes $ hadoop fs -chmod 777 notes/commands.txt $ hadoop fs -cat notes/commands.txt | more $ hadoop fs -rm notes/*.txt
  • 43. © Hortonworks Inc. 2012 Page 48 Uploading Files into HDFS $ hadoop fs -put filenameSrc filenameDest $ hadoop fs -put filename dirName/fileName $ hadoop fs -put foo bar $ hadoop fs -put foo dirName/fileName $ hadoop fs -lsr dirName
  • 44. © Hortonworks Inc. 2012 Page 49 Retrieving Files Note: Another name for the -get command is -copyToLocal $ hadoop fs -cat foo $ hadoop fs -get foo LocalFoo $ hadoop fs -rmr directory|file
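The same operations are available programmatically through the Hadoop Java FileSystem API. A minimal sketch, not from the deck, that mirrors the put, cat and get commands above; the local and HDFS paths are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Equivalent of: hadoop fs -put ~/training/commands.txt notes
            fs.copyFromLocalFile(new Path("/home/brian/training/commands.txt"),
                                 new Path("notes/commands.txt"));

            // Equivalent of: hadoop fs -cat notes/commands.txt
            IOUtils.copyBytes(fs.open(new Path("notes/commands.txt")), System.out, 4096, false);

            // Equivalent of: hadoop fs -get notes/commands.txt localCopy.txt
            fs.copyToLocalFile(new Path("notes/commands.txt"), new Path("localCopy.txt"));

            fs.close();
        }
    }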
  • 45. © Hortonworks Inc. 2013 How MapReduce Works
  • 46. © Hortonworks Inc. 2012 Page 51 Basic MapReduce Architecture [Diagram] Data is read from the Distributed File System (HDFS) through an InputFormat, processed by Map tasks, routed by the Partitioner, sorted, processed by Reduce tasks, and written back through an OutputFormat; the map and reduce tasks run on different nodes.
  • 47. © Hortonworks Inc. 2012 Page 52 Simple MapReduce [Diagram] Input records become key|value pairs; Map emits intermediate key|value pairs; Reduce groups the values for each key and produces the result, which is some kind of collection of key|value pairs.
  • 48. © Hortonworks Inc. 2012 Page 53 InputFormat • Determines how the data is split up • Creates InputSplit[] arrays – Each is individual map – Associated with a list of destination nodes • RecordReader – Makes key,value pairs – Converts data types
  • 50. © Hortonworks Inc. 2012 Page 55 Partitioner • Distributes key,value pairs • Decides the target Reducer – Uses the key to determine – Uses Hash function by default – Can be custom Reduce Phase getPartition(K2 key, V2 value, int numPartitions)
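To make the getPartition() hook concrete, here is a hedged sketch of a custom partitioner in the same old (org.apache.hadoop.mapred) API used elsewhere in this deck. The class name and the routing rule are invented for illustration; it would be registered on the job with conf.setPartitionerClass(FirstLetterPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Illustrative replacement for the default hash partitioner: all keys that
    // start with the same letter are sent to the same reducer.
    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String word = key.toString();
            if (word.isEmpty()) {
                return 0;
            }
            // char values are non-negative, so the modulo result is a valid partition index
            return Character.toLowerCase(word.charAt(0)) % numPartitions;
        }

        public void configure(JobConf job) {
            // No per-job configuration needed for this example
        }
    }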
  • 51. © Hortonworks Inc. 2012 Page 56 The Shuffle Map Map Map Reduce Reduce HTTP Reduce Phase
  • 52. © Hortonworks Inc. 2012 Page 57 Sort • Guarantees sorted inputs to Reducers • Final step of Shuffle • Helps to merge Reducer inputs Reduce Phase
  • 53. © Hortonworks Inc. 2012 Page 58 Reduce • Receives output from many Mappers • Consolidates values for common intermediate keys • Groups values by key reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
  • 54. © Hortonworks Inc. 2012 Page 59 OutputFormat • Validator – For output specs • Sets up a RecordWriter – Which writes out to HDFS – Organizes output into part-0000x files
  • 55. © Hortonworks Inc. 2012 Page 60 A Basic MapReduce Job map() implemented private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } }
  • 56. © Hortonworks Inc. 2012 Page 61 A Basic MapReduce Job reduce() implemented private final IntWritable totalCount = new IntWritable(); public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } totalCount.set(sum); output.collect(key, totalCount); }
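The deck shows map() and reduce() but not the driver that wires them into a job. A minimal sketch in the same old API, assuming the two methods above live in classes named WordCountMapper and WordCountReducer; those class names and the command-line argument handling are illustrative, not from the deck:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Key/value types emitted by the map() and reduce() shown above
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Assumed class names for the map() and reduce() implementations above
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);   // submits the job and waits for completion
        }
    }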
  • 57. © Hortonworks Inc. 2012 Page 62 Another Use Case – Inverted Index Map Reduce www.yahoo.com | news sports finance email celebrity www.amazon.com | shoes books jeans www.google.com | news finance email search www.microsoft.com | operating-system productivity search books | www.amazon.com www.target.com celebrity | www.yahoo.com email | www.google.com www.yahoo.com www.facebook.com finance | www.yahoo.com www.google.com
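One way the inverted-index slide could be implemented with the same old API. The class names, the pipe-delimited input parsing and the space-separated output format are assumptions for illustration, not code from the deck:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Input lines look like:  www.yahoo.com | news sports finance email celebrity
    public class InvertedIndex {

        public static class IndexMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
                String[] parts = value.toString().split("\\|");
                if (parts.length != 2) return;              // skip malformed lines
                Text url = new Text(parts[0].trim());
                for (String term : parts[1].trim().split("\\s+")) {
                    output.collect(new Text(term), url);    // emit (term, url)
                }
            }
        }

        public static class IndexReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterator<Text> values,
                               OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
                StringBuilder urls = new StringBuilder();
                while (values.hasNext()) {
                    urls.append(values.next().toString()).append(" ");
                }
                output.collect(key, new Text(urls.toString().trim()));   // term | list of urls
            }
        }
    }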
  • 59. © Hortonworks Inc. 2012 Page 64 What is Pig? • Pig is an extension of Hadoop that simplifies the ability to query large HDFS datasets • Pig is made up of two main components: – A SQL-like data processing language called Pig Latin – A compiler that compiles and runs Pig Latin scripts • Pig was created at Yahoo! to make it easier to analyze the data in your HDFS without the complexities of writing a traditional MapReduce program • With Pig, you can develop MapReduce jobs with a few lines of Pig Latin
  • 60. © Hortonworks Inc. 2012 Page 65 Pig In The EcoSystem • Pig runs on Hadoop utilizing both HDFS and MapReduce • By default, Pig reads and writes files from HDFS • Pig stores intermediate data among MapReduce jobs HDFS MapReduce Pig HCatalog HBase
  • 61. © Hortonworks Inc. 2012 Page 66 Running Pig A Pig Latin script executes in three modes: 1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (the default mode) 2. Local: the code executes locally in a single JVM using a local text file (for development purposes) 3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell $ pig myscript.pig $ pig -x local myscript.pig $ pig grunt>
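Beyond these three modes, Pig Latin can also be embedded in a Java program through the PigServer class. A minimal sketch; the relation name, input file and output directory are illustrative:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class EmbeddedPig {
        public static void main(String[] args) throws Exception {
            // ExecType.MAPREDUCE runs on the cluster; ExecType.LOCAL mirrors "pig -x local"
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            pig.registerQuery("A = LOAD 'myfile';");

            // store() triggers execution, just like STORE in a script
            pig.store("A", "myoutput");

            pig.shutdown();
        }
    }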
  • 62. © Hortonworks Inc. 2012 Page 67 Understanding Pig Execution • Pig Latin is a data flow language • During execution each statement is processed by the Pig interpreter • If a statement is valid, it gets added to a logical plan built by the interpreter • The steps in the logical plan do not actually execute until a DUMP or STORE command
  • 63. © Hortonworks Inc. 2012 Page 68 A Pig Example • The first three commands are built into a logical plan • The STORE command triggers the logical plan to be built into a physical plan • The physical plan will be executed as one or more MapReduce jobs logevents = LOAD 'input/my.log' AS (date, level, code, message); severe = FILTER logevents BY (level == 'severe' AND code >= 500); grouped = GROUP severe BY code; STORE grouped INTO 'output/severeevents';
  • 64. © Hortonworks Inc. 2012 Page 69 An Interactive Pig Session • Command line history and editing • Tab will complete commands (but not filenames) • To exit enter quit $ pig grunt> A = LOAD 'myfile'; grunt> DUMP A; (output appears here)
  • 65. © Hortonworks Inc. 2012 Page 70 Pig Command Options • To see a full listing enter: pig -h or help • Execute -e or -execute -f scriptname or -filename scriptname • Specify a parameter setting -p or -parameter example: -p key1=value1 -p key2=value2 • List the properties that Pig will use if they are set by the user -h properties • Display the version -version
  • 66. © Hortonworks Inc. 2012 Page 71 Grunt – HDFS Commands • Grunt acts as a shell to access HDFS • Commands include: fs -ls fs -cat filename fs -copyFromLocal localfile hdfsfile fs -copyToLocal hdfsfile localfile fs -rm filename fs -mkdir dirname fs -mv fromLocation/filename toLocation/filename
  • 67. © Hortonworks Inc. 2012 Page 72 Pig’s Data Model • 6 Scalar Types – int, long, float, double, chararray, bytearray • 3 Complex types – Tuple: ordered set of values • (F, 66, 41000, 95103) – Bag: unordered collection of tuples • { (F, 66, 41000, 95103), (M, 40, 14000, 95102) } – Map: collection of key value pairs • [name#Bob, age#34]
  • 68. © Hortonworks Inc. 2012 Page 73 Relations • Pig Latin statements work with relations • A bag of tuples • Similar to a table in a relational database, where the tuples in the bag correspond to rows in a table • Unlike rows in a table – the tuples in a Pig relation do not have to contain the same number of fields – nor do the fields have to be the same data type • Relation schemas are optional
  • 69. © Hortonworks Inc. 2012 Page 74 Defining Relations • The LOAD command loads data from a file into a relation. The syntax looks like: where 'data' is either a filename or directory • Use the AS option to define a schema for the relation: • TIP: Use the DESCRIBE command to view the schema of a relation alias = LOAD 'data'; alias = LOAD 'data' AS (name1:type,name2:type,...); DESCRIBE alias;
  • 70. © Hortonworks Inc. 2012 Page 75 A Relation with a Schema • Suppose we have the following data in HDFS: Tom,21,94085,5000 John,45,95014,25000 Joe,21,94085,50000 Larry,45,95014,36000 Hans,21,94085,80000 • The data above represents the name, age, ZIP code and salary of employees
  • 71. © Hortonworks Inc. 2012 Page 76 A Relation with a Schema • The following LOAD command defines a relation named employees with a schema: • The DESCRIBE command for employees outputs the following: employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double); describe employees; employees: {name: chararray,age: int,zip: int,salary: double}
  • 72. © Hortonworks Inc. 2012 Page 77 Default Schema Datatype • If not specified, the data type of a field in a relation defaults to bytearray • What will the data type be for each field in the following relation? employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age, zip:int, salary);
  • 73. © Hortonworks Inc. 2012 Page 78 Using a Schema • Defining a schema allows you to refer to the values of a relation by the name of the field in the schema • Because we defined a schema for the employees relation, the FILTER command can refer to the second field in the relation by the name “age” employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray,age:int, zip:int,salary:double); newhires = FILTER employees BY age <= 21;
  • 74. © Hortonworks Inc. 2012 Page 79 Relations without a Schema • If a relation does not define a schema, then Pig will simply load the data anyway (because “pigs eat anything”) • The output of the above DESCRIBE command is: employees = LOAD 'pig/input/File1' USING PigStorage(','); DESCRIBE employees; Schema for employees unknown.
  • 75. © Hortonworks Inc. 2012 Page 80 Relations without a Schema • Without a schema, a field is referenced by its position within the relation • $0 is the first field, $1 is the second field, and so on • The output of the above commands is: employees = LOAD 'pig/input/File1' USING PigStorage(','); newhires = FILTER employees BY $1 <= 21; DUMP newhires; (Tom,21,94085,5000.0) (Joe,21,94085,50000.0) (Hans,21,94085,80000.0)
  • 76. © Hortonworks Inc. 2012 Page 81 Nulls in Pig • Data elements may be null – Null means that the data element value is “undefined” • In a LOAD command, null is automatically inserted for missing or invalid fields • Example: LOAD 'a.txt' AS (a1:int, a2:int) USING PigStorage(','); This data: 1,2,3 | 5 | 6,bye is loaded as: (1,2) | (5, null) | (6, null)
  • 77. © Hortonworks Inc. 2012 Page 82 The GROUP Operator • The GROUP operator groups together tuples based on a specified key • The usage for GROUP is: x = GROUP alias BY expression – alias = the name of the existing relation that you want to group together – expression = a tuple expression that is the key you want to group by • The result of the GROUP command is a new relation
  • 78. © Hortonworks Inc. 2012 Page 83 A GROUP Example • The output of DESCRIBE is: employees = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray,age:int, zip:int,salary:double); a = GROUP employees BY salary; DESCRIBE a; a: {group: double, employees: {(name: chararray, age:int, zip:int, salary: double) } }
  • 79. © Hortonworks Inc. 2012 Page 84 More GROUP Examples daily = LOAD 'NYSE_daily' AS (exchange, stock, date, dividends); grpd = GROUP daily BY (exchange, stock); DESCRIBE grpd; grpd: {group: (exchange:bytearray, stock:bytearray), daily: {(exchange:bytearray, stock:bytearray, date:bytearray, dividends: bytearray)}} daily = LOAD 'NYSE_daily' AS (exchange, stock); grpd = GROUP daily BY stock; cnt = FOREACH grpd GENERATE group, COUNT(daily);
  • 80. © Hortonworks Inc. 2012 Page 85 The JOIN Operator • The JOIN operator performs an inner join on two or more relations based on common field values • The syntax for JOIN is: x = JOIN alias BY expression, alias BY expression,… – alias = an existing relation – expression = a field of the relation • The result of JOIN is a flat set of tuples
  • 81. © Hortonworks Inc. 2012 Page 86 A JOIN Example • Suppose we add another set of data that contains the employee’s name and a phone number: Tom,4085551211 Tom,6505550123 John,4085554332 Joe,4085559898 Joe,4085557777
  • 82. © Hortonworks Inc. 2012 Page 87 A JOIN Example • The output of the DESCRIBE above is: e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray,age:int, zip:int,salary:double); e2 = LOAD 'pig/input/File2' USING PigStorage(',') AS (name:chararray,phone:chararray); e3 = JOIN e1 BY name, e2 BY name; DESCRIBE e3; e3: {e1::name:chararray, e1::age:int, e1::zip:int,e1::salary:double, e2::name:chararray,e2::phone:chararray}
  • 83. © Hortonworks Inc. 2012 Page 88 A JOIN Example • The JOIN output looks like: grunt> DUMP e3; (Joe,21,94085,50000.0,Joe,4085559898) (Joe,21,94085,50000.0,Joe,4085557777) (Tom,21,94085,5000.0,Tom,4085551211) (Tom,21,94085,5000.0,Tom,6505550123) (John,45,95014,25000.0,John,4085554332)
  • 84. © Hortonworks Inc. 2012 Page 89 The FOREACH Operator • The FOREACH operator transforms data into a new relation based on the columns of the data • The syntax looks like: x = FOREACH alias GENERATE expression – alias = an existing relation – expression = an expression that determines the output
  • 85. © Hortonworks Inc. 2012 Page 90 A FOREACH Example • The output of this example is a bag: e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray,age:int, zip:int,salary:double); f = FOREACH e1 GENERATE age,salary; DESCRIBE f; DUMP f; f: {age:int, salary:double} (21,5000.0) (45,25000.0) (21,50000.0) (45,36000.0) (21,80000.0)
  • 86. © Hortonworks Inc. 2012 Page 91 Using FOREACH on Groups e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray,age:int, zip:int,salary:double); g = GROUP e1 BY age; DESCRIBE g; g: {group: int,e1: {(name:chararray, age:int, zip:int, salary:double)}} f = FOREACH g GENERATE group, SUM(e1.salary); DESCRIBE f; f: {group: int,double} DUMP f; (21,135000.0) (45,61000.0)
  • 87. © Hortonworks Inc. 2012 Page 92 Pig Latin Structured Processing Flow • Pig Latin script describes a directed acyclic graph (DAG) – The edges are data flows and the nodes are operators that process the data LOAD FOREACH JOIN LOAD FOREACH 1 2 3 4 5 6 7
  • 88. © Hortonworks Inc. 2013 Thank You! Questions & Answers Page 96

Editor's Notes

  1. I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.What we now know of as Hadoop really started back in 2005, when Eric Baldeschwieler – known as “E14” – started to work on a project that to build a large scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo’s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform, and as a result the team’s focus extended to include a focus on Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large scale data processing and storage applications and necessitating a focus on operations to support what as by now a large variety of critical business applications.In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would would enable a larger number of organizations to adopt and expand their usage of Hadoop.[note: if useful as a talk track, Cloudera was formed in 2008 well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
  2. In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more. For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: on both choice of deployment platforms such as Windows, Azure and more, but also to create deeply engineered solutions with key partners such as Teradata. And consistent with our approach, all of this is done in 100% open source.
  3. Across all of our user base, we have identified just 3 separate usage patterns – sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore and Enrich.The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for usage with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for their analytics applications while leveraging their existing data warehousing and analytics tools.Using the graphic here, in step 1 data is pulled from a variety of sources, into the Hadoop platform in step 2, and then in step 3 loaded into a data warehouse for analysis by existing BI tools
  4. A second use case is what we would refer to as Data Exploration – this is the use case in question most commonly when people talk about “Data Science”.In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendor rally to add support for Hadoop – and most commonly HDP – as a peer to the database and in so doing allow for rich analytics on extremely large datasets that would be both unwieldy and also costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store.To use the graphic, in step 1 data is pulled into HDP, it is stored and processed in Step 2, before being surfaced directly into the analytics tools for the end user in Step 3.
  5. The final use case is called Application Enrichment.This is about incorporating data stored in HDP to enrich an existing application. This could be an on-line application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a product that you sell related to that category. Large web companies such as Facebook and others are very sophisticated in the use of this approach.In the diagram, this is about pulling data from disparate sources into HDP in Step 1, storing and processing it in Step 2, and then interacting with it directly from your applications in Step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
  6. It is for that reason that we focus on HDP interoperability across all of these categories. Data systems: HDP is endorsed and embedded with SQL Server, Teradata and more. BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects and more. Development tools: for .NET developers, Visual Studio, used to build more than half the custom applications in the world, certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop; for Java developers, Spring for Apache Hadoop enables Java developers to quickly and easily build Hadoop-based applications with HDP. Operational tools: integration with System Center, and with Teradata Viewpoint.
  7. At its core, Hadoop is about HDFS and MapReduce, 2 projects that are really about distributed storage and data processing which are the underpinnings of Hadoop.In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc, which enable use of the technology for a much broader (and mission critical) problem set.This is accomplished not by introducing new open source projects, but rather ensuring that these aspects are addressed within existing projects.HDFS: Self-healing, distributed file system for multi-structured data; breaks files into blocks &amp; stores redundantly across clusterMapReduce: Framework for running large data processing jobs in parallel across many nodes &amp; combining resultsYARN: New application management framework that enables Hadoop to go beyond MapReduce appsEnterprise-ready servicesHigh availability, disaster recovery, snapshots, security, …
  8. Beyond Core and Platform Services, we must add a set of Data Services that enable the full data lifecycle. This includes capabilities to:Store dataProcess dataAccess dataFor example: how do we maintain consistent metadata information required to determine how best to query data stored in HDFS? The answer: a project called Apache HCatalogOr how do we access data stored in Hadoop from SQL-oriented tools? The answer: with projects such as Hive, which is the defacto standard for accessing data stored in HDFS.All of these are broadly captured under the category of “data services”.Apache HCatalog: Metadata &amp; Table ManagementMetadata service that enables users to access Hadoop data as a set of tables without needing to be concerned with where or how their data is storedEnables consistent data sharing and interoperability across data processing tools such as Pig, MapReduce and HiveEnables deep interoperability and data access with systems such as Teradata, SQL Server, etc.Apache Hive: SQL Interface for HadoopThe de-facto SQL-like interface for Hadoop that enables data summarization, ad-hoc query, and analysis of large datasetsConnects to Excel, Microstrategy, PowerPivot, Tableau and other leading BI tools via Hortonworks Hive ODBC DriverHive currently serves batch and non-interactive use cases; in 2013, Hortonworks is working with Hive community to extend use cases to interactive query. Cloudera, on the other hand, has chosen to abandon Hive in lieu of Cloudera Impala (a Cloudera controlled technology aimed at the analytics market and solely focused on non-operational interactive query use cases)Apache HBase: NoSQL DB for Interactive AppsNon-relational, columnar database that provides a way for developers to create, read, update, and delete data in Hadoop in a way that performs well for interactive applicationsCommonly used for serving “intelligent applications” that predict user behavior, detect shifting usage patterns, or recommend ways for users to engageWebHDFS: Web service interface for HDFSScalable REST API that enables easy and scalable access to HDFS Move files in &amp; out and delete from HDFS; leverages parallelism of clusterPerform file and directory functionswebhdfs://&lt;HOST&gt;:&lt;HTTP PORT&gt;/PATHIncluded in versions 1.0 and 2.0 of Hadoop; created &amp; driven by HortonworkersTalend Open Studio for Big Data: open source ETL tool available as an optional download with HDPIntuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and PigOozie scheduling allows you to manage and stage jobs Connectors for any database, business application or systemIntegrated HCatalog storage
  9. HCatalog – metadata shared across the whole platform. File locations become abstract (not hard-coded); data types become shared (not redefined per tool); partitioning and HDFS-optimized.
  10. Any data management platform that is operated at any reasonable scale requires a management technology – for example SQL Server Management Studio for SQL Server, or Oracle Enterprise Manager for Oracle DB, etc. Hadoop is no exception, and for Hadoop that means Apache Ambari, which is increasingly being recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster and provides a set of tools to visualize and diagnose operational issues. There are other projects in this category (such as Oozie) but Ambari is really the most influential. Apache Ambari: Management & Monitoring. Makes Hadoop clusters easy to operate: simplified cluster provisioning with a step-by-step install wizard; pre-configured operational metrics for insight into the health of Hadoop services; visualization of job and task execution for visibility into performance issues; a complete RESTful API for integrating with existing operational tools; an intuitive user interface that makes controlling a cluster easy and productive.
  11. So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique: We start with this group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute back all of the bug fixes to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP) that includes both Hadoop Core as well as the related projects required by the Enterprise user, and provide it to our customers. Through this application of an enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.
  12. And finally, because any enterprise runs a heterogeneous set of infrastructures, we ensure that HDP runs on your choice of infrastructure. Whether this is Linux, Windows (HDP is the only distribution certified for Windows), on a cloud platform such as Azure or Rackspace, or in an appliance, we ensure that all of them are supported and that this work is all contributed back to the open source community.
  13. Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
  14. Despite its name the SNN does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode's metadata files that are on NFS to the secondary and run it as the new primary. Edit log: When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests. The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure. fsimage: The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up. According to Apache, SecondaryNameNode is deprecated. See: http://hadoop.apache.org/hdfs/docs/r0.22.0/hdfs_user_guide.html#Secondary+NameNode (Accessed May 2012). – LW. SNN configuration options: dfs.namenode.checkpoint.period (value); dfs.namenode.checkpoint.size (value); dfs.http.address (value; point to the namenode's port 50070, from which the SNN gets the fsimage and edit log); dfs.namenode.checkpoint.dir (value; else defaults to /tmp).
  15. Data Disk Failure, Heartbeats and Re-ReplicationEach DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.Cluster RebalancingThe HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.Data IntegrityIt is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.Metadata Disk FailureThe fsImage and the editLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
  16. Use the put command. If bar is a directory then foo will be placed in the directory bar. If no file or directory exists with the name bar, the file foo will be named bar. If bar is already a file in HDFS then an error is returned. If subName does not exist then the file will be created in the directory. Note: -copyFromLocal can be used instead of -put.
17. Can display a file's contents using cat. Copy a file from HDFS to the local file system with the -get command:
$ bin/hadoop fs -get foo LocalFoo
Other commands:
$ hadoop fs -rmr directory|file
Recursively removes a directory or file.
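For readers who prefer the Java API, here is a minimal sketch of the same copies done programmatically with org.apache.hadoop.fs.FileSystem; the file and directory names are just placeholders carried over from the commands above, not paths from the course.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // handle to the configured filesystem (HDFS)

    // Equivalent of: hadoop fs -put foo bar
    fs.copyFromLocalFile(new Path("foo"), new Path("bar"));

    // Equivalent of: hadoop fs -get foo LocalFoo
    fs.copyToLocalFile(new Path("foo"), new Path("LocalFoo"));

    // Equivalent of: hadoop fs -rmr someDir   (recursive delete; someDir is a placeholder)
    fs.delete(new Path("someDir"), true);

    fs.close();
  }
}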
18. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system (HDFS). The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and HDFS run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

M/R supports multiple nodes processing contiguous data sets in parallel. Often, one node will "Map" while another "Reduces"; in between is the "shuffle". We'll cover all of these in more detail. The basic "phases", or steps, are displayed here; other custom phases may be added. In MapReduce, data is passed around as key/value pairs.

QUESTION: Is the "Reduce" phase required? NO.

INSTRUCTOR SUGGESTION: At the start of the second day, have one or more students re-draw this diagram on the whiteboard to reiterate its importance.
19. MapReduce consists of Java classes that can process big data in parallel. It receives three inputs (a source collection, a Map function and a Reduce function) and returns a new result collection.

The algorithm is composed of a few steps. The first executes the map() function on each item within the source collection; map() returns zero or more instances of Key/Value objects. Map's responsibility is to convert an item from the source collection into zero or many Key/Value pairs.

In the next step, the framework sorts all Key/Value pairs and creates new object instances in which all values are grouped by Key.

reduce() is then called once for each grouped Key/Value instance and returns a new instance to be included in the result collection.
20. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. An InputFormat determines how the data is split across map tasks; TextInputFormat is the default InputFormat. It divides the input data into an InputSplit[] array, and each InputSplit is given an individual map. Each InputSplit is associated with a destination node that should be used to read the data, so Hadoop can try to assign the computation to the node where the data is local.

The RecordReader class: reads input data and produces key-value pairs to be passed into the Mapper; may control how data is decompressed; converts data to Java types that MapReduce can work with.

MapReduce relies on the InputFormat to: validate the input-specification of the job; split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper; and provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of InputFormat is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the HDFS block size of the input files is treated as an upper bound for input splits, and a lower bound on the split size can be set via the config parameter mapred.min.split.size.

Logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task.

If the default TextInputFormat is used for a given job, the framework detects compressed input files (i.e., with the .gz extension) and automatically decompresses them using an appropriate CompressionCodec. Note that, depending on the codec used, compressed files cannot be split, and each such file is processed in its entirety by a single mapper: gzip cannot be split; bzip2 can, and LZO can when an index is available.
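As a sketch only (the helper class and input path below are invented for illustration), wiring the input side onto a JobConf with the old org.apache.hadoop.mapred API looks roughly like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputSetupSketch {
  public static void configureInput(JobConf conf) {
    conf.setInputFormat(TextInputFormat.class);                  // the default: one text line per record
    FileInputFormat.addInputPath(conf, new Path("/data/input")); // example HDFS path, not from the slides
    conf.set("mapred.min.split.size", "134217728");              // raise the lower bound on split size to 128 MB
  }
}

The HDFS block size of the input files still acts as the effective upper bound on split size.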
21. Mapping is often used for typical "ETL" processes, such as filtering, transformation, and categorization. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

The OutputCollector is provided by the framework to collect data output by the Mapper or the Reducer. The Reporter class is a facility for MapReduce applications to report progress, set application-level status messages and update Counters. Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might otherwise assume that the task has timed out and kill it. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high enough value (or even to zero for no time-outs). Applications can also update Counters using the Reporter; see the sketch below.

Parallel Map Processes: The number of maps is usually driven by the number of inputs, or the total number of blocks in the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to execute. If you expect 10TB of input data and have a block size of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set the number even higher.
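As an illustration of keeping a slow task alive and updating Counters (the class name, status message, counter names, and pass-through logic below are made up for this sketch, not taken from the course material):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    reporter.progress();                                      // tell the framework this task is still alive
    reporter.setStatus("processing offset " + key.get());     // application-level status message
    reporter.incrCounter("SlowRecordMapper", "RECORDS", 1);   // custom counter (hypothetical group/name)
    output.collect(new Text("offset-" + key.get()), value);   // pass the record through unchanged
  }
}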
22. The Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function on the key (or a portion of it). All Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of Reduce tasks for the job. The Partitioner acts like a load balancer that controls which of the Reduce tasks the intermediate key (and hence the record) is sent to for reduction. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner, as sketched below.

All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer to determine the final output. Users can control that grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

MapReduce comes bundled with a library of generally useful Mappers, Reducers, and Partitioner classes.
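A minimal custom Partitioner sketch, assuming Text keys and IntWritable values; the "first character of the key" rule is an invented example of deriving the partition from a subset of the key:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) { }   // no per-job configuration needed for this sketch

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Derive the partition from a subset of the key: here, its first character.
    String k = key.toString();
    char first = k.isEmpty() ? ' ' : k.charAt(0);
    return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numPartitions;
  }
}

// It would be wired into a job with: conf.setPartitionerClass(FirstCharPartitioner.class);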
23. MapReduce partitions data among individual Reducers, usually running on separate nodes. The Shuffle phase is determined by how a Partitioner (embedded or custom) assigns key/value pairs to Reducers. The shuffle is where refinements and improvements are continually being made; in many ways, the shuffle is the heart of MapReduce and is where the "magic" happens.
  24. The Sort phase guarantees that the input to every Reducer is sorted by key.
25. A Reducer extends MapReduceBase and implements the Reducer interface. It receives output from multiple mappers; the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. Input is sorted by the key of the key/value pair, and the reducer typically iterates through the values associated with each key.

It is perfectly legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to HDFS, to the output path set by setOutputPath(Path), and the framework does not sort the map outputs before writing them out to HDFS.

Reducer has 3 primary phases: Shuffle, Sort and Reduce.

Shuffle: Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.

Sort: The framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.

Secondary Sort: If the equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are sorted, these can be used in conjunction to simulate a secondary sort on values. A grouping-comparator sketch follows these notes.

Reduce: In this phase the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to HDFS via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
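To make the secondary-sort idea concrete, here is a grouping-comparator sketch. It assumes a composite Text key of the form "naturalKey<TAB>secondaryField"; that key layout is an assumption of this example rather than anything defined in the course.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite Text keys of the form "naturalKey\tsecondaryField" by the natural key only,
// so one reduce() call sees all values for a natural key, in the order the sort comparator produced.
public class NaturalKeyGroupingComparator extends WritableComparator {

  public NaturalKeyGroupingComparator() {
    super(Text.class, true);   // true: create key instances so the object-based compare below is used
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String left  = a.toString().split("\t", 2)[0];
    String right = b.toString().split("\t", 2)[0];
    return left.compareTo(right);
  }
}

// Registered on the job with:
//   conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
// alongside a full-key sort comparator set via conf.setOutputKeyComparatorClass(...).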
26. OutputFormat has two responsibilities: determining where and how data is written. It determines where by examining the JobConf and checking that the destination is legitimate; the 'how' is handled by the getRecordWriter() function.

From Hadoop's point of view, these classes come together when the MapReduce job finishes and each reducer has produced a stream of key-value pairs. Hadoop calls checkOutputSpecs with the job's configuration. If the function runs without throwing an Exception, it moves on and calls getRecordWriter, which returns an object that can write that stream of data. When all of the pairs have been written, Hadoop calls the close function on the writer, committing that data to HDFS and finishing the responsibility of that reducer.

MapReduce relies on the OutputFormat of the job to: validate the output-specification of the job (for example, check that the output directory doesn't already exist); and provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem (usually HDFS). TextOutputFormat is the default OutputFormat.

OutputFormats specify how to serialize data by providing an implementation of RecordWriter. RecordWriter classes handle the job of taking an individual key-value pair and writing it to the location prepared by the OutputFormat. There are two main functions a RecordWriter implements: 'write' and 'close'. The 'write' function takes key-values from the MapReduce job and writes the bytes to disk. The default RecordWriter is LineRecordWriter, part of the TextOutputFormat mentioned earlier. It writes: the key's bytes (returned by the getBytes() function), a tab character delimiter, the value's bytes (again produced by getBytes()), and a newline character. The 'close' function closes the Hadoop data stream to the output file.

We've talked about the format of output data, but where is it stored? You've probably seen the output of a job stored in many 'part' files under the output directory, like so:
|-- output-directory
|   |-- part-00000
|   |-- part-00001
|   |-- part-00002
|   |-- part-00003
|   |-- part-00004
|   '-- part-00005

(from http://www.infoq.com/articles/HadoopOutputFormat)
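Pulling the pieces together, here is a minimal driver sketch in the old mapred API. The job name, the command-line paths, and the WordCountMapper/WordCountReducer classes (sketched after the next two slides' notes) are illustrative, not the course's exact example.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);            // types of the final (reducer) output
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);    // defined in the sketches that follow
    conf.setReducerClass(WordCountReducer.class);

    conf.setInputFormat(TextInputFormat.class);    // how input is split and read
    conf.setOutputFormat(TextOutputFormat.class);  // writes "key <TAB> value <newline>" part files

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not already exist

    JobClient.runJob(conf);                        // submit the job and wait for completion
  }
}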
27. Mostly "boilerplate" with fill-in-the-blank for datatypes. Hadoop datatypes have well-defined methods to get/put values into them. It is a standard pattern to declare your output objects outside the map() loop scope, and a standard pattern to 1) get the Java datatypes, 2) do your logic, and 3) put the results back into Hadoop datatypes. This map() method is called in a "loop" with successive key, value pairs. Each time through, you typically write key, value pairs to the reduce phase. The key, value pairs go to the Hadoop Framework, where the key is hashed to choose a reducer.
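A word-count map() that follows this pattern, as a sketch in the old mapred API (the course's actual example may differ in detail):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Output objects declared once, outside the per-record loop, and reused.
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();                  // 1) get the Java datatype out
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {              // 2) do the logic
      word.set(tokenizer.nextToken());               // 3) put the result back into a Hadoop datatype
      output.collect(word, one);                     // the key is hashed to choose a reducer
    }
  }
}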
28. In the Reducer: the reduce code is copied to multiple nodes in the cluster, and the copies are identical. All key/value pairs with identical keys are placed into a (key, Iterator<values>) pair when passed to the reducer, so each invocation of the reduce() method is passed all values associated with a particular key. The keys are guaranteed to arrive in sorted order. Each incoming value is a numeric count of words in a particular line of text, and the multiple values represent multiple lines of text processed by multiple map() methods. The values are in an Iterator and are summed in a loop. The sum, with its associated key, is sent to a disk file via the Hadoop Framework. There are typically many copies of your reduce code. Incoming key/value pairs are of the datatypes that map emits; output key/value pairs go to HDFS, which writes them to disk. The Hadoop Framework constructs the Iterator from the map values that are sent to this Reducer.
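The matching reduce() sketch, summing the Iterator of values for each key (again a word-count flavored example that may differ from the course's exact code):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {        // all values for this key, from all mappers
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));   // written to HDFS by the job's OutputFormat
  }
}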
29. An inverted index is a data structure that stores a mapping from content to its locations in a database file, or in a document or a set of documents. It can provide what to display for a search. NOTE: need to cover OutputCollector() here; see the sketch below.
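A minimal inverted-index sketch: the mapper emits (word, source filename) pairs via the OutputCollector, and the reducer collects the set of filenames for each word. Taking the document id from the input split's filename, and the simple whitespace tokenization, are assumptions of this example.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {

  public static class IndexMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text location = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Use the source filename as the "location" of each word.
      String file = ((FileSplit) reporter.getInputSplit()).getPath().getName();
      location.set(file);
      for (String token : value.toString().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          output.collect(word, location);   // emit (word, document)
        }
      }
    }
  }

  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      Set<String> docs = new HashSet<String>();   // de-duplicate documents per word
      while (values.hasNext()) {
        docs.add(values.next().toString());
      }
      output.collect(key, new Text(docs.toString()));   // word -> list of documents
    }
  }
}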
30. Pig Latin is a data flow language: a sequence of steps where each step is an operation or command. During execution each statement is processed by the Pig interpreter and checked for syntax errors. If a statement is valid, it gets added to a logical plan built by the interpreter. However, the step does not actually execute until the entire script is processed, unless the step is a DUMP or STORE command, in which case the logical plan is compiled into a physical plan and executed.
31. After the LOAD statement, point out that nothing is actually loaded! The LOAD command only defines a relation. The FILTER and GROUP commands are fairly self-explanatory. HDFS is not even touched until the STORE command executes, at which point a MapReduce application is built from the Pig Latin statements shown here.
32. Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and does basic syntax and semantic checking as you enter each line; it also provides a shell for users to interact with HDFS. To enter Grunt and use the local file system instead:
$ pig -x local

In the example script: A is called a relation or alias; relations are immutable and are recreated if reused. myfile is read from your home directory in HDFS. A has no schema associated with it, so its fields default to bytearray. LOAD uses the default function PigStorage() to load data and assumes TAB-delimited data in HDFS. The entire file 'myfile' will be read ("pigs eat anything"). The elements in A can be referenced by position if no schema is associated: $0, $1, $2, ...
  33. Pig return codes:(type in page 18 of Pig)
  34. The complex types can contain data of any type
35. The DESCRIBE output is:
describe employees
employees: {name: chararray, age: bytearray, zip: int, salary: bytearray}
36. Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it. (from Programming Pig, Alan Gates, 2011)
37. twokey.pig collects all records with the same value for the provided key together into a bag; within a single relation, it groups together tuples with the same group key. The keywords BY and ALL can be used with GROUP. The grouped result can optionally be passed to an aggregate function (count.pig). You can group on multiple keys if they are surrounded by parentheses.