Hadoop Workshop
Chris Harris
Twitter: cj_harris5
E-mail: charris@hortonworks.com
© Hortonworks Inc. 2012
Enhancing the Core of Apache Hadoop

Hadoop Core: HDFS, YARN (in 2.0), MapReduce
Platform Services: Enterprise Readiness

Deliver high-scale storage & processing with enterprise-ready platform services.

Unique Focus Areas:
• Bigger, faster, more flexible
  Continued focus on speed & scale and enabling near-real-time apps
• Tested & certified at scale
  Run ~1300 system tests on large Yahoo clusters for every release
• Enterprise-ready services
  High availability, disaster recovery, snapshots, security, …
Data Services for Full Data Lifecycle

Hadoop Core: Distributed Storage & Processing
Platform Services: Enterprise Readiness
Data Services: WebHDFS, HCatalog, Hive, Pig, HBase, Sqoop, Flume

Provide data services to store, process & access data in many ways.

Unique Focus Areas:
• Apache HCatalog
  Metadata services for consistent table access to Hadoop data
• Apache Hive
  Explore & process Hadoop data via SQL & ODBC-compliant BI tools
• Apache HBase
  NoSQL database for Hadoop
• WebHDFS
  Access Hadoop files via a scalable REST API
• Talend Open Studio for Big Data
  Graphical data integration tools
Operational Services for Ease of Use

Hadoop Core: Distributed Storage & Processing
Platform Services: Enterprise Readiness
Data Services: Store, Process and Access Data
Operational Services: Oozie, Ambari

Include complete operational services for productive operations & management.

Unique Focus Area:
• Apache Ambari
  Provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues
Useful Links
• Hortonworks Sandbox:
–  http://hortonworks.com/products/hortonworks-sandbox
• Sample Data:
– http://internetcensus2012.bitbucket.org/paper.html
– http://data.worldbank.org
– Other speakers…
HDFS Architecture

[Architecture diagram] The NameNode manages the NameSpace and the Block Map and performs block management. Its NameSpace metadata is persisted as an image (checkpoint) plus an edit journal log. A Secondary NameNode periodically checkpoints the image and edit journal log (backup copy). DataNodes store the actual blocks (e.g. BL1, BL2, BL3, BL6, BL7, BL8, BL9), with each block replicated across several DataNodes.
HDFS Heartbeats

[Diagram] Each DataNode daemon sends periodic HDFS heartbeats to the NameNode (which maintains the fsimage and editlog): "I'm datanode X, and I'm OK; I do have some new information for you: the new blocks are …"
Basic HDFS File System Commands
Here are a few (of the almost 30) HDFS commands:
-cat: just like Unix cat – displays file content (uncompressed)
-text: just like cat – but works on compressed files
-chgrp, -chmod, -chown: just like the Unix commands – change ownership and permissions
-put, -get, -copyFromLocal, -copyToLocal: copy files from the local file system to HDFS and vice versa (two versions of each)
-ls, -lsr: just like Unix ls – list files/directories
-mv, -moveFromLocal, -moveToLocal: move files
-stat: statistical info for a given file (block size, number of blocks, file type, etc.)
Commands Example
$ hadoop fs -ls /user/brian/
$ hadoop fs -lsr
$ hadoop fs -mkdir notes
$ hadoop fs -put ~/training/commands.txt notes
$ hadoop fs -chmod 777 notes/commands.txt
$ hadoop fs -cat notes/commands.txt | more
$ hadoop fs -rm notes/*.txt

$ hadoop fs -put filenameSrc filenameDest
$ hadoop fs -put filename dirName/fileName
$ hadoop fs -cat foo
$ hadoop fs -get foo LocalFoo
$ hadoop fs -rmr directory|file
MapReduce
A Basic MapReduce Job
map() implemented

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
  }
}
A Basic MapReduce Job
reduce() implemented

private final IntWritable totalCount = new IntWritable();

public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  totalCount.set(sum);
  output.collect(key, totalCount);
}
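The slides show only the map() and reduce() bodies. As a hedged sketch that is not part of the original deck, the listing below shows one way they are typically wired into a complete job with the classic org.apache.hadoop.mapred API used above; the class name WordCount, the nested class names, and the job name are illustrative assumptions.

// Hedged sketch: complete word-count job wiring the map() and reduce() above (old mapred API)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: wraps the map() implementation from the slide above
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);   // emit (word, 1) for each token
      }
    }
  }

  // Reducer: wraps the reduce() implementation from the slide above
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable totalCount = new IntWritable();

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();  // add up the 1s emitted for this word
      }
      totalCount.set(sum);
      output.collect(key, totalCount);
    }
  }

  // Driver: configures and submits the job (class and job names are assumptions)
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

It could then be submitted with something like $ hadoop jar wordcount.jar WordCount input/ output/ (jar name and paths are hypothetical).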
Pig
What is Pig?
• Pig is an extension of Hadoop that simplifies querying large HDFS datasets
• Pig is made up of two main components:
  – A SQL-like data processing language called Pig Latin
  – A compiler that compiles and runs Pig Latin scripts
• Pig was created at Yahoo! to make it easier to analyze the data in your HDFS without the complexity of writing a traditional MapReduce program
• With Pig, you can develop MapReduce jobs with a few lines of Pig Latin, as in the sketch below
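As an illustration of that last point, here is a hedged sketch (not from the original deck) of a complete word count in Pig Latin; the input path is hypothetical:

-- Hedged sketch: word count in a few lines of Pig Latin (hypothetical input path)
lines  = LOAD 'input/my.log' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;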
Running Pig
A Pig Latin script executes in three modes:
1. MapReduce: the code executes as a MapReduce application on a Hadoop cluster (the default mode)
2. Local: the code executes locally in a single JVM using a local text file (for development purposes)
3. Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell

$ pig myscript.pig
$ pig -x local myscript.pig
$ pig
grunt>
Understanding Pig Execution
• Pig Latin is a data flow language
• During execution each statement is processed by the Pig interpreter
• If a statement is valid, it gets added to a logical plan built by the interpreter
• The steps in the logical plan do not actually execute until a DUMP or STORE command is reached
A Pig Example
• The first three commands are built into a logical plan
• The STORE command triggers the logical plan to be built into a physical plan
• The physical plan will be executed as one or more MapReduce jobs

logevents = LOAD 'input/my.log' AS (date, level, code, message);
severe = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped = GROUP severe BY code;
STORE grouped INTO 'output/severeevents';
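To inspect results interactively instead of writing them back to HDFS, the final STORE in a script like this could be replaced with a DUMP from the Grunt shell (a sketch, not part of the original slides):

grunt> grouped = GROUP severe BY code;
grunt> DUMP grouped;   -- DUMP, like STORE, triggers execution of the plan and prints the result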
Hive
What is Hive?
• Hive is an Apache project that provides a data warehousing layer built on top of Hadoop
• Hive allows you to define a structure for your unstructured big data, simplifying the process of performing analysis and queries by introducing a familiar, SQL-like language called HiveQL
• Hive is for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis on their HDFS data
Hive is not…
• Hive is not a relational database
• Hive uses a database to store metadata, but the data that Hive processes is stored in HDFS
• Hive is not designed for on-line transaction processing and does not offer real-time queries or row-level updates
Pig vs. Hive
• Pig and Hive work well together
• Hive is a good choice:
  – when you want to query the data
  – when you need an answer to a specific question
  – if you are familiar with SQL
• Pig is a good choice:
  – for ETL (Extract -> Transform -> Load)
  – for preparing your data so that it is easier to analyze
  – when you have a long series of steps to perform
• Many businesses use both Pig and Hive together
What is a Hive Table?
• A Hive table consists of:
  – Data: typically a file or group of files in HDFS
  – Schema: in the form of metadata stored in a relational database
• Schema and data are separate:
  – A schema can be defined for existing data
  – Data can be added or removed independently
  – Hive can be "pointed" at existing data
• You have to define a schema if you have existing data in HDFS that you want to use in Hive, as in the sketch below
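As a hedged sketch of "pointing" Hive at data that already lives in HDFS (the table name, columns and path below are illustrative assumptions, not from the slides):

-- Define a schema over an existing HDFS directory; dropping the table later leaves the files in place
CREATE EXTERNAL TABLE weblogs (
  ip      STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/brian/weblogs';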
HiveQL
• Hive's SQL-like language, HiveQL, uses familiar relational database concepts such as tables, rows, columns and schemas
• Designed to work with structured data
• Converts SQL queries into MapReduce jobs
• Supports uses such as:
  – Ad-hoc queries
  – Summarization
  – Data analysis
Running Jobs with the Hive Shell
• The primary way people interact with Hive
$ hive
hive>
• Can also be run non-interactively
$ hive -f myhive.q
  – Use the -S option to show only the results
Hive Shell - Information
• At the terminal enter:
  – $ hive
• List all properties and values:
  – hive> set -v;
• List and describe tables:
  – hive> show tables;
  – hive> describe <tablename>;
  – hive> describe extended <tablename>;
• List and describe functions:
  – hive> show functions;
  – hive> describe function <functionname>;
Hive Shell – Querying Data
• Selecting Data
  – hive> SELECT * FROM students;
  – hive> SELECT * FROM students WHERE gpa > 3.6 SORT BY gpa ASC;
HiveQL
• HiveQL is similar to other SQL dialects
• The user does not need to know MapReduce
• HiveQL is based on the SQL-92 specification
• Supports multi-table inserts
Table Operations
• Defining a table (loading data into it is sketched after this list):

hive> CREATE TABLE mytable (name string, age int)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

• ROW FORMAT is a Hive-specific clause; here it indicates that each row is comma-delimited text
• HiveQL statements are terminated with a semicolon ';'
• Other table operations:
  – SHOW TABLES
  – CREATE TABLE
  – ALTER TABLE
  – DROP TABLE
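Once mytable exists, data is typically loaded from a file; a hedged sketch with hypothetical paths:

-- Move a comma-delimited file already in HDFS into the table (hypothetical path)
LOAD DATA INPATH '/user/brian/people.csv' INTO TABLE mytable;

-- Or copy a file from the local file system instead
LOAD DATA LOCAL INPATH '/home/brian/people.csv' INTO TABLE mytable;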
SELECT
• Simple query example (a fuller example follows this list):
  – SELECT * FROM mytable;
• Supports the following:
  – WHERE clause
  – ALL and DISTINCT
  – GROUP BY and HAVING
  – LIMIT clause (the rows returned are chosen at random)
  – REGEX column specification, e.g.:
    SELECT `(ds|hr)?+.+` FROM sales;
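Putting several of these clauses together on the sales table from the example above (the column names are hypothetical assumptions):

SELECT product, SUM(amount) AS total
FROM sales
WHERE ds = '2013-05-01'
GROUP BY product
HAVING SUM(amount) > 100
LIMIT 10;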
JOIN - Inner Joins
• Inner joins are implemented with ease:

SELECT * FROM students;
Steve   2.8
Raman   3.2
Mary    3.9

SELECT * FROM grades;
2.8   B
3.2   B+
3.9   A

SELECT students.*, grades.*
FROM students JOIN grades ON (students.grade = grades.grade);
Steve   2.8   2.8   B
Raman   3.2   3.2   B+
Mary    3.9   3.9   A
JOIN – Outer Joins
• Allow for finding rows with non-matches in the tables being joined
• Outer joins can be of three types (see the sketch after this list):
  – LEFT OUTER JOIN: returns a row for every row in the first (left) table
  – RIGHT OUTER JOIN: returns a row for every row in the second (right) table
  – FULL OUTER JOIN: returns a row for every row from both tables
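Continuing the students/grades example from the previous slide, a hedged sketch of a left outer join:

SELECT students.*, grades.*
FROM students LEFT OUTER JOIN grades ON (students.grade = grades.grade);
-- Students with no matching grade row still appear, with NULLs in the grades columns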
Sorting
• ORDER BY
– Sorts but sets the number of reducers to 1
• SORT BY
– Multiple reducers with a sorted file from each
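A hedged illustration of that difference, reusing the students table from earlier (the column names are assumptions):

-- Total ordering of all output; Hive forces a single reducer
SELECT name, gpa FROM students ORDER BY gpa DESC;

-- Each reducer produces a sorted output file, but there is no global order
SELECT name, gpa FROM students SORT BY gpa DESC;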
Hive Summary
• Not suitable for complex machine-learning algorithms
• All jobs have a minimum overhead and take time just for setup
  – It is still Hadoop MapReduce running on a cluster
• Good for batch jobs on large amounts of append-only data
  – Immutable filesystem
  – Does not support row-level updates (except through file deletion or creation)
HCatalog
What is HCatalog?
• Table management and storage management layer for Hadoop
• HCatalog provides a shared schema and data type mechanism
  – Enables users of different data processing tools (Pig, MapReduce and Hive) to have data interoperability
  – HCatalog provides read and write interfaces for Pig, MapReduce and Hive to HDFS (or other data sources)
  – HCatalog's data abstraction presents users with a relational view of data
• Command-line interface for data manipulation
• Designed to be accessed through other programs such as Pig, Hive, MapReduce and HBase
• HCatalog installs on top of Hive
HCatalog DDL
• CREATE/ALTER/DROP Table
• SHOW TABLES
• SHOW FUNCTIONS
• DESCRIBE
• Many of the commands in Hive are supported
  – Any command which is not supported throws an exception and returns the message "Operation Not Supported"
Accessing the HCatalog Metastore through the CLI
• Using the HCatalog client
  – Execute a script file:
    hcat -f "myscript.hcatalog"
  – Execute DDL directly:
    hcat -e 'create table mytable(a int);'
Define a New Schema
• A schema is defined as an HCatalog table:
create table mytable (
id int,
firstname string,
lastname string
)
comment 'An example of an HCatalog table'
partitioned by (birthday string)
stored as sequencefile;
Pig-Specific HCatStorer Interface
• Used with Pig scripts to write data to HCatalog-managed tables
• Accepts a table to write to and, optionally, a specification of partition keys to create a new partition
• HCatStorer is implemented on top of HCatOutputFormat
  – HCatStorer is accessed via a Pig STORE statement
  – Storing into a table partitioned on month, date, hour:

...
STORE my_processed_data INTO 'dbname.tablename'
  USING org.apache.hcatalog.pig.HCatStorer(
    'month=12, date=25, hour=0300',
    'a:int,b:chararray,c:map[]');
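The read-side counterpart to HCatStorer is HCatLoader, used from a Pig LOAD statement; a hedged sketch mirroring the table above (the partition values are assumed to be strings):

-- Read an HCatalog-managed table; the schema comes from the HCatalog metastore
raw = LOAD 'dbname.tablename' USING org.apache.hcatalog.pig.HCatLoader();
-- Partition pruning is expressed as an ordinary FILTER on the partition columns
dec25 = FILTER raw BY month == '12' AND date == '25';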
Thank You!
Questions & Answers
Page 41

Contenu connexe

Tendances

Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemDataWorks Summit/Hadoop Summit
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionDataWorks Summit/Hadoop Summit
 
YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014Hortonworks
 
HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache HiveCarl Steinbach
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 

Tendances (20)

Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application Adoption
 
YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014
 
The Heterogeneous Data lake
The Heterogeneous Data lakeThe Heterogeneous Data lake
The Heterogeneous Data lake
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache Hive
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 

En vedette

Sakilla Datawarehouse and Datamining
Sakilla Datawarehouse and DataminingSakilla Datawarehouse and Datamining
Sakilla Datawarehouse and DataminingUmberto Griffo
 
OOPs Development with Scala
OOPs Development with ScalaOOPs Development with Scala
OOPs Development with ScalaKnoldus Inc.
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive IntroductionHanborq Inc.
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Hadoop Fundamentals Certificate.PDF
Hadoop Fundamentals Certificate.PDFHadoop Fundamentals Certificate.PDF
Hadoop Fundamentals Certificate.PDFSantosh Singh
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platformnvvrajesh
 
Haddop in Business Intelligence
Haddop in Business IntelligenceHaddop in Business Intelligence
Haddop in Business IntelligenceHGanesh
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overviewvhrocca
 
People Pattern: "The Science of Sharing"
People Pattern: "The Science of Sharing"People Pattern: "The Science of Sharing"
People Pattern: "The Science of Sharing"People Pattern
 
Cloudera Developer Spark Hadoop I
Cloudera Developer Spark Hadoop ICloudera Developer Spark Hadoop I
Cloudera Developer Spark Hadoop IUmberto Griffo
 
How bigtop leveraged docker for build automation and one click hadoop provis...
How bigtop leveraged docker for build automation and  one click hadoop provis...How bigtop leveraged docker for build automation and  one click hadoop provis...
How bigtop leveraged docker for build automation and one click hadoop provis...Evans Ye
 
Webanalytics with haddop and Hive
Webanalytics with haddop and HiveWebanalytics with haddop and Hive
Webanalytics with haddop and HiveLe Kien Truc
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 

En vedette (20)

Sakilla Datawarehouse and Datamining
Sakilla Datawarehouse and DataminingSakilla Datawarehouse and Datamining
Sakilla Datawarehouse and Datamining
 
OOPs Development with Scala
OOPs Development with ScalaOOPs Development with Scala
OOPs Development with Scala
 
Pig statements
Pig statementsPig statements
Pig statements
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive Introduction
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Problems with beavers
Problems with beavers Problems with beavers
Problems with beavers
 
Hadoop Fundamentals Certificate.PDF
Hadoop Fundamentals Certificate.PDFHadoop Fundamentals Certificate.PDF
Hadoop Fundamentals Certificate.PDF
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platform
 
Haddop in Business Intelligence
Haddop in Business IntelligenceHaddop in Business Intelligence
Haddop in Business Intelligence
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 
People Pattern: "The Science of Sharing"
People Pattern: "The Science of Sharing"People Pattern: "The Science of Sharing"
People Pattern: "The Science of Sharing"
 
Cloudera Developer Spark Hadoop I
Cloudera Developer Spark Hadoop ICloudera Developer Spark Hadoop I
Cloudera Developer Spark Hadoop I
 
Hadoop Pig
Hadoop PigHadoop Pig
Hadoop Pig
 
Introduction to sqoop
Introduction to sqoopIntroduction to sqoop
Introduction to sqoop
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
How bigtop leveraged docker for build automation and one click hadoop provis...
How bigtop leveraged docker for build automation and  one click hadoop provis...How bigtop leveraged docker for build automation and  one click hadoop provis...
How bigtop leveraged docker for build automation and one click hadoop provis...
 
Webanalytics with haddop and Hive
Webanalytics with haddop and HiveWebanalytics with haddop and Hive
Webanalytics with haddop and Hive
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Sqoop
SqoopSqoop
Sqoop
 

Similaire à Yahoo! Hack Europe Workshop

Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azureMostafa
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Nicolas Morales
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in AzureMostafa
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptxVIJAYAPRABAP
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxiaeronlineexm
 

Similaire à Yahoo! Hack Europe Workshop (20)

SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hive
HiveHive
Hive
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptx
 

Plus de Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

Plus de Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Dernier

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Dernier (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Yahoo! Hack Europe Workshop

  • 1. ©  Hortonworks  Inc.  2012   Hadoop Workshop Chris  Harris     Twi6er  :  cj_harris5   E-­‐mail  :  charris@hortonworks.com    
  • 2. © Hortonworks Inc. 2013 Enhancing the Core of Apache Hadoop Page 2 HADOOP  CORE   PLATFORM  SERVICES   Enterprise Readiness HDFS   YARN  (in  2.0)   MAP  REDUCE   Deliver high-scale storage & processing with enterprise-ready platform services Unique Focus Areas: •  Bigger, faster, more flexible Continued focus on speed & scale and enabling near-real-time apps •  Tested & certified at scale Run ~1300 system tests on large Yahoo clusters for every release •  Enterprise-ready services High availability, disaster recovery, snapshots, security, …
  • 3. © Hortonworks Inc. 2013 Page 3 HADOOP  CORE   DATA   SERVICES   Distributed Storage & Processing PLATFORM  SERVICES   Enterprise Readiness Data Services for Full Data Lifecycle WEBHDFS   HCATALOG   HIVE  PIG   HBASE   SQOOP   FLUME   Provide data services to store, process & access data in many ways Unique Focus Areas: •  Apache HCatalog Metadata services for consistent table access to Hadoop data •  Apache Hive Explore & process Hadoop data via SQL & ODBC-compliant BI tools •  Apache HBase NoSQL database for Hadoop •  WebHDFS Access Hadoop files via scalable REST API •  Talend Open Studio for Big Data Graphical data integration tools
  • 4. © Hortonworks Inc. 2013 Operational Services for Ease of Use Page 4 OPERATIONAL   SERVICES   DATA   SERVICES   Store, Process and Access Data HADOOP  CORE   Distributed Storage & Processing PLATFORM  SERVICES   Enterprise Readiness OOZIE   AMBARI   Include complete operational services for productive operations & management Unique Focus Area: •  Apache Ambari: Provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues
  • 5. © Hortonworks Inc. 2013 Useful Links • Hortonworks Sandbox: –  http://hortonworks.com/products/hortonworks-sandbox • Sample Data: – http://internetcensus2012.bitbucket.org/paper.html – http://data.worldbank.org – Other speakers… Page 5
  • 6. © Hortonworks Inc. 2013 HDFS Architecture
  • 7. © Hortonworks Inc. 2013 HDFS Architecture NameNode   NameSpace   Block  Map   Block  Management   DataNode   BL1   BL6   BL2   BL7   NameSpace  MetaData     Image  (Checkpoint)     And  Edit  Journal  Log   Checkpoints  Image  and     Edit  Journal  Log  (backup)   Secondary   NameNode   DataNode   BL1   BL3   BL6   BL2   DataNode   BL1   BL7   BL8   BL9  
  • 8. © Hortonworks Inc. 2013 HDFS Heartbeats HDFS   heartbeats   Data   Node   daemon   Data   Node   daemon   Data   Node   daemon   Data   Node   daemon   “Im  datanode  X,  and     I’m  OK;  I  do  have  some   new  informa8on  for  you:   the  new  blocks  are  …”   NameNode   fsimage   editlog  
  • 9. © Hortonworks Inc. 2013 Basic HDFS File System Commands Here are a few (of the almost 30) HDFS commands: -cat: just like Unix cat – display file content (uncompressed) -text: just like cat – but works on compressed files -chgrp,-chmod,-chown: just like the Unix command, changes permissions -put,-get,-copyFromLocal,-copyToLocal: copies files from the local file system to the HDFS and vice-versa. Two versions. -ls, -lsr: just like Unix ls, list files/directories -mv,-moveFromLocal,-moveToLocal: moves files -stat: statistical info for any given file (block size, number of blocks, file type, etc.)
  • 10. © Hortonworks Inc. 2013 Commands Example $ hadoop fs –ls /user/brian/! $ hadoop fs -lsr! $ hadoop fs –mkdir notes! $ hadoop fs –put ~/training/commands.txt notes! $ hadoop fs –chmod 777 notes/commands.txt! $ hadoop fs –cat notes/commands.txt | more! $ hadoop fs –rm notes/*.txt! ! $ hadoop fs –put filenameSrc filenameDest! $ hadoop fs –put filename dirName/fileName! $ hadoop fs –cat foo! $ hadoop fs –get foo LocalFoo! $ hadoop fs –rmr directory|file! !
  • 11. © Hortonworks Inc. 2013 MapReduce
  • 12. © Hortonworks Inc. 2013 A Basic MapReduce Job map() implemented private final static IntWritable one = new IntWritable(1);! private Text word = new Text(); !! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! }! }!
  • 13. © Hortonworks Inc. 2013 A Basic MapReduce Job reduce() implemented ! private final IntWritable totalCount = new IntWritable();! ! public void reduce(Text key, ! !Iterator<IntWritable> values, OutputCollector<Text, !IntWritable> output, Reporter reporter) throws IOException {! int sum = 0;! while (values.hasNext()) {! sum += values.next().get();! }! ! totalCount.set(sum);! output.collect(key, totalCount);! }!
  • 15. © Hortonworks Inc. 2013 What is Pig? • Pig is an extension of Hadoop that simplifies the ability to query large HDFS datasets • Pig is made up of two main components: – A SQL-like data processing language called Pig Latin – A compiler that compiles and runs Pig Latin scripts • Pig was created at Yahoo! to make it easier to analyze the data in your HDFS without the complexities of writing a traditional MapReduce program • With Pig, you can develop MapReduce jobs with a few lines of Pig Latin
  • 16. © Hortonworks Inc. 2013 Running Pig A Pig Latin script executes in three modes: 1.  MapReduce: the code executes as a MapReduce application on a Hadoop cluster (the default mode) 2.  Local: the code executes locally in a single JVM using a local text file (for development purposes) 3.  Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell $ pig myscript.pig $ pig -x local myscript.pig $ pig! grunt>
  • 17. © Hortonworks Inc. 2013 Understanding Pig Execution • Pig Latin is a data flow language • During execution each statement is processed by the Pig interpreter • If a statement is valid, it gets added to a logical plan built by the interpreter • The steps in the logical plan do not actually execute until a DUMP or STORE command
  • 18. © Hortonworks Inc. 2013 A Pig Example • The first three commands are built into a logical plan • The STORE command triggers the logical plan to be built into a physical plan • The physical plan will be executed as one or more MapReduce jobs logevents = LOAD ‘input/my.log’ AS (date, level, code, message); severe = FILTER logevents BY (level == ‘severe’ AND code >= 500); grouped = GROUP severe BY code; STORE grouped INTO ‘output/severeevents’;
  • 19. © Hortonworks Inc. 2013 Hive
  • 20. © Hortonworks Inc. 2013 What is Hive? • Hive is a subproject of the Apache Hadoop project that provides a data warehousing layer built on top of Hadoop • Hive allows you to define a structure for your unstructured big data, simplifying the process of performing analysis and queries by introducing a familiar, SQL-like language called HiveQL • Hive is for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis on their HDFS data
  • 21. © Hortonworks Inc. 2013 Hive is not… • Hive is not a relational database • Hive uses a database to store metadata, but the data that Hive processes is stored in HDFS • Hive is not designed for on-line transaction processing and does not offer real-time queries and row level updates
  • 22. © Hortonworks Inc. 2013 Pig vs. Hive • Pig and Hive work well together • Hive is a good choice: –  when you want to query the data – when you need an answer to a specific questions – if you are familiar with SQL • Pig is a good choice: – for ETL (Extract -> Transform -> Load) – preparing your data so that it is easier to analyze – when you have a long series of steps to perform • Many businesses use both Pig and Hive together
  • 23. © Hortonworks Inc. 2013 What is a Hive Table? • A Hive table consists of: – Data: typically a file or group of files in HDFS – Schema: in the form of metadata stored in a relational database • Schema and data are separate. – A schema can be defined for existing data – Data can be added or removed independently – Hive can be "pointed" at existing data • You have to define a schema if you have existing data in HDFS that you want to use in Hive
  • 24. © Hortonworks Inc. 2013 HiveQL • Hive’s SQL-like language, HiveQL, uses familiar relational database concepts such as tables, rows, columns and schemas • Designed to work with structured data • Converts SQL queries into MapReduce jobs • Supports uses such as: – Ad-hoc queries – Summarization – Data analysis
  • 25. © Hortonworks Inc. 2013 Running Jobs with the Hive Shell • The primary way people interact with Hive $ hive hive> • Can also run the shell in a non-interactive way $ hive -f myhive.q – Use the -S (silent) option to show only the results
  • 26. © Hortonworks Inc. 2013 Hive Shell - Information • At terminal enter: – $ hive • List all properties and values: – hive> set -v • List and describe tables – hive> show tables; – hive> describe <tablename>; – hive> describe extended <tablename>; • List and describe functions – hive> show functions; – hive> describe function <functionname>;
  • 27. © Hortonworks Inc. 2013 Hive Shell – Querying Data • Selecting Data – hive> SELECT * FROM students; – hive> SELECT * FROM students WHERE gpa > 3.6 SORT BY gpa ASC;
  • 28. © Hortonworks Inc. 2013 HiveQL • HiveQL is similar to other SQL dialects • The user does not need to know MapReduce • HiveQL is based on the SQL-92 specification • Supports multi-table inserts via your code
  • 29. © Hortonworks Inc. 2013 Table Operations • Defining a table: hive> CREATE TABLE mytable (name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; • ROW FORMAT DELIMITED is a Hive-specific clause that indicates each row is stored as comma-delimited text • HiveQL statements are terminated with a semicolon ';' • Other table operations (see the sketch below): – SHOW TABLES – CREATE TABLE – ALTER TABLE – DROP TABLE
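A few of the other table operations listed above, sketched against the mytable definition; the added column and the new table name are made up for illustration:
    hive> SHOW TABLES;
    hive> ALTER TABLE mytable ADD COLUMNS (city STRING);
    hive> ALTER TABLE mytable RENAME TO people;
    hive> DROP TABLE people;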
  • 30. © Hortonworks Inc. 2013 SELECT • SELECT – Simple query example: – SELECT * FROM mytable; • Supports the following: – WHERE clause – ALL and DISTINCT – GROUP BY and HAVING – LIMIT clause – Rows returned by LIMIT are chosen at random – Can use a REGEX column specification – Example: – SELECT `(ds|hr)?+.+` FROM sales;
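A small, hypothetical query combining several of these clauses; the sales table and its columns store, amount and ds are assumptions for the sketch, not defined in the slides:
    hive> SELECT store, SUM(amount) AS total
        > FROM sales
        > WHERE ds = '2013-01-01'
        > GROUP BY store
        > HAVING SUM(amount) > 1000
        > LIMIT 10;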
  • 31. © Hortonworks Inc. 2013 JOIN - Inner Joins •  Inner joins are implemented with ease:
SELECT * FROM students;
Steve 2.8
Raman 3.2
Mary 3.9
SELECT * FROM grades;
2.8 B
3.2 B+
3.9 A
SELECT students.*, grades.* FROM students JOIN grades ON (students.grade = grades.grade);
Steve 2.8 2.8 B
Raman 3.2 3.2 B+
Mary 3.9 3.9 A
  • 32. © Hortonworks Inc. 2013 JOIN – Outer Joins • Allows for the finding of rows with non-matches in the tables being joined • Outer Joins can be of three types – LEFT OUTER JOIN – Returns a row for every row in the first table entered – RIGHT OUTER JOIN – Returns a row for every row in the second table entered – FULL OUTER JOIN – Returns a row for every row from both tables
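For example, a LEFT OUTER JOIN on the students and grades tables from the previous slide (column names as shown there) returns every student, with NULLs in the grades columns when there is no match:
    hive> SELECT s.*, g.*
        > FROM students s
        > LEFT OUTER JOIN grades g ON (s.grade = g.grade);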
  • 33. © Hortonworks Inc. 2013 Sorting • ORDER BY – Sorts globally but sets the number of reducers to 1 • SORT BY – Uses multiple reducers, producing a sorted file from each
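A minimal sketch of the difference, reusing the students table from the earlier slides:
    -- one reducer, one globally sorted result
    SELECT * FROM students ORDER BY gpa DESC;
    -- many reducers, each producing its own sorted output file (no total order across files)
    SELECT * FROM students SORT BY gpa DESC;
Hive also provides DISTRIBUTE BY to control which reducer each row goes to; it is commonly combined with SORT BY when a per-group ordering is needed.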
  • 34. © Hortonworks Inc. 2013 Hive Summary • Not suitable for complex machine learning algorithms • All jobs have a minimum overhead and will take time just for setup – it is still Hadoop MapReduce on a cluster • Good for batch jobs on large amounts of append-only data – Immutable filesystem – Does not support row-level updates (except through file deletion or creation)
  • 35. © Hortonworks Inc. 2013 HCatalog
  • 36. © Hortonworks Inc. 2013 What is HCatalog? • Table management and storage management layer for Hadoop • HCatalog provides a shared schema and data type mechanism – Enables users of different data processing tools – Pig, MapReduce and Hive – to interoperate on the same data –  HCatalog provides read and write interfaces for Pig, MapReduce and Hive to HDFS (or other data sources) – HCatalog’s data abstraction presents users with a relational view of data • Command line interface for data manipulation • Designed to be accessed through other programs such as Pig, Hive, MapReduce and HBase • HCatalog installs on top of Hive
  • 37. © Hortonworks Inc. 2013 HCatalog DDL • CREATE/ALTER/DROP Table • SHOW TABLES • SHOW FUNCTIONS • DESCRIBE • Many of the commands in Hive are supported – Any command which is not supported throws an exception and returns the message "Operation Not Supported".
  • 38. © Hortonworks Inc. 2013 Accessing HCatalog Metastore through CLI • Using HCatalog client – Execute a script file: hcat -f "myscript.hcatalog" – Execute DDL: hcat -e 'create table mytable(a int);'
  • 39. © Hortonworks Inc. 2013 Define a New Schema • A schema is defined as an HCatalog table: create table mytable ( id int, firstname string, lastname string ) comment 'An example of an HCatalog table' partitioned by (birthday string) stored as sequencefile;
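Once the table exists, partitions are typically added per value of the partition column; a hypothetical example using the HCatalog CLI, where the birthday value and the HDFS LOCATION path are placeholders:
    hcat -e "ALTER TABLE mytable ADD PARTITION (birthday='1980-01-01') LOCATION '/user/me/people/birthday=1980-01-01';"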
  • 40. © Hortonworks Inc. 2013 Pig Specific HCatStorer Interface • Used with Pig scripts to write data to HCatalog managed tables. • Accepts a table to write to and optionally a specification of partition keys to create a new partition • HCatStorer is implemented on top of HCatOutputFormat – HCatStorer is accessed via a Pig store statement. – Storing into a table partitioned on month, date, hour: STORE my_processed_data INTO 'dbname.tablename' USING org.apache.hcatalog.pig.HCatStorer('month=12, date=25, hour=0300', 'a:int,b:chararray,c:map[]');
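The read-side counterpart is HCatLoader; a short Pig sketch (the table name and the month filter value are illustrative) that loads an HCatalog-managed table and prunes partitions with an ordinary FILTER:
    -- the schema comes from the HCatalog metastore, so no AS clause is needed
    raw = LOAD 'dbname.tablename' USING org.apache.hcatalog.pig.HCatLoader();
    -- filtering on a partition column lets HCatalog prune partitions instead of scanning everything
    december = FILTER raw BY month == '12';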
  • 41. © Hortonworks Inc. 2013 Thank You! Questions & Answers Page 41

Editor's notes

  1. Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
  2. Despite its name the SNN does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary. Edit log: When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests. The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure. fsimage: The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up. According to Apache, SecondaryNameNode is deprecated. See: http://hadoop.apache.org/hdfs/docs/r0.22.0/hdfs_user_guide.html#Secondary+NameNode (Accessed May 2012). – LW. SNN configuration options: dfs.namenode.checkpoint.period <value>; dfs.namenode.checkpoint.size <value>; dfs.http.address <value> (point to the namenode's port 50070, used to get the fsimage and edit log); dfs.namenode.checkpoint.dir <value> (else defaults to /tmp).
  3. Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. Cluster Rebalancing: The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented. Data Integrity: It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block. Metadata Disk Failure: The fsImage and the editLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use. The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
  4. Use the put command. If bar is a directory then foo will be placed in the directory bar. If no file or directory exists with the name bar, foo will be copied and named bar. If bar is already a file in HDFS then an error is returned. If subName does not exist then the file will be created in the directory. Note: -copyFromLocal can be used instead of -put.
  5. Can display using cat. Copy a file from HDFS to the local file system with the -get command: $ bin/hadoop fs -get foo LocalFoo. Other commands: $ hadoop fs -rmr directory|file (removes recursively).
  6. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system (HDFS). The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. Compute and storage typically run on the same set of nodes; this configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. M/R supports multiple nodes processing contiguous data sets in parallel. Often, one node will “Map”, while another “Reduces”. In between is the “shuffle”. We’ll cover all these in more detail. Basic “phases”, or steps, are displayed here. Other custom phases may be added. In MapReduce data is passed around as key/value pairs. QUESTION: is the “Reduce” phase required? NO. INSTRUCTOR SUGGESTION: During the second day start, have one or more students re-draw this on the white board to re-iterate its importance.
  7. MapReduce consists of Java classes that can process big data in parallel processes. It receives three inputs (a source collection, a Map function and a Reduce function) and returns a new result data collection. The algorithm is composed of a few steps: the first one executes the map() function on each item within the source collection. Map will return zero or many instances of Key/Value objects. Map's responsibility is to convert an item from the source collection into zero or many instances of Key/Value pairs. In the next step, an algorithm sorts all Key/Value pairs to create new object instances where all values are grouped by Key. reduce() then processes each grouped Key/Value instance and returns a new instance to be included in the result collection.
  8. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. An InputFormat determines how the data is split across map tasks; TextInputFormat is the default InputFormat. It divides the input data into an InputSplit[] array. Each InputSplit is given an individual map. Each InputSplit is associated with a destination node that should be used to read the data, so Hadoop can try to assign the computation to the local data location. The RecordReader class: reads input data and produces key-value pairs to be passed into the Mapper; may control how data is decompressed; converts data to Java types that MapReduce can work with. MapReduce relies on the InputFormat to: validate the input-specification of the job; split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper; provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper. The default behavior of InputFormat is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. [However, the HDFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via the config parameter mapred.min.split.size.] * validate this. Logical splits based on input size are insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task. If the default TextInputFormat is used for a given job, the framework detects compressed input files (i.e., with the .gz extension) and automatically decompresses them using an appropriate CompressionCodec. Note that, depending on the codec used, compressed files cannot be split and each compressed file is processed in its entirety by a single mapper. gzip cannot be split; bzip and lzo can.
  9. Mapping is often used for typical “ETL” processes, such as filtering, transformation, and categorization. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The OutputCollector is provided by the framework to collect data output by the Mapper or the Reducer. The Reporter class is a facility for MapReduce applications to report progress, set application-level status messages and update Counters. Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs). Applications can also update Counters using the Reporter. Parallel map processes: the number of maps is usually driven by the number of inputs, or the total number of blocks in the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute. If you expect 10TB of input data and have a block size of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
  10. The Partitioner controls the balancing of the keys of intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function on the key (or a portion of it). All Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of Reduce tasks for the job. The Partitioner is like a load balancer that controls which one of the Reduce tasks the intermediate key (and hence the record) is sent to for reduction. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner. All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer to determine final output. Users can control that grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class). MapReduce comes bundled with a library of generally useful Mappers, Reducers, and Partitioner classes.
  11. MapReduce partitions data among individual Reducers, usually running on separate nodes. The Shuffle phase is determined by how a Partitioner (embedded or custom) partitions key/value pairs to Reducers. The shuffle is where refinements and improvements are continually being made. In many ways, the shuffle is the heart of MapReduce and is where the “magic” happens.
  12. The Sort phase guarantees that the input to every Reducer is sorted by key.
  13. A Reducer extends MapReduceBase() and implements the Reducer interface. It receives output from multiple mappers; the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. It sorts data by the key of the key/value pair and typically iterates through the values associated with a key. It is perfectly legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to HDFS, to the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to HDFS. Reducer has 3 primary phases: Shuffle, Sort and Reduce. Shuffle: Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP. Sort: The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged. Secondary Sort: If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values. Reduce: In this phase the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to HDFS via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
  14. OutputFormat has two responsibilities: determining where and how data is written. It determines where by examining the JobConf and checking to make sure it is a legitimate destination. The 'how' is handled by the getRecordWriter() function. From Hadoop's point of view, these classes come together when the MapReduce job finishes and each reducer has produced a stream of key-value pairs. Hadoop calls checkOutputSpecs with the job's configuration. If the function runs without throwing an Exception, it moves on and calls getRecordWriter, which returns an object which can write that stream of data. When all of the pairs have been written, Hadoop calls the close function on the writer, committing that data to HDFS and finishing the responsibility of that reducer. MapReduce relies on the OutputFormat of the job to: validate the output-specification of the job (for example, check that the output directory doesn't already exist), and provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem (usually HDFS). TextOutputFormat is the default OutputFormat. OutputFormats specify how to serialize data by providing an implementation of RecordWriter. RecordWriter classes handle the job of taking an individual key-value pair and writing it to the location prepared by the OutputFormat. There are two main functions a RecordWriter implements: 'write' and 'close'. The 'write' function takes key-values from the MapReduce job and writes the bytes to disk. The default RecordWriter is LineRecordWriter, part of the TextOutputFormat mentioned earlier. It writes: the key's bytes (returned by the getBytes() function), a tab character delimiter, the value's bytes (again, produced by getBytes()), and a newline character. The 'close' function closes the Hadoop data stream to the output file. We've talked about the format of output data, but where is it stored? Again, you've probably seen the output of a job stored in many 'part' files under the output directory like so:
|-- output-directory
|   |-- part-00000
|   |-- part-00001
|   |-- part-00002
|   |-- part-00003
|   |-- part-00004
|   '-- part-00005
from http://www.infoq.com/articles/HadoopOutputFormat
  15. Mostly "boilerplate" with fill-in-the-blank for datatypes. Hadoop datatypes have well defined methods to get/put values into them. Standard pattern: declare your output objects outside map() loop scope. Standard pattern: 1) get Java datatypes, 2) do your logic, 3) put into Hadoop datatypes. This map() method is called in a "loop" with successive key, value pairs. Each time through, you typically write key, value pairs to the reduce phase. The key, value pairs go to the Hadoop Framework where the key is hashed to choose a reducer.
  16. In the Reducer: Reduce code is copied to multiple nodes in the cluster; the copies are identical. All key,value pairs with identical keys are placed into a pair of (key, Iterator<values>) when passed to the reducer. Each invocation of the reduce() method is passed all values associated with a particular key. The keys are guaranteed to arrive in sorted order. Each incoming value is a numeric count of words in a particular line of text; the multiple values represent multiple lines of text processed by multiple map() methods. The values are in an Iterator, and are summed in a loop. The sum, with its associated key, is sent to a disk file via the Hadoop Framework. There are typically many copies of your reduce code. Incoming key,value pairs are of the datatypes that map emits. Output key,value pairs go to HDFS, which writes them to disk. The Hadoop Framework constructs the Iterator from map values that are sent to this Reducer.
  17. An inverted index is a data structure that stores a mapping from content to its locations in a database file, or in a document or a set of documentsCan provide what to display for a searchNOTE: need to deal with OutputCollector() here.
  18. First the files are partitioned into parts that will be distributed for processing across the cluster nodes. Each part is parsed into pairs of Key (sortable object) - Value (object), which will be the input parameters for the tasks implementing the Map function. These user-defined tasks (map) will read the value object, do something with it, and then build a new key-value list that will be stored by the framework in intermediate files. Once all the map tasks are finished, the whole data to process has been completely read and reordered into this MapReduce model of key-value pairs. These intermediate key-value results are combined, resulting in new key-value pairs that will be the input for the next reduce tasks. These user-defined tasks (reduce) will read the value object, do something with it, and then produce the third and last list of key-value pairs, which the framework will combine and regroup into a final result. From http://hadooper.blogspot.com/
  19. The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. The job client also: checks job specifications; calculates the InputSplit[]'s for the job; initiates the DistributedCache of the job, if there is one; deploys the runtime jar and configurations to the system directory on HDFS; submits the job to the JobTracker; and monitors the job.
  20. TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from the JobTracker. TaskTracker is configured with a set of slots; slots define the bound on the number of tasks that TaskTracker can accept. When JobTracker attempts to schedule a task, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack. TaskTracker spawns separate JVM processes to do the actual work; this is to ensure that process failure does not take TaskTracker down. TaskTracker monitors spawned processes, capturing output and exit codes. When the process finishes, TaskTracker notifies JobTracker as to success or failure. TaskTracker also sends out “I’m alive” heartbeat messages to the JobTracker regularly. The message included in the heartbeat also tells JobTracker the number of available slots, so JobTracker knows which nodes are available for new tasks.
  21. You can chain MapReduce jobs to accomplish complex tasks which cannot be completed with a single job. This is fairly easy since the output of the job typically goes to HDFS; in this way the output can be used as the input for the next sequential job. Remember that ensuring jobs are complete (success/failure) is a client responsibility. Job-control options are: runJob() submits the job and returns only after the job has completed; submitJob() only submits the job, after which you poll the returned handle to the RunningJob to query status and make scheduling decisions; JobConf.setJobEndNotificationURI() sets up a notification upon job-completion, thus avoiding polling.
  22. -Dproperty=value allows for the setting of values at the command line to override default site properties and properties set with the -conf option (e.g. numReduceTasks). -conf adds an XML formatted property file to your job. -fs uri is a shortcut for -Dfs.default.name=uri. -files copies files from the local client filesystem to the local filesystem on the cluster nodes; the Distributed Cache is used.
  23. GenericOptionsParser: Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally. GenericOptionsParser also allows you to set individual properties. For example, $ hadoop ConfigurationPrinter -D color=yellow | grep color prints color=yellow. The -D option is used to set the configuration property with key color to the value yellow. Options specified with -D take priority over properties from the configuration files. This is very useful: you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.
  24. JobConf represents a MapReduce job configuration. It sets up properties associated with a particular Job submission, reads a Configuration() instance to get some of those properties, assigns a default name to the Job, and specifies the key & value datatypes output from mapper() and the datatypes output from reducer(). JobConf is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. JobConf is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, OutputFormat and OutputCommitter implementations. It also indicates the set of input files (setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path) and setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)). JobConf is optionally used to specify other advanced facets of the job such as the Comparator to be used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed (and how), debugging via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)), whether job tasks can be executed in a speculative manner (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)), the percentage of task failures which can be tolerated by the job (setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)), etc. Of course, users can use set(String, String)/get(String, String) to set/get arbitrary parameters needed by applications. NOTE: for performance, use the DistributedCache for large amounts of (read-only) data.
  25. See more at http://hadoop.apache.org/common/docs/current/api/index.html?overview-summary.html
  26. Reference Task Tracker slides. One of the primary reasons to use Hadoop to run your jobs is due to its high degree of fault tolerance. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Hadoop can guide jobs toward a successful completion. The primary way that Hadoop achieves fault tolerance is through restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system, called the JobTracker. If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume that the TaskTracker in question has crashed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker. If the job is still in the mapping phase, then other TaskTrackers will be asked to re-execute all map tasks previously run by the failed TaskTracker. If the job is in the reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in progress on the failed TaskTracker. Reduce tasks, once completed, have been written back to HDFS. Thus, if a TaskTracker has already completed two out of three reduce tasks assigned to it, only the third task must be executed elsewhere. Map tasks are slightly more complicated: even if a node has completed ten map tasks, the reducers may not have all copied their inputs from the output of those map tasks. If a node has crashed, then its mapper outputs are inaccessible. So any already-completed map tasks must be re-executed to make their results available to the rest of the reducing machines. All of this is handled automatically by the Hadoop platform. This fault tolerance underscores the need for program execution to be side-effect free. If Mappers and Reducers had individual identities and communicated with one another or the outside world, then restarting a task would require the other nodes to communicate with the new instances of the map and reduce tasks, and the re-executed tasks would need to reestablish their intermediate state. This process is notoriously complicated and error-prone in the general case. MapReduce simplifies this problem drastically by eliminating task identities or the ability for task partitions to communicate with one another. An individual task sees only its own direct inputs and knows only its own outputs, to make this failure and restart process clean and dependable. From http://developer.yahoo.com/hadoop/tutorial/module4.html#basics
  27. The default number of reducers is config dependent. The number of output files = number of reducers: one reducer implies one output file, ten reducers imply ten output files. In your code you can specify the number of reducers, and it is possible to specify zero reducers. With a "map only" job the number of output files = number of mappers. The number of mappers = the number of input blocks. Your code can implement InputSplit[] to set the number of mappers.
  28. Pig Latin is a data flow language – a sequence of steps where each step is an operation or command. During execution each statement is processed by the Pig interpreter and checked for syntax errors. If a statement is valid, it gets added to a logical plan built by the interpreter. However, the step does not actually execute until the entire script is processed, unless the step is a DUMP or STORE command, in which case the logical plan is compiled into a physical plan and executed.
  29. After the LOAD statement, point out that nothing is actually loaded! The LOAD command only defines a relation. The FILTER and GROUP commands are fairly self explanatory. The HDFS is not even touched until the STORE command executes, at which point a MapReduce application is built from the Pig Latin statements shown here.
  30. Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and does basic syntax and semantic checking as you enter each line. It also provides a shell for users to interact with HDFS. To enter Grunt and use the local file system instead: $ pig -x local. In the example script: A is called a relation or alias; it is immutable, and recreated if reused. myfile is read from your home directory in HDFS. A has no schema associated with it (bytearray). LOAD uses the default function PigStorage() to load data, which assumes TAB-delimited data in HDFS. The entire file 'myfile' will be read ("pigs eat anything"). The elements in A can be referenced by position if no schema is associated: $0, $1, $2, ...
  31. Pig return codes:(type in page 18 of Pig)
  32. The complex types can contain data of any type
  33. The DESCRIBE output is:
describe employees
employees: {name: chararray, age: bytearray, zip: int, salary: bytearray}
  34. Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it.from Programming Pig, Alan Gates 2011
  35. twokey.pig collects all records with the same value for the provided key together into a bag. Within a single relation, group together tuples with the same group key. Can use keywords BY and ALL with GROUP. Can optionally be passed to an aggregate function (count.pig). Can group on multiple keys if they are surrounded by parentheses.
  36. Web site click log processing; user activity; daily updates; business intelligence; advertising placement; data mining (key words).
  37. The metastore is the central repository of Hive metadata regarding tables, partitions and databases. It runs in the same JVM as the Hive service and services metadata requests from clients. Hive stores metadata in a standard relational database; the default database is Derby, an open source, in-memory, embedded SQL database, which is run on the same machine as Hive. Derby is a single-user database; you need to move to MySQL to support multiple users. The metastore database is typically called metastore_db. Hive supports remote metastores: metastores running in separate processes to the Hive service. The Hive server is really just another way to submit queries using Hive, an alternative to the Command Line Interface; it supports clients using different languages, exposes a Thrift service, and provides APIs that can be used by other clients (such as JDBC drivers) to talk to Hive. Hive clients: Thrift client (supports C++, Java, PHP, Python and Ruby), JDBC driver, ODBC driver. Classes to serialize and deserialize data: SerDe is a short name for Serializer and Deserializer. Serializing means converting data into a format for storage; a deserializer extracts the data structure from a series of bytes. Hive uses SerDe to read from/write to tables. Framework libraries allow users to develop serializers and deserializers for their own data formats, and contain some builtin serialization/deserialization families.
  38. You have to use the ANSI standard JOIN syntax (i.e., no WHERE clause in a JOIN).
  39. Table formats: the default table format without using ROW FORMAT or STORED AS is delimited text, one row per line. Control-A is the default field delimiter. The default collection delimiter for an ARRAY, STRUCT or MAP is the Control-B character. The default map key delimiter is Control-C.
  40. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files. HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe. SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats. For JSON files, Amazon has provided a JSON SerDe available at: s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.
  41. Slide 40 - all of Ecosystem – use these early. DO THIS VERY EARLY!