Hadoop MR Streaming in Hive
A real-life use case with Hive and Python
Yauheni Yushyn, EPAM Systems – September 2014
Agenda 
• Intro 
• Pros and Cons 
• Hive reference 
• Use case from Real Life 
• Possible solutions 
• Hive Streaming: Architecture 
• Hive Streaming: Realization 
• Hive Streaming: Source code 
• Hive Streaming: Debug 
• Hive Streaming: Pitfalls 
• Hive Streaming: Benchmarks
SECTION 
Hadoop MR Streaming in Hive 
CONCEPTS 
Intro
Streaming offers an alternative way to transform data. During a streaming job, the Hadoop
Streaming API opens an I/O pipe to an external process.
Unix-like interface:
• The Streaming API opens an I/O pipe to an external process
• The process reads data from standard input and writes results back through standard output
By default,
INPUT for the user script:
• columns are converted to STRING
• delimited by TAB
• NULL values are converted to the literal string \N (to differentiate NULLs from empty strings)
OUTPUT of the user script:
• treated as TAB-separated STRING columns
• \N is re-interpreted as NULL
• each resulting STRING column is cast to the data type specified in the table declaration
These defaults can be overridden with ROW FORMAT.
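For illustration, a minimal query exercising these defaults; the table, script, and column names here are hypothetical, not from the deck:

-- Each row reaches my_script.py on stdin as "origin<TAB>destination<TAB>lowest_price",
-- with NULLs spelled \N; whatever the script prints as TAB-separated lines comes back
-- as the declared columns, with \N turned back into NULL and values cast to the types below.
SELECT TRANSFORM (origin, destination, lowest_price)
USING 'my_script.py'
AS (origin STRING, destination STRING, price_note STRING)
FROM source_table;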
Pros and Cons
Pros:
• Simplicity for the developer: just deal with stdin/stdout
• Schema-less model: treat values as needed
• Non-Java interface
Cons:
• Overhead for serialization/deserialization between processes
• Disallowed when "SQL standard based authorization" is configured (Hive 0.13.0 and later releases)
Hive reference
• MAP()
• REDUCE()
• TRANSFORM()
Hive provides several clauses for streaming: MAP(), REDUCE(), and TRANSFORM().
Note:
MAP() does not actually force streaming to happen during the map phase, nor does
REDUCE() force streaming to happen in the reduce phase. For this reason, the
functionally equivalent yet more generic TRANSFORM() clause is suggested, to avoid
misleading the reader of the query.
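For illustration, an identity transform through /bin/cat (hypothetical table name); spelling it MAP() instead would change nothing about execution, which is exactly why TRANSFORM() is the less misleading choice:

SELECT TRANSFORM (origin, destination)
USING '/bin/cat'   -- streams every row through unchanged
AS (origin STRING, destination STRING)
FROM source_table;
-- Equivalent, but falsely hints that a map phase is enforced:
-- SELECT MAP (origin, destination) USING '/bin/cat' AS (origin STRING, destination STRING) FROM source_table;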
SECTION 
Hadoop MR Streaming in Hive 
USE CASE 
Use case from Real Life
Requirements:
There are 14 flags in the source Hive table that control the values of 4 new fields in the target table.
Solutions:
• Hive "case … when" clause
• User Defined Function (UDF)
• Custom MR job
• Hive Streaming
Use case from Real Life: Requirements
Hive "case … when" clause 
• There’re more than 1,500 lines of code to 
map flags with new fields (statement 
repeats for every new output field) 
• Complexity for debugging 
• Fast execution 
• SQL-like syntax 
• All logic in one place (hql script) 
10
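To see why the script balloons, here is a toy sketch with only two of the 14 flags and two of the 4 output fields; the mapping is illustrative, not the real matrix, and the full WHEN ladder must be repeated for every output field:

SELECT
  CASE
    WHEN exp_listed_on_route_flag = '0' THEN 'Inventory'
    WHEN split_ticket_flag = '1' THEN 'Price'
    ELSE 'DEFAULT'
  END AS new_field_1,
  CASE
    WHEN exp_listed_on_route_flag = '0' THEN 'Epam Lost'
    WHEN split_ticket_flag = '1' THEN 'Epam Won'
    ELSE 'DEFAULT'
  END AS new_field_2
FROM source_table;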
UDF
Cons:
• You are the only consumer of the UDF (in this particular case: custom logic for a single DataMart)
• Java code
Pros:
• Fast execution
• Only the needed flags are passed into the UDF (in contrast with Hive Streaming)
• As with "case … when": SQL-like syntax, all logic in one place
Hive Streaming
Cons:
• Slower execution (time spent on SerDe)
• The script deals with all fields, not only the flags (in contrast with a UDF)
Pros:
• Reduced code complexity thanks to a scripting language
• Small amount of code
• Fast development
• Wide choice of programming languages
SECTION 
Hadoop MR Streaming in Hive 
REALIZATION 
Hive Streaming: Architecture
Hive Streaming: Realization
Python snippets:
• Create a matrix (a list of tuples) mapping flag patterns to the values of the new fields
• Loop through the INPUT
• Split the INPUT by TAB
• Separate the data fields from the flags
• Compare against the matrix and take the best (maximum) match
• Emit the data with the new fields appended, as TAB-separated text
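The deck shows the script but not the query that invokes it, so here is a minimal sketch of the wiring; the table, column, and output-field names (reason_group, outcome) are assumptions, and only two of the flags are shown:

ADD FILE /path/to/flags_to_fields.py;              -- ship the script to the Distributed Cache

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION (shop_date, partner_pos)
SELECT TRANSFORM (origin, destination,             -- data columns...
                  exp_listed_on_route_flag,        -- ...then the flags...
                  split_ticket_flag,
                  shop_date, partner_pos)          -- ...partition columns last
USING 'flags_to_fields.py'
AS (origin STRING, destination STRING,
    reason_group STRING, outcome STRING,           -- the new fields emitted by the script
    shop_date STRING, partner_pos STRING)          -- still last, for dynamic partitioning
FROM source_table;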
Hive Streaming: Source code 
#!/usr/bin/env python 
"""Mapper for Hive Streaming, using Python iterators and generators.Spill out new fields in accordance with input flags.""" 
import sys 
import logging 
def read_input(file): 
"""Read data from STDIN using python generator""" 
#yield "IAHtCUNtIAH-CUN 
t01tt14tUSDt520.99t4t19tNtNt0tNtDIDt2tDIDtDIDtDIDtCHEAPTICKETSt520.99tORBITZt520.99tNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtN 
tNtNtNtNtNtNtNtTRIPADVISOR - USt01t01t0t0t0t0t0t1t0t0t0t1t0t0t0t0t0t0t0t0t0t2014-01-01tEpam.COM" 
#yield "IAHtCUNtIAH-CUN 
t01tt14tUSDt520.99t4t19tNtNt0tNtDIDt2tDIDtDIDtDIDtCHEAPTICKETSt520.99tORBITZt520.99tNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtNtN 
tNtNtNtNtNtNtNtTRIPADVISOR - UStN1tN1tNtNtNtNtNt1tNtNtNt1tNtNtNtNtNtNtNtNtNt2014-01-01tEpam.COM" 
for line in file: 
yield line.strip() 
def compare_flags(source, target): 
"""Compare flags from source and target lists. Src/trg should have the same size""" 
size = len(source) 
out = list() 
# Go through elemets, add 0 to OUT list if src/trg elements equals 
for i in xrange(size): 
if target[i] != '-': 
if target[i] == source[i]: 
out.append(0) 
else: 
#logging.debug("Position: %i. Values of src/trg not equals, skip: %s,%s" % (i, source[i], target[i])) 
return None 
#out.append(1) 
else: 
out.append('-') 
return out 
def main(separator='t'): 
column_list = 
["ORIGIN","DESTINATION","OND","CARRIER","LOS","BKG_WINDOW","LOCAL_CURRENCY","LOWEST_PRICE","PAGE","POSITION","XP_RANK","XP_PRICE","XP_COMPETED", 
"XP_PRICE_DIFF","BML","NUMBER_SELLERS","XP_IS_HERO","ECPC_LOSS","PRICE_LOSS","OTA_1","OTA_1_PRICE","OTA_2","OTA_2_PRICE","OTA_3","OTA_3_PRICE","OT 
A_4","OTA_4_PRICE","OTA_5","OTA_5_PRICE","OTA_6","OTA_6_PRICE","OTA_7","OTA_7_PRICE","OTA_8","OTA_8_PRICE","OTA_9","OTA_9_PRICE","OTA_10","OTA_10_PRI 
CE","OTA_11","OTA_11_PRICE","OTA_12","OTA_12_PRICE","OTA_13","OTA_13_PRICE","OTA_14","OTA_14_PRICE","OTA_15","OTA_15_PRICE","PARTNER_NAME","RCXR"," 
DCXR","SPLIT_TICKET","DEPARTURE_DURATION","RETURN_DURATION","DEPARTURE_STOPS","RETURN_STOPS"] 
flag_list = 
["exp_listed_on_route_flag","exp_listed_on_carr_flag","exp_lst_on_itin_flag","carr_is_seller_flag","more_than_1_seller_flag","split_ticket_flag","exp_in_hero_flag","ota_in_hero_flag"," 
meta_in_hero_flag","carr_in_hero_flag","cheapest_prc_is_unique_flag","exp_prc_match_carr_flag","exp_prc_match_cheapest_flag","cheapest_ota_meta_prc_match_carr_flag"] 
partition_list = ["SHOP_DATE", "PARTNER_POS"] 
logging.debug("Star specifying vocabulary matrix") 
target = [ 
(["Inventory","Epam not showing route","Epam Lost","Unknown"],["0","-","0","-","-","-","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","0","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier for Epam"],["1","0","0","1","1","0","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier on Meta"],["1","0","0","1","0","0","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","-","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Epam not showing itinerary","Epam Lost","Unknown"],["1","1","0","-","1","0","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","0","0","0","0","1","-","-","-","-","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","1","0","0","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","0","1","0","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Unknown","Epam Won"],["1","1","1","0","0","0","1","0","0","0","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Epam Lost","Suspected Carrier Restricted Content"],["1","1","0","1","0","0","0","0","0","1","-","-","-","-"]) 
,(["Inventory","Unique Inventory","Epam Lost","Unknown"],["1","1","0","0","0","0","-","-","-","-","-","-","-","-"]) 
,(["Price","Carrier more expensive","Undercutting carrier","Epam Won"],["1","1","1","1","-","0","1","0","0","0","-","0","-","-"]) 
,(["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","0","1","0","-","-","0","0"]) 
,(["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","1","0","0","-","-","0","0"]) 
,(["Price","Carrier cheapest","Epam Lost","Unknown"],["1","1","1","1","1","0","0","-","-","1","-","0","0","0"]) 
,(["Price","Carrier cheapest","Epam Lost","Carrier controlled pricing"],["1","1","1","1","-","0","0","0","0","1","1","0","-","-"]) 
,(["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","0","1","0","-","-","0","-"]) 
,(["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","1","0","0","-","-","0","-"]) 
,(["Price","Fees or charges","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]) 
,(["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","0","1","0","-","-","0","-"]) 
,(["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","1","0","0","-","-","0","-"]) 
,(["Price","Fees or charges","Unknown","Epam Won"],["1","1","1","0","-","0","1","0","0","0","1","-","-","-"]) 
,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","1","0","0","0","-","0","1"]) 
,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","1","0","0","-","0","1"]) 
,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","0","1","0","0","-","1"]) 
,(["Rank","Rank","ECPC","Epam Won"],["1","1","1","-","1","-","1","0","0","0","0","-","1","-"]) 
,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","1","0","0","0","-","1","-"]) 
,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","1","0","0","-","1","-"]) 
,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","0","1","0","-","1","-"]) 
] 
# Input comes from STDIN 
data = read_input(sys.stdin) 
header_list = column_list + flag_list + partition_list 
logging.debug("Header for input data: %s" % header_list) 
logging.debug("Start reading from STDIN") 
# Loop through STDIN 
for words in data: 
#for words in sys.stdin: 
logging.debug("-----------") 
current_flags = list() 
#words = words.strip() 
words = words.split('t') 
logging.debug("Input values from external process (STDIN): %s" % words) 
logging.debug("Input length: %s" % len(words)) 
if (len(header_list) != len(words)): 
logging.error("Length of IN data (%i) not equal Header length (%i)! Exit" % (len(words), len(header_list))) 
sys.exit(1) 
data_set = dict(zip(header_list, words)) 
logging.debug("Parsing of STDIN: %s" % data_set) 
# Get flags 
for flag in flag_list: 
current_flags.append(data_set[flag]) 
logging.debug("Find flags: %s" % current_flags) 
# Get list with result of comparison src/trg 
compared_list = list() 
logging.debug("Comparing flags with vocabulary...") 
for k,v in target: 
#logging.debug("key, value: %s,%s" % (k, v)) 
temp_out = compare_flags(current_flags,v) 
if not temp_out: 
continue 
logging.debug("Match is found: %s" % temp_out) 
compared_list.append((k, temp_out)) 
temp_out = list() 
logging.debug("Comparing flags with vocabulary finished. List of matches: %s" % (compared_list)) 
# Find max occurrence of src in trg (find max-occurrence of zeros) 
max_zeros = 0 
out_fields = list() 
max_flag_from_trg =list() 
for k, v in compared_list: 
#logging.debug("key, value: %s,%s" % (k, v)) 
count_zero = v.count(0) 
if count_zero > max_zeros: 
out_fields = k 
max_flag_from_trg = v 
if (not out_fields) or (not max_flag_from_trg): 
logging.warning("Can't find values in vocabulary. Set values for DEFAULT") 
logging.warning("Fields: %s" % out_fields) 
logging.warning("Flags: %s" % max_flag_from_trg) 
out_fields = ["DEFAULT" for x in xrange(len(target[0][0]))] 
else: 
logging.debug("Output fields found") 
logging.debug("Fields: %s" % out_fields) 
logging.debug("Flags: %s" % max_flag_from_trg) 
# Output fields with flags in STDOUT 
field_data = [data_set[x] for x in column_list] 
partition_date = [data_set[x] for x in partition_list] 
out_row = separator.join(field_data) + separator + separator.join(out_fields) + separator + separator.join(partition_date) 
logging.debug("Output string: %s" % out_row) 
print out_row 
#print "%s%s%s%s%s" % (separator.join(field_data), separator, separator.join(out_fields), separator, separator.join(partition_date)) 
if __name__ == "__main__": 
logging.basicConfig(level=logging.DEBUG, stream=sys.stderr, 
#format='%(filename)s[LINE:%(lineno)d]# %(levelname)-8s [%(asctime)s] %(message)s' 
format='[%(asctime)s][%(filename)s][%(levelname)s] %(message)s' 
) 
main()
Hive Streaming: Debug
echo -e "val11\tval12\t…val1N\nval21\tval22\t…val2N" | ./script_name.py

Example:
Put 2 lines (TSV) on stdin:
echo -e "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM\nIAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM" | ./script_name.py

Get 2 lines with the new fields (and without the flags) on stdout:
IAH CUN IAH-CUN 01  14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS
520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N
\N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier
Epam Lost Unknown 2014-01-01 EPAM.COM
IAH CUN IAH-CUN 01  14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS
520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N
\N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier
Epam Lost Unknown 2014-01-01 EPAM.COM
Hive Streaming: Pitfalls
• Add the script to the Distributed Cache before running a query with Hive Streaming
• Keep the columns used for Dynamic Partitioning last in the select statement
• Use a separator more robust than the default TAB to prevent data inconsistency (sketched below)
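A minimal sketch of the last pitfall: overriding the default TAB on both sides of the pipe with ROW FORMAT; the Ctrl-A delimiter and the names used are assumptions, not from the deck:

SELECT TRANSFORM (origin, destination)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'   -- Ctrl-A rarely occurs in real data
  USING 'script_name.py'
  AS (origin STRING, destination STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'   -- applies to the script's output
FROM source_table;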
Note: always use iterator/generator functions (the Python way) instead of reading from
stdin explicitly! It saves system resources and makes the script run much faster (more
than 10 times faster in this case)

Example:
def read_input(file):
    for line in file:
        # strip the trailing newline
        yield line.strip()

data = read_input(sys.stdin)
for words in data:
    …

# instead of reading stdin directly:
for words in sys.stdin:
    …
SECTION 
Hadoop MR Streaming in Hive 
BENCHMARKS 
Hive Streaming: 
Benchmarks 
Hive "case … when" 
clause 
Source: 
MANAGED, Non-partitioned, 
2M rows 
Target: 
MANAGED, Non-partitioned
Time spent: 2m39s
Hive Streaming: 
Benchmarks 
Hive Streaming 
Source: 
MANAGED, Non-partitioned, 
2M rows 
Target: 
MANAGED, Non-partitioned
Time spent: 4m53s 
Note: no compression is applied to the output, so the "Number of bytes written" counter is dramatically larger
Hive Streaming: 
Benchmarks 
Hive "case … when" clause 
Source: 
MANAGED, Non-partitioned, 
2M rows 
Target: 
MANAGED, Partitioned by 2 columns 
Time spent: 2m44s
Hive Streaming: 
Benchmarks 
Hive Streaming 
Source: 
MANAGED, Non-partitioned, 
2M rows 
Target: 
MANAGED, Partitioned by 2 columns 
Time spent: 5m12s
Thanks!
Join us at
https://www.linkedin.com/groups/Belarus-Hadoop-User-Group-BHUG-8104884
yauheni_yushyn@epam.com
skype: ushin.evgenij
