Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. This presentation shows how to use the Streaming feature in Hive to reduce code complexity, using a real-life example.
4. Intro
Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process.
Unix-like interface:
• Streaming API opens an I/O pipe to an external process
• Process reads data from the standard input and writes the results out through the standard output
By default,
INPUT for user script:
• columns transformed to STRING
• delimited by TAB
• NULL values converted to the literal string \N (to differentiate NULL values from empty strings)
OUTPUT of user script:
• treated as TAB-separated STRING columns
• \N will be re-interpreted as a NULL
• resulting STRING column will be cast to the data type specified in the table declaration
These defaults can be overridden with ROW FORMAT
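The defaults above can be sketched from the script's side. This is a minimal illustration, not code from the presentation: the names parse_row, format_row and transform are hypothetical, and the "count NULLs" logic only stands in for real business logic.

```python
import io

NULL_MARKER = "\\N"  # default text encoding of NULL in streamed rows


def parse_row(line):
    """Split one streamed line into columns; the NULL marker becomes None."""
    fields = line.rstrip("\n").split("\t")
    return [None if f == NULL_MARKER else f for f in fields]


def format_row(values):
    """Join columns into one TAB-separated line; None becomes the NULL marker."""
    return "\t".join(NULL_MARKER if v is None else str(v) for v in values)


def transform(stdin, stdout):
    """Hypothetical transform: append a column with the number of NULLs."""
    for line in stdin:
        row = parse_row(line)
        null_count = sum(1 for v in row if v is None)
        stdout.write(format_row(row + [null_count]) + "\n")


# Usage with in-memory streams; a real job would pass sys.stdin and sys.stdout.
source = io.StringIO("IAH\t\t\\N\n")
sink = io.StringIO()
transform(source, sink)
result = sink.getvalue()
```

Note that the empty second column stays an empty string while the third column is read as NULL, which is the distinction the \N marker exists for.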
5. Pros and Cons
Pros:
• Simplicity for the developer, who deals only with stdin/stdout
• Schema-less model, treat values as needed
• Non-Java interface
Cons:
• Overhead for Serialization/Deserialization between processes
• Disallowed when "SQL standard based authorization" is configured (Hive 0.13.0 and later releases)
6. Hive reference
• MAP()
• REDUCE()
• TRANSFORM()
Hive provides several clauses to use streaming: MAP(), REDUCE(), and TRANSFORM().
Note:
MAP() does not actually force streaming during the map phase, nor does REDUCE() force streaming to happen in the reduce phase. For this reason, the functionally equivalent yet more generic TRANSFORM() clause is suggested to avoid misleading the reader of the query.
8. Use case from Real Life
Requirements:
There are 14 flags in the source table in Hive, which control the output values for 4 new fields in the target table
Solutions:
• Hive "case … when" clause
• User Defined Function (UDF)
• Custom MR Job
• Hive Streaming
10. Hive "case … when" clause
• More than 1,500 lines of code to map flags to new fields (the statement repeats for every new output field)
• Hard to debug
• Fast execution
• SQL-like syntax
• All logic in one place (hql script)
11. UDF
• You are the single consumer of the UDF (in this particular case, custom logic for a single DataMart)
• Java-code
• Fast execution
• Pass only the needed flags into the UDF (in contrast with Hive Streaming)
• In the end: SQL-like syntax, all logic in one place
• Java-code
12. Hive Streaming
• Slower execution (time for SerDe)
• Deals with all fields, not only flags (in contrast with UDF)
• Reduced code complexity by using a scripting language
• Small size of code
• Fast development
• Wide choice of programming languages
15. Hive Streaming: Realization
Python snippets:
• Create matrix (e.g., list of tuples) with flags and related values of fields
• Loop through INPUT
• Split INPUT by TAB
• Split data fields and flags
• Compare with matrix and pick the best match (the one with the most matching flags)
• Write out data with new fields as TAB-separated text
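The steps above can be sketched on a toy matrix before reading the full script. This is an illustrative sketch only: the 3-flag MATRIX, the DEFAULT row, the function names, and the assumption that the flags are the trailing columns are all hypothetical simplifications.

```python
# Toy vocabulary: (output fields, flag pattern); '-' matches any flag value.
MATRIX = [
    (["Inventory", "Unknown"], ["0", "-", "-"]),
    (["Price", "Unknown"], ["1", "0", "-"]),
    (["Price", "Fees"], ["1", "0", "1"]),
]
DEFAULT = ["DEFAULT", "DEFAULT"]


def score(flags, pattern):
    """Return the number of exactly matched positions, or None on mismatch."""
    matched = 0
    for flag, expected in zip(flags, pattern):
        if expected == "-":
            continue  # wildcard position, matches anything
        if flag != expected:
            return None
        matched += 1
    return matched


def classify(flags):
    """Pick the output fields of the pattern with the most matched flags."""
    best_fields, best_score = DEFAULT, 0
    for fields, pattern in MATRIX:
        current = score(flags, pattern)
        if current is not None and current > best_score:
            best_fields, best_score = fields, current
    return best_fields


def process(line, n_flags=3):
    """Split a TAB-separated row and replace the trailing flags with new fields."""
    words = line.rstrip("\n").split("\t")
    data, flags = words[:-n_flags], words[-n_flags:]
    return "\t".join(data + classify(flags))
```

For example, process("IAH\tCUN\t1\t0\t1\n") matches both "Price" patterns and keeps the more specific one, while a row whose flags match no pattern gets the DEFAULT fields.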
16. Hive Streaming: Source code
#!/usr/bin/env python
"""Mapper for Hive Streaming, using Python iterators and generators.

Emits new fields according to the input flags.
"""
import sys
import logging


def read_input(file):
    """Read rows from STDIN lazily, using a Python generator."""
    # For local debugging, a hard-coded TAB-separated sample row can be
    # yielded here instead of reading from the file.
    for line in file:
        # Drop only the line terminator: strip() would also remove leading or
        # trailing TABs and so lose empty first or last columns.
        yield line.rstrip('\r\n')
def compare_flags(source, target):
    """Compare flags from source and target lists. Src/trg should have the same size.

    Returns a list with 0 for each matched position and '-' for each wildcard
    position, or None as soon as one position does not match.
    """
    size = len(source)
    out = list()
    # Go through elements, add 0 to OUT list if src/trg elements are equal
    for i in range(size):
        if target[i] != '-':
            if target[i] == source[i]:
                out.append(0)
            else:
                #logging.debug("Position: %i. Values of src/trg not equals, skip: %s,%s" % (i, source[i], target[i]))
                return None
        else:
            out.append('-')
    return out
def main(separator='\t'):
    column_list = [
        "ORIGIN", "DESTINATION", "OND", "CARRIER", "LOS", "BKG_WINDOW", "LOCAL_CURRENCY", "LOWEST_PRICE", "PAGE", "POSITION",
        "XP_RANK", "XP_PRICE", "XP_COMPETED", "XP_PRICE_DIFF", "BML", "NUMBER_SELLERS", "XP_IS_HERO", "ECPC_LOSS", "PRICE_LOSS",
        "OTA_1", "OTA_1_PRICE", "OTA_2", "OTA_2_PRICE", "OTA_3", "OTA_3_PRICE", "OTA_4", "OTA_4_PRICE", "OTA_5", "OTA_5_PRICE",
        "OTA_6", "OTA_6_PRICE", "OTA_7", "OTA_7_PRICE", "OTA_8", "OTA_8_PRICE", "OTA_9", "OTA_9_PRICE", "OTA_10", "OTA_10_PRICE",
        "OTA_11", "OTA_11_PRICE", "OTA_12", "OTA_12_PRICE", "OTA_13", "OTA_13_PRICE", "OTA_14", "OTA_14_PRICE", "OTA_15", "OTA_15_PRICE",
        "PARTNER_NAME", "RCXR", "DCXR", "SPLIT_TICKET", "DEPARTURE_DURATION", "RETURN_DURATION", "DEPARTURE_STOPS", "RETURN_STOPS"]
    flag_list = [
        "exp_listed_on_route_flag", "exp_listed_on_carr_flag", "exp_lst_on_itin_flag", "carr_is_seller_flag",
        "more_than_1_seller_flag", "split_ticket_flag", "exp_in_hero_flag", "ota_in_hero_flag", "meta_in_hero_flag",
        "carr_in_hero_flag", "cheapest_prc_is_unique_flag", "exp_prc_match_carr_flag", "exp_prc_match_cheapest_flag",
        "cheapest_ota_meta_prc_match_carr_flag"]
    partition_list = ["SHOP_DATE", "PARTNER_POS"]
    logging.debug("Start specifying vocabulary matrix")
    # Each entry: ([4 output field values], [14 expected flags]); '-' matches any flag value
    target = [
(["Inventory","Epam not showing route","Epam Lost","Unknown"],["0","-","0","-","-","-","-","-","-","-","-","-","-","-"])
,(["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","0","-","-","-","-","-","-","-","-"])
,(["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier for Epam"],["1","0","0","1","1","0","-","-","-","-","-","-","-","-"])
,(["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier on Meta"],["1","0","0","1","0","0","-","-","-","-","-","-","-","-"])
,(["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","-","-","-","-","-","-","-","-","-"])
,(["Inventory","Epam not showing itinerary","Epam Lost","Unknown"],["1","1","0","-","1","0","-","-","-","-","-","-","-","-"])
,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","0","0","0","0","1","-","-","-","-","-","-","-","-"])
,(["Inventory","Unique Inventory","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"])
,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","1","0","0","-","-","-","-"])
,(["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","0","1","0","-","-","-","-"])
,(["Inventory","Unique Inventory","Unknown","Epam Won"],["1","1","1","0","0","0","1","0","0","0","-","-","-","-"])
,(["Inventory","Unique Inventory","Epam Lost","Suspected Carrier Restricted Content"],["1","1","0","1","0","0","0","0","0","1","-","-","-","-"])
,(["Inventory","Unique Inventory","Epam Lost","Unknown"],["1","1","0","0","0","0","-","-","-","-","-","-","-","-"])
,(["Price","Carrier more expensive","Undercutting carrier","Epam Won"],["1","1","1","1","-","0","1","0","0","0","-","0","-","-"])
,(["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","0","1","0","-","-","0","0"])
,(["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","1","0","0","-","-","0","0"])
,(["Price","Carrier cheapest","Epam Lost","Unknown"],["1","1","1","1","1","0","0","-","-","1","-","0","0","0"])
,(["Price","Carrier cheapest","Epam Lost","Carrier controlled pricing"],["1","1","1","1","-","0","0","0","0","1","1","0","-","-"])
,(["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","0","1","0","-","-","0","-"])
,(["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","1","0","0","-","-","0","-"])
,(["Price","Fees or charges","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"])
,(["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","0","1","0","-","-","0","-"])
,(["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","1","0","0","-","-","0","-"])
,(["Price","Fees or charges","Unknown","Epam Won"],["1","1","1","0","-","0","1","0","0","0","1","-","-","-"])
,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","1","0","0","0","-","0","1"])
,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","1","0","0","-","0","1"])
,(["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","0","1","0","0","-","1"])
,(["Rank","Rank","ECPC","Epam Won"],["1","1","1","-","1","-","1","0","0","0","0","-","1","-"])
,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","1","0","0","0","-","1","-"])
,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","1","0","0","-","1","-"])
,(["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","0","1","0","-","1","-"])
]
    # Input comes from STDIN
    data = read_input(sys.stdin)
    header_list = column_list + flag_list + partition_list
    logging.debug("Header for input data: %s" % header_list)
    logging.debug("Start reading from STDIN")
    # Loop through STDIN
    for words in data:
        logging.debug("-----------")
        current_flags = list()
        words = words.split('\t')
        logging.debug("Input values from external process (STDIN): %s" % words)
        logging.debug("Input length: %s" % len(words))
        if len(header_list) != len(words):
            logging.error("Length of IN data (%i) not equal Header length (%i)! Exit" % (len(words), len(header_list)))
            sys.exit(1)
        data_set = dict(zip(header_list, words))
        logging.debug("Parsing of STDIN: %s" % data_set)
        # Get flags
        for flag in flag_list:
            current_flags.append(data_set[flag])
        logging.debug("Find flags: %s" % current_flags)
        # Get list with result of comparison src/trg
        compared_list = list()
        logging.debug("Comparing flags with vocabulary...")
        for k, v in target:
            temp_out = compare_flags(current_flags, v)
            if not temp_out:
                continue
            logging.debug("Match is found: %s" % temp_out)
            compared_list.append((k, temp_out))
        logging.debug("Comparing flags with vocabulary finished. List of matches: %s" % (compared_list))
        # Find max occurrence of src in trg (find max-occurrence of zeros)
        max_zeros = 0
        out_fields = list()
        max_flag_from_trg = list()
        for k, v in compared_list:
            count_zero = v.count(0)
            if count_zero > max_zeros:
                # Remember the best score so far; without this update the last
                # match with any zero would win instead of the best one
                max_zeros = count_zero
                out_fields = k
                max_flag_from_trg = v
        if (not out_fields) or (not max_flag_from_trg):
            logging.warning("Can't find values in vocabulary. Set values for DEFAULT")
            logging.warning("Fields: %s" % out_fields)
            logging.warning("Flags: %s" % max_flag_from_trg)
            out_fields = ["DEFAULT" for x in range(len(target[0][0]))]
        else:
            logging.debug("Output fields found")
            logging.debug("Fields: %s" % out_fields)
            logging.debug("Flags: %s" % max_flag_from_trg)
        # Output data fields, new fields and partition fields (flags are dropped) to STDOUT
        field_data = [data_set[x] for x in column_list]
        partition_date = [data_set[x] for x in partition_list]
        out_row = separator.join(field_data) + separator + separator.join(out_fields) + separator + separator.join(partition_date)
        logging.debug("Output string: %s" % out_row)
        print(out_row)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
                        #format='%(filename)s[LINE:%(lineno)d]# %(levelname)-8s [%(asctime)s] %(message)s'
                        format='[%(asctime)s][%(filename)s][%(levelname)s] %(message)s'
                        )
    main()
17. Hive Streaming: Debug
echo -e "val11\tval12\t…val1N\nval21\tval22\t…val2N" | ./script_name.py
Example:
Put 2 lines (TSV) in stdin
echo -e "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\
\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\
\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\
\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM\n\
IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\
\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\
\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\
\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM" | ./script_name.py
Get 2 lines with new fields (without flags) in stdout
IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS
520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N
\N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier
Epam Lost Unknown 2014-01-01 EPAM.COM
IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS
520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N
\N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier
Epam Lost Unknown 2014-01-01 EPAM.COM
18. Hive Streaming: Pitfalls
• Add the script to the Distributed Cache before running a query with Hive Streaming
• Put the partition columns last in the select statement for Dynamic Partitioning
• Use a more robust separator than the default TAB to prevent inconsistency of data
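The separator pitfall can be illustrated with a small sketch. The value containing a TAB and the choice of "\x01" (Ctrl-A) as the alternative separator are assumptions made for illustration; on the Hive side the same delimiter would have to be declared for the script's input and output with ROW FORMAT.

```python
def split_row(line, separator="\t"):
    """Split one streamed line into columns on the given separator."""
    return line.rstrip("\n").split(separator)


# Hypothetical row whose second value itself contains a TAB character.
row = ["IAH", "TRIPADVISOR\tUS", "2014-01-01"]

tab_line = "\t".join(row)        # default separator
ctrl_a_line = "\x01".join(row)   # assumed alternative separator (Ctrl-A)

tab_columns = split_row(tab_line)                 # embedded TAB adds a column
ctrl_a_columns = split_row(ctrl_a_line, "\x01")   # original 3 columns survive
```

With the default TAB the row is read back as 4 columns instead of 3, so every later column shifts by one; with a separator that does not occur in the data the row round-trips unchanged.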
Note: always use iterator/generator functions (the Python idiom) instead of explicit reading from stdin! It saves system resources and runs the script much faster (more than 10 times)
Example:
def read_input(file):
    for line in file:
        # drop the line terminator and yield the row
        yield line.rstrip('\n')

data = read_input(sys.stdin)
for words in data:
    …

instead of:

for words in sys.stdin:
    …
21. Hive Streaming: Benchmarks
Hive Streaming
Source: MANAGED, Non-partitioned, 2M rows
Target: MANAGED, Non-Partitioned
Time spent: 4m53s
Note: no compression for output, so “Number of bytes written extremely larger