Hive – Data Warehouse System for Hadoop
How Harish & I met and decided to collaborate
How we plan to go over stuff
Nuggets or Data Points
1.5PB: not as big as Yahoo or Facebook, but huge from a retail industry perspective
Site Optimization and others are just a few of the use cases that can be solved by leveraging ClickStream Analytics
Hive usage at {rr}
So the picture in your mind should be:
- The user specifies a Function in SQL anywhere a Table can appear
- Behind the scenes: at runtime the Function is responsible for taking a Partition & returning a Partition.
Or:
- The user specifies one or more Windowing expressions
- Behind the scenes the internal Windowing Table Function processes the data, partition by partition.
Windowing and PTF infrastructure is the same
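To make the "Partition in, Partition out" picture concrete, here is a minimal Python sketch. It is not Hive's API: the `Partition` and `TableFunction` types, the function names, and the sample columns (`sessionId`, `ts`, `page_type`) are all assumptions made up for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Partition = List[Row]
# Hypothetical shape of a table function: one partition in, one partition out.
TableFunction = Callable[[Partition], Partition]


def running_count(partition: Partition) -> Partition:
    """Windowing expressed as a table function: add a running row count."""
    return [dict(row, running_count=i) for i, row in enumerate(partition, start=1)]


def first_row_only(partition: Partition) -> Partition:
    """A table function may also return fewer rows than it received."""
    return partition[:1]


# One partition: the rows of a single session, already ordered by time.
session = [
    {"sessionId": "s1", "ts": 1, "page_type": "home"},
    {"sessionId": "s1", "ts": 2, "page_type": "item"},
]

windowed = running_count(session)   # same rows, one extra column
entry = first_row_only(session)     # a differently shaped partition
```

The windowing case keeps every input row and only adds columns, while a general table function is free to change the shape of its output; both fit the same signature, which is the sense in which the infrastructure is shared.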
NPath: get the example from Hive
- One last thing, a quick picture of runtime
- Here is how PTFs fit into the Hive flow.
- A Query is translated into a set of Jobs by the Hive Driver.
- Within each task, one or more SQL Operators are executed.
- These operate on a stream of rows.
- For PTFs a new PTF Operator gets injected into the reduce side.
- It collects the rows of a partition into a Partition object and invokes the PTF Function.
- The Function's job is to provide an output Partition, whose rows get injected back into the stream of rows.
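The runtime flow above can be sketched as a generator that sits in a stream of rows. This is an illustration in Python, not Hive's operator code: `ptf_operator`, `add_row_number`, and the sample rows are hypothetical, and the sketch assumes rows arrive already grouped by the partition key, as they would on the reduce side after the shuffle and sort.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def ptf_operator(
    rows: Iterable[Row],
    partition_key: str,
    function: Callable[[List[Row]], List[Row]],
) -> Iterator[Row]:
    """Buffer each run of rows sharing a key, call the function, re-emit rows."""
    # Assumption: the input is already grouped by partition_key.
    for _, group in groupby(rows, key=itemgetter(partition_key)):
        partition = list(group)            # collect rows into a partition
        yield from function(partition)     # inject the output rows back


def add_row_number(partition: List[Row]) -> List[Row]:
    """Hypothetical table function: number the rows within a partition."""
    return [dict(row, row_number=i) for i, row in enumerate(partition, start=1)]


stream = iter([
    {"sessionId": "s1", "ts": 1},
    {"sessionId": "s1", "ts": 2},
    {"sessionId": "s2", "ts": 1},
])

output = list(ptf_operator(stream, "sessionId", add_row_number))
```

Downstream operators see an ordinary stream of rows again; only the operator itself ever holds a whole partition in memory.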
Fluent way to do things
RANK function
The inner query selects a certain set of fields, partitions the data by sessionId, and sorts the views in each session by timestamp, i.e. the order in which they occurred, starting with the first one. The query then selects only the first event of each session, which is the row with rank = 1.
The outer query groups the data by page_type and applies the count aggregate function to the sessionId
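The same two-step logic, written out in plain Python rather than SQL so that each step is visible. The sample events are made up, and `with_rank` is a hypothetical helper that mimics RANK() over one partition; only the column names sessionId and page_type come from the query described above.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Made-up click events; ts stands in for the timestamp column.
events = [
    {"sessionId": "s1", "ts": 10, "page_type": "home"},
    {"sessionId": "s1", "ts": 20, "page_type": "item"},
    {"sessionId": "s2", "ts": 5, "page_type": "item"},
    {"sessionId": "s2", "ts": 7, "page_type": "cart"},
    {"sessionId": "s3", "ts": 1, "page_type": "home"},
]


def with_rank(partition):
    """RANK() over rows already sorted by ts: ties share a rank."""
    ranked = []
    for position, row in enumerate(partition, start=1):
        if ranked and ranked[-1]["ts"] == row["ts"]:
            rank = ranked[-1]["rank"]
        else:
            rank = position
        ranked.append(dict(row, rank=rank))
    return ranked


# Inner query: partition by sessionId, order by ts, keep rank = 1.
ordered = sorted(events, key=itemgetter("sessionId", "ts"))
inner = []
for _, group in groupby(ordered, key=itemgetter("sessionId")):
    inner.extend(r for r in with_rank(list(group)) if r["rank"] == 1)

# Outer query: group by page_type and count the sessions.
entry_counts = Counter(row["page_type"] for row in inner)
```

With this sample data two sessions start on a home page and one on an item page. If two events in a session shared the earliest timestamp, both would have rank 1 and that session would be counted twice.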
Example just does a count
Landing events are pages where referral id is not NULL
Google landing events in a session: item page, non-bounce page
Sessions which have one row: the one where rank() = 1
If you want to compute something per session using time, you are computing the difference between the first & last timestamps: FIRST & LAST value
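A small Python sketch of the first/last idea: the session duration is the last timestamp minus the first one within each session partition. The data and the `session_durations` helper are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter

# Made-up events; ts is seconds since some arbitrary start.
events = [
    {"sessionId": "s1", "ts": 160},
    {"sessionId": "s1", "ts": 100},
    {"sessionId": "s1", "ts": 400},
    {"sessionId": "s2", "ts": 50},
]


def session_durations(rows):
    """Per session: last value of ts minus first value of ts."""
    ordered = sorted(rows, key=itemgetter("sessionId", "ts"))
    durations = {}
    for session_id, group in groupby(ordered, key=itemgetter("sessionId")):
        partition = list(group)
        first_ts = partition[0]["ts"]    # FIRST value in the partition
        last_ts = partition[-1]["ts"]    # LAST value in the partition
        durations[session_id] = last_ts - first_ts
    return durations


durations = session_durations(events)
```

A session with a single row comes out with a duration of 0. When writing this in SQL, keep in mind that with an ORDER BY the default window frame usually ends at the current row, so the last value has to be taken over a frame that spans the whole partition.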
Highlighting that the window does not have to be a number range; it can be value based
In a row in a session you want to look ahead: what some one time every activity
Timeline function: Table Functions give a lot more leeway, e.g. some kind of pathing just like NPATH
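To show the difference between a window defined by a number of rows and one defined by values, here is a hedged Python sketch that looks ahead from each row of one session. The function names, the 30 second bound, and the timestamps are all made up for illustration.

```python
def count_rows_following(partition, n):
    """Row-based frame: the current row plus the next n rows."""
    return [len(partition[i:i + n + 1]) for i in range(len(partition))]


def count_range_following(partition, seconds):
    """Value-based frame: rows whose ts lies in [ts, ts + seconds]."""
    counts = []
    for row in partition:
        upper = row["ts"] + seconds
        in_frame = [r for r in partition if row["ts"] <= r["ts"] <= upper]
        counts.append(len(in_frame))
    return counts


# One session partition, sorted by ts (seconds).
session = [{"ts": 0}, {"ts": 10}, {"ts": 25}, {"ts": 100}]

by_rows = count_rows_following(session, 2)       # next 2 rows
by_value = count_range_following(session, 30)    # next 30 seconds
```

The row-based frame always reaches two rows ahead regardless of how far apart the events are, while the value-based frame stops at the 30 second boundary, so the event at ts = 100 never appears in an earlier row's frame.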
How is it different from the last one?
- Lead function: cannot pivot the value
- Fundamental pattern is the same
How about the following:
If I understand the schema, the query below should give you the Orders and the products purchased that contain all the listed products.
So say the products you are looking for are 'P1,P2,P3', then the sum will give you a count of the products in this Order that match one of the listed products.
The having clause will filter out all Orders that don't have at least 3 matches (i.e. matching all the listed products).
The r = 1 condition will return 1 row per order.
The o/p is of the form: OrderNumber, {products in order as a set}, other details...
Can of course return each product in the Order as a separate row if you want to do more aggregation. For e.g. count the orders that these products appear in and then rank them or set up a cutoff threshold etc.
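Since the query itself is not reproduced in these notes, here is the described logic as a Python sketch instead. The order lines, the `wanted` set, and all variable names are hypothetical. Collecting each order's products into a set is a deliberate choice here, so that a product listed twice in one order is only counted once.

```python
from collections import defaultdict

# The listed products we are looking for.
wanted = {"P1", "P2", "P3"}

# Made-up (order number, product) rows.
order_lines = [
    ("O1", "P1"), ("O1", "P2"), ("O1", "P3"), ("O1", "P9"),
    ("O2", "P1"), ("O2", "P2"),
    ("O3", "P3"), ("O3", "P2"), ("O3", "P1"),
]

# Partition by order: collect each order's products as a set.
products_by_order = defaultdict(set)
for order_number, product in order_lines:
    products_by_order[order_number].add(product)

matching_orders = {}
for order_number, products in products_by_order.items():
    # The "sum": how many products in this order are in the wanted list.
    matches = sum(1 for product in products if product in wanted)
    # The "having": keep orders that match every listed product.
    if matches >= len(wanted):
        # One output row per order: order number plus its product set.
        matching_orders[order_number] = sorted(products)
```

Here O1 and O3 qualify and O2 does not, since it is missing P3. Emitting one row per product instead of one row per order would be the starting point for the follow-up aggregation mentioned above.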
Notes: R and SQL
This would bring a different way
Pull data into R
Push R functionality where data is?
Who is thinking about this future?