Analytics using Apache Hive
with the power of Windowing
and Table functions:
Use Cases
Murtaza Doctor - murtaza@richrelevance.com
Principal Architect, RichRelevance

© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
Outline
• {rr} story
• What is Clickstream Analytics
• Hive at {rr}
• Windowing & PTF Framework
• Case Study: use cases
• Current, Next & Future
• Q&A

RichRelevance {rr}

RichRelevance DataMesh
[Diagram: RichRelevance DataMesh platform. Recoverable labels:]
• Data Ingestion (including 3rd Party): Clickstream, Catalog, Online sales, In-store sales, Ad impressions, Social profiles, Redemptions…
• Realtime; Offline
• Customer Data Store
• Analytics & Optimization: 125+ models, Customer models, Product models, A/B and MVT testing, King-of-the-hill optimization
• Data Feeds: Real-time Decisioning (65 msec), Event Triggered (minutes), Batch Updates (hours), Reporting (ad hoc, OLAP, Excel)
• [Client] Innovation Cloud: Custom apps and APIs, Self-Serve Analytics, Personalized Category Sort, Real-time Segmentation, Network Ad Tracking
• {rr} SaaS & APIs
Underlying Technologies: Hadoop, HBase, Hive, Kafka, Avro, Voldemort, Postgres, Pentaho OLAP, R
Did You Know?
• Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time
• Our cloud-based platform supports both real-time processes and analytical use cases, utilizing technologies such as Crunch, Hive, HBase, Avro, Azkaban, Voldemort and Kafka
• In the US, we serve 7000 requests per second with an average response time of 50 ms
• Someone clicks on a {rr} recommendation every 21 milliseconds

What is Clickstream Analytics?
• Collect, Combine, Aggregate & Analyze
• Clickstream – view, click, purchase events
• It is all about the Session or Visit
• User properties – userId, location, etc.
• Site Optimization, Sentiment Analysis, Buying Patterns and many more

Example: we use click-through rate (clicks/sessions) to measure how well ad placement positions are doing on pages, and can then test them based on engagement to see if other positions would work better.
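
The click-through rate described above can be sketched in a few lines of Python. This is only an illustration: the event tuples, the placement names and the `click_through_rate` helper are assumptions made for the example, not part of the {rr} pipeline.

```python
from collections import defaultdict

# Hypothetical events: (session_id, placement, event_type).
events = [
    ("s1", "top", "view"), ("s1", "top", "click"),
    ("s2", "top", "view"),
    ("s3", "side", "view"), ("s3", "side", "click"),
    ("s4", "side", "view"), ("s4", "side", "click"),
]

def click_through_rate(events):
    """Return {placement: clicks / sessions} for each ad placement."""
    sessions = defaultdict(set)   # placement -> sessions that saw it
    clicks = defaultdict(int)     # placement -> number of click events
    for session_id, placement, event_type in events:
        sessions[placement].add(session_id)
        if event_type == "click":
            clicks[placement] += 1
    return {p: clicks[p] / len(s) for p, s in sessions.items()}

ctr = click_through_rate(events)
print(ctr)
```

Comparing the resulting rates across placements is the kind of engagement signal the example on the slide refers to.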

Getting MAD on Hive

MAD skills from Hellerstein's paper
From a developer's perspective, a data platform should be:
• Magnetic – Attract data.
  As opposed to: having to justify loading any new data to a DBA. Quality and schema regimes have the adverse effect of repelling data.
• Agile – Data comes in many shapes and forms. Enable bringing in data in its native form.
  As opposed to: forcing a complex ETL process to bring data in.
  With Hive: pluggable formats, storage handlers and indices.
• Deep – Ability to operate on data directly, using existing algorithms that operate on native formats: SQL + machine learning + graph + …
  As opposed to: only SQL.
  With Hive: SQL + MapReduce scripts. But can we do better?

Problem for App Developer
I want to do:
• Sessionization
• Clustering
• Collaborative Filtering
• Fraud detection
• Time Series Analysis
• Churn Analysis

And I want to combine these analyses with SQL.
Analytic capabilities are available in most databases as:
• User Defined Table Functions
• External Table mechanisms, etc.
• The Aster SQL/MR library, which provides functions for many of the use cases above
• Oracle Stored Procedures + Table Functions, used to provide analytic packages

Our work: bring the same capability to Hive.

Hive at {rr}
• Real-time data in Hive
• Getting to 1 PB of data in Hive!
• Hive tables: event types, catalog, rollups, etc.
• Custom SerDe
• Partitioning scheme: most of the tables are partitioned by event date

Architecture

Roadblocks to the Solution
• Too many temporary tables
• Random sampling
• R for ranking & aggregate functions
• R can only handle smaller data sets
• Lots of self-joins
• Inefficient queries

Welcome to PTFs and Windowing

3 Major SQL Concepts
1. Table Function
• Enables injecting custom logic into the query data flow
• The contract for a TF is Table-In/Table-Out
• So it opens up analysis beyond row calculations and aggregations
• Example: a Sessionize function that decides which weblog entries belong to a session
• Syntactically, the function can appear anywhere a table can in SQL
[Diagram labels, top to bottom: Project, tableOut, Table Function, tableIn, Join, Select]
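
To make the Table-In/Table-Out contract concrete, here is a minimal Python sketch of a sessionize function. It is not the Hive implementation: the row schema (`user`, `ts`, `page`) and the 30-minute inactivity rule are assumptions chosen for illustration.

```python
def sessionize(rows, gap_seconds=1800):
    """Table-in/table-out: tag each weblog row with a session label.

    rows: iterable of dicts with 'user' and 'ts' (seconds); hypothetical schema.
    A new session starts when a user is idle for more than gap_seconds.
    """
    out = []
    last_seen = {}    # user -> timestamp of that user's previous event
    session_no = {}   # user -> current session counter
    for row in sorted(rows, key=lambda r: (r["user"], r["ts"])):
        user, ts = row["user"], row["ts"]
        if user not in last_seen or ts - last_seen[user] > gap_seconds:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        out.append({**row, "session": f"{user}-{session_no[user]}"})
    return out

weblog = [
    {"user": "u1", "ts": 0, "page": "home"},
    {"user": "u1", "ts": 600, "page": "item"},
    {"user": "u1", "ts": 4000, "page": "home"},
    {"user": "u2", "ts": 100, "page": "search"},
]
sessions = [r["session"] for r in sessionize(weblog)]
print(sessions)
```

The output has the same rows as the input plus one column, which is why such a function can stand wherever a table can.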

2. Partitioned Table Function
• A scaling mechanism
• Instead of operating on the entire table, divide the work into partitions
• Instances operating on individual partitions don't communicate
• Example: divide the weblog by day or week and operate on each part independently
• Intuitively like MR: PTF processing is done as MR jobs
[Diagram labels, top to bottom: Select, Project, Partitions Out, Table Function, Partitions In, Join, Select, Select]
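
The partition-then-apply idea can be sketched in plain Python. The `apply_ptf` driver and the `first_and_last` partition function below are hypothetical names used only for this illustration; in Hive the partitioning and ordering are done by the MR shuffle rather than an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

def apply_ptf(rows, partition_key, order_key, fn):
    """Partition rows, order each partition, and run fn on each one.

    fn receives one ordered partition (a list of rows) and returns a list
    of output rows; partitions never see each other's data.
    """
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    out = []
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        out.extend(fn(list(partition)))
    return out

def first_and_last(partition):
    # Example partition function: keep the first and last row of a partition.
    if len(partition) == 1:
        return [partition[0]]
    return [partition[0], partition[-1]]

weblog = [
    {"day": "2013-01-02", "ts": 5, "page": "cart"},
    {"day": "2013-01-01", "ts": 9, "page": "item"},
    {"day": "2013-01-01", "ts": 1, "page": "home"},
    {"day": "2013-01-01", "ts": 4, "page": "search"},
]
result = apply_ptf(weblog, "day", "ts", first_and_last)
pages = [r["page"] for r in result]
print(pages)
```

Because each call to `fn` only sees one partition, the calls could run on different machines, which is the scaling property the slide points to.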

3. Windowing
• Operate on a set of rows surrounding the current row
• Windows are defined like '5 preceding and 4 succeeding'
• On the window, allow aggregations and also navigation: lead, lag, first, last

PTFs and Windows are related
• You do windowing after everything else: join, group by, etc.
• You define windows on ordered partitions
• You then do aggregations and inter-row navigation on these windows
• If all the partitions across all window expressions are the same, then this is a special PTF.
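
As a rough illustration of window frames and row navigation, the following Python sketch computes a framed sum plus lead and lag over one ordered partition. The helper names are made up for this example and offsets are assumed to be non-negative; it mirrors the semantics only, not Hive's execution.

```python
def window_sum(values, preceding, following):
    """Sum over a frame of `preceding` rows before and `following` rows
    after each current row, clipped at the partition boundaries."""
    n = len(values)
    return [
        sum(values[max(0, i - preceding): min(n, i + following + 1)])
        for i in range(n)
    ]

def lead(values, offset=1, default=None):
    """Value `offset` rows after the current row, or default past the end."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]

def lag(values, offset=1, default=None):
    """Value `offset` rows before the current row, or default before the start."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

views = [3, 1, 4, 1, 5]            # one ordered partition
moving = window_sum(views, 1, 1)   # frame: 1 preceding, 1 following
print(moving, lead(views), lag(views))
```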
Ordered Partition: the central concept
In Functions:
• Analyze a partition of rows as a unit
• Output is not a summary of rows
• Sessionization: relate events to sessions
• Market Basket: find the most common Product/Page combinations

select *
from Sessionize(weblog….)

In Windowing:
• Ranking: Rank, Tiling
• Trending: Lead/Lag, Cumulative Sum

SELECT ViewsData.*,
  rank() over (DISTRIBUTE BY sessionid
               SORT BY timestamp DESC) as exit_rank
FROM ViewsData

[Diagram labels: Partition, Hive Translator, Output Partition]
Example: Time Series Analysis
Time Series Analysis: identify flights that have a delay problem.
• We want to look at all the times a flight happened and then make a judgment.
• To do this, one conceivable starting point is to find occurrences where a flight was late 3 or more times in a row.
• Use these as a starting point for further analysis.

Flights Table
Origin      Fl. Num  Year  Month  Day  Arr. Delay
Boston      1017     2010  10     25   59.37
Boston      1017     2010  10     26   58.14
Boston      1017     2010  10     28   30.83
Boston      1017     2010  10     29   25.67
Pittsburgh  1058     2010  12     26   82.62

Analyze rows by flight number. Look for sequences of late incidents. Output aggregation statistics about these sequences.

Origin      FlNum  Year  Month  Day  Avg. Delay  Num Of Delays
Boston      1017   2010  10     25   59.37       8
Boston      1017   2010  11     10   41.54       7
Pittsburgh  1058   2010  12     26   82.62       8

Use a PTF: NPath
• Helps you look for patterns in time
• The user specifies Labels: interesting conditions, e.g. LATE: arr_delay > 15 mins
• The user then specifies Patterns on Labels. Patterns are simple regexes, e.g.
  • LATE.LATE.LATE+ → look for occurrences where a flight is late 3 or more times in a row.
• On the occurrences found (an occurrence is a set of rows), specify aggregation calculations, e.g.
  • Average delay among late occurrences
  • Number of delays

select origin_city_name, fl_num, year, month, day_of_month, numDelay, avgDelay
from NPATH(
  'LATE.LATE+',
  'LATE', arr_delay > 15,
  'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as numDelay, arrAvg(tpath, "arrDelay") as avgDelay'
  on flights
  distribute by fl_num
  sort by year, month, day_of_month
)

1. Query on Flights Table
2. Looking at data per flight; order within partition by time (distribute by fl_num, sort by year, month, day_of_month)
3. Arg. 1 specifies the PATTERN; Arg. 2 specifies conditions as LABELS; Arg. 3 specifies AGGR. EXPRESSIONS
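
The label-and-pattern idea behind NPath can be imitated in Python with an ordinary regular expression over a string of labels. This is a sketch under stated assumptions: the sample rows are invented, `late_runs` is a hypothetical helper, and it hard-codes a single LATE label rather than implementing NPath's general pattern language.

```python
import re
from itertools import groupby

# Rows: (fl_num, year, month, day, arr_delay); hypothetical sample data.
flights = [
    (1017, 2010, 10, 25, 59.0), (1017, 2010, 10, 26, 58.0),
    (1017, 2010, 10, 27, 2.0),  (1017, 2010, 10, 28, 30.0),
    (1017, 2010, 10, 29, 25.0), (1017, 2010, 10, 30, 40.0),
    (1058, 2010, 12, 26, 82.0), (1058, 2010, 12, 27, 1.0),
]

def late_runs(rows, min_run=3, threshold=15):
    """Find runs of at least min_run consecutive late arrivals per flight."""
    results = []
    ordered = sorted(rows)  # partition by fl_num, order by date
    for fl_num, part in groupby(ordered, key=lambda r: r[0]):
        part = list(part)
        # Label each row: 'L' (LATE) when arr_delay > threshold, else '-'.
        labels = "".join("L" if r[4] > threshold else "-" for r in part)
        # For min_run = 3 this plays the role of LATE.LATE.LATE+.
        for m in re.finditer("L{%d,}" % min_run, labels):
            run = part[m.start():m.end()]
            avg = sum(r[4] for r in run) / len(run)
            # (flight, date of first late row, number of delays, average delay)
            results.append((fl_num, run[0][1:4], len(run), avg))
    return results

runs = late_runs(flights)
print(runs)
```

Each match is a set of consecutive rows, and the aggregates (count and average) are computed over that set, as in the third NPath argument.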
Runtime: PTF execution
[Diagram labels: Hive Translator; Input DataSet; MR Job; Map Splits; Map Task (Rows, Table Scan, Select); Partition; Reduce Task (Rows, Join, PTF, FileSink); Partition; Function. Shuffle controlled by partition and order specification.]

A PartitionedTableFunction (PTF), given a Partition, computes an output Partition.
An invocation of a PTF specifies how the input dataset should be partitioned and ordered.
A PTF defines the shape of its Output.
A PTF may operate on raw data before it is partitioned and ordered.

{rr} Case Study on Windowing

Case I: Landing/Exit Page Rate
• Landing page: first page the user lands on within a session
• Exit page: last page the user views before leaving the session
• Landing rate: distribution of landing events by page type
• Exit rate: distribution of exit events by page type
• Usage: SEO & Advertising

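Case I: Landing/Exit Page
SELECT eventdate, landPage, exitPage, COUNT(DISTINCT sessionid) as sessions
FROM (
  SELECT sessionid, eventdate,
    -- full-partition frame so every row sees the first and last page
    first_value(pageType) over (partition by sessionid order by timestamp asc
      rows between unbounded preceding and unbounded following) as landPage,
    last_value(pageType) over (partition by sessionid order by timestamp asc
      rows between unbounded preceding and unbounded following) as exitPage
  FROM (
    SELECT pageType, eventdate, sessionid, timestamp,
      -- c: total page views in the session (no ORDER BY, so not a running count)
      count(*) over (PARTITION BY sessionid) as c,
      rank() over (PARTITION BY sessionid order by timestamp asc) as r
    FROM views
    WHERE siteid = 1 and
      eventdate >= '2013-01-01' and eventdate < '2013-01-13'
  ) a
  WHERE r = 1 or r = c   -- keep only the first and last view of each session
) b
GROUP BY eventdate, landPage, exitPage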
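
The same landing/exit computation can be sketched in Python on a small in-memory sample. The tuples and the `landing_exit_counts` helper are illustrative assumptions; the sketch ignores the event date and site filters used in the query.

```python
from collections import Counter
from itertools import groupby

# (sessionid, timestamp, page_type); hypothetical page views.
views = [
    ("s1", 1, "home"), ("s1", 2, "search"), ("s1", 3, "item"),
    ("s2", 1, "search"), ("s2", 5, "cart"),
    ("s3", 7, "home"),
]

def landing_exit_counts(views):
    """Count sessions per (landing page, exit page) pair."""
    counts = Counter()
    ordered = sorted(views)  # by session, then timestamp
    for _, rows in groupby(ordered, key=lambda v: v[0]):
        rows = list(rows)
        # First row is the landing page, last row is the exit page.
        counts[(rows[0][2], rows[-1][2])] += 1
    return counts

pairs = landing_exit_counts(views)
print(pairs)
```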
Case I: Landing Page Breakdown

Case I: Landing Page Time Series

Case I: Exit Page Time Series

Case II a: Bounce Rate (by Page Type)
• Single page in session
• Landing Page is equal to Exit Page
• Usage: Site engagement metrics report

Case II a: Bounce Rate (by Page Type)
SELECT page_type, eventdate,
  sum(case when c = 1 then 1 else 0 end) as bounce_count,
  count(1) as total_sessions
FROM (
  SELECT page_type, eventdate, sessionid, timestamp,
    -- c: total page views in the session and day (no ORDER BY, so not a running count)
    count(*) over (PARTITION BY sessionid, eventdate) as c,
    rank() over (PARTITION BY sessionid, eventdate order by
      timestamp asc) as bounce_rank
  FROM views
  WHERE siteid = 1 and
    eventdate >= '2013-01-01' and eventdate < '2013-01-13'
) a
WHERE bounce_rank = 1   -- one row per session: its landing page
GROUP BY page_type, eventdate
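
A minimal Python sketch of the bounce-rate logic, assuming invented sample data and a hypothetical `bounce_rate_by_landing_page` helper: a session bounces when it contains exactly one page view, and the rate is reported per landing page type.

```python
from collections import defaultdict

# (sessionid, timestamp, page_type); hypothetical page views.
views = [
    ("s1", 1, "home"), ("s1", 2, "item"),
    ("s2", 1, "home"),
    ("s3", 4, "search"),
    ("s4", 2, "home"),
]

def bounce_rate_by_landing_page(views):
    """Return {landing page type: bounced sessions / total sessions}."""
    by_session = defaultdict(list)
    for sessionid, ts, page_type in views:
        by_session[sessionid].append((ts, page_type))
    bounces = defaultdict(int)
    totals = defaultdict(int)
    for rows in by_session.values():
        landing = min(rows)[1]        # page type of the earliest view
        totals[landing] += 1
        if len(rows) == 1:            # single page in session: a bounce
            bounces[landing] += 1
    return {page: bounces[page] / totals[page] for page in totals}

rates = bounce_rate_by_landing_page(views)
print(rates)
```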

Case II a: Bounce Rate (by Page Type)

Case II a: Bounce Rate Time Series

Case II b: New versus Repeat Traffic
• Comparison metric between first-time visitors to the site and visitors who came back more than once
• Usage: Insights into audience optimization

Case II b: New vs Repeat Traffic
SELECT userid, siteid, eventdate,
  sum(case when c = 1 then 1 else 0 end) as new_users,
  sum(case when c > 1 then 1 else 0 end) as repeat_users
FROM (
  SELECT userid, siteid, eventdate,
    -- c: total rows for the user on the site (no ORDER BY, so not a running count)
    count(*) over (PARTITION BY userid, siteid) as c,
    rank() over (PARTITION BY userid, siteid order by
      eventdate) as visit_rank
  FROM views
  WHERE siteid = 1 and
    eventdate >= '2013-01-01' and eventdate < '2013-01-14'
) page_views
WHERE visit_rank = 1
GROUP BY userid, siteid, eventdate;
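
One way to read the new-versus-repeat metric is sketched below in Python. Note the assumption, which differs from the query above: a user is treated as "new" when seen on exactly one distinct date in the range and as "repeat" when seen on more than one, whereas the query counts rows. The data and the `new_vs_repeat` helper are invented for illustration.

```python
from collections import defaultdict

# (userid, eventdate); hypothetical page views within the date range.
visits = [
    ("u1", "2013-01-01"), ("u1", "2013-01-01"),
    ("u2", "2013-01-02"), ("u2", "2013-01-05"),
    ("u3", "2013-01-03"),
]

def new_vs_repeat(visits):
    """Return (new_users, repeat_users) based on distinct visit dates."""
    days = defaultdict(set)           # userid -> distinct dates seen
    for userid, eventdate in visits:
        days[userid].add(eventdate)
    new_users = sum(1 for d in days.values() if len(d) == 1)
    repeat_users = sum(1 for d in days.values() if len(d) > 1)
    return new_users, repeat_users

new_users, repeat_users = new_vs_repeat(visits)
print(new_users, repeat_users)
```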

Case III: Path to Purchase
• Most commonly taken path which leads to a purchase
• Example: search page → item page → add to cart → purchase
• Usage: Site Optimization, Attribution Models

Case III: Path to Purchase
SELECT sessionid, eventdate,
  collect_set(page_type) as path_to_purchase
FROM (
  SELECT sessionid, eventdate, page_type,
    -- full-partition frame so last_value is the session's final page
    last_value(page_type) over (PARTITION BY sessionid, eventdate
      order by timestamp
      rows between unbounded preceding and unbounded following) as last_page
  FROM (
    SELECT sessionid, eventdate, timestamp, 'purchase' as page_type
    FROM purchases
    WHERE siteid=999 and eventdate = '2013-01-01'
    UNION ALL
    SELECT sessionid, eventdate, timestamp, page_type
    FROM views
    WHERE siteid = 1 and eventdate = '2013-01-01'
  ) a
) b
WHERE
  last_page = 'purchase'
GROUP BY sessionid, eventdate
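
A small Python sketch of the path-to-purchase idea, under assumed sample data and a hypothetical `paths_to_purchase` helper. Unlike `collect_set`, it keeps the pages in time order and keeps repeated pages, so each path is an ordered tuple.

```python
from collections import Counter, defaultdict

# (sessionid, timestamp, page_type); hypothetical merged views and purchases.
events = [
    ("s1", 1, "search"), ("s1", 2, "item"), ("s1", 3, "cart"), ("s1", 4, "purchase"),
    ("s2", 1, "home"), ("s2", 2, "item"),
    ("s3", 1, "search"), ("s3", 2, "item"), ("s3", 3, "cart"), ("s3", 4, "purchase"),
]

def paths_to_purchase(events):
    """Count ordered page paths for sessions that end in a purchase."""
    by_session = defaultdict(list)
    for sessionid, ts, page_type in events:
        by_session[sessionid].append((ts, page_type))
    paths = Counter()
    for rows in by_session.values():
        path = tuple(page for _, page in sorted(rows))
        if path[-1] == "purchase":    # same test as last_page = 'purchase'
            paths[path] += 1
    return paths

most_common_path, n = paths_to_purchase(events).most_common(1)[0]
print(most_common_path, n)
```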
Case IV: Most Frequent Next Action
• The path a user takes says a lot about the user experience
• Next most common action
• Example: Search → item page
• Usage: Site Optimization

Case IV: Most Frequent Next Action
SELECT page_type, next_page_type, count(*) as c
FROM (
  SELECT sessionid, page_type,
    -- lead(..., 1) looks one row ahead within the ordered session
    lead(page_type, 1) OVER (PARTITION BY sessionid ORDER BY
      timestamp asc) as next_page_type
  FROM views
  WHERE siteid = 1 and eventdate = '2013-01-01'
) a
GROUP BY page_type, next_page_type;
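
The pairing of each page with the next one can be sketched in Python by zipping a session's ordered pages with the same list shifted by one. The sample data and the `next_action_counts` helper are assumptions for this illustration; unlike the query, the sketch does not emit a row for the last page of a session.

```python
from collections import Counter, defaultdict

# (sessionid, timestamp, page_type); hypothetical page views.
views = [
    ("s1", 1, "search"), ("s1", 2, "item"), ("s1", 3, "cart"),
    ("s2", 1, "search"), ("s2", 2, "item"),
    ("s3", 1, "home"), ("s3", 2, "search"),
]

def next_action_counts(views):
    """Count (page, next page) transitions across all sessions."""
    by_session = defaultdict(list)
    for sessionid, ts, page_type in views:
        by_session[sessionid].append((ts, page_type))
    counts = Counter()
    for rows in by_session.values():
        pages = [page for _, page in sorted(rows)]
        # zip pairs each page with the page one row ahead, like lead(page, 1).
        counts.update(zip(pages, pages[1:]))
    return counts

transitions = next_action_counts(views)
print(transitions.most_common(1))
```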

Case IV: Most Frequent Next Action

Case V: Purchase Co-Occurrence
• People who bought X also bought Y
• List of products most frequently bought in the same orders as a user-specified list of products
• Usage: Provides behavioral insights that would not surface in sales metrics

Case V: Purchase Co-Occurrence
SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber,
  prods
FROM (
  SELECT siteid, eventdate, userid, sessionid, ip, timestamp,
    purchase_complete_page.ordernumber as ordernumber,
    prods.productid as productid,
    -- matches: how many of the requested products appear in the order
    sum(case when find_in_set(prods.productid, 'P1,P2,P3') > 0
        then 1 else 0 end)
      OVER (PARTITION BY purchase_complete_page.ordernumber
        rows between unbounded preceding and unbounded following) as matches,
    -- prods: every product in the same order
    collect_set(prods.productid)
      OVER (PARTITION BY purchase_complete_page.ordernumber
        rows between unbounded preceding and unbounded following) as prods,
    rank() OVER (PARTITION BY purchase_complete_page.ordernumber
      ORDER BY prods.productid) as r
  FROM purchases
    LATERAL VIEW explode(purchase_complete_page.productspurchased)
    prodTable as prods
  WHERE eventdate >= $P{startdate} and
    eventdate <= $P{enddate} and
    siteid = $P{siteid}
) t
WHERE matches >= 3 and r = 1   -- one row per order that has all 3 products
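
The co-occurrence logic can be sketched in Python as a set-containment test per order. The order data and the `co_purchased` helper are invented for the example, and each order is assumed to list a product at most once.

```python
from collections import Counter

# order number -> set of product ids; hypothetical orders.
orders = {
    "o1": {"P1", "P2", "P3", "P9"},
    "o2": {"P1", "P2", "P3", "P9", "P7"},
    "o3": {"P1", "P2", "P8"},
    "o4": {"P1", "P2", "P3"},
}

def co_purchased(orders, wanted):
    """Count products bought in orders that contain every wanted product."""
    wanted = set(wanted)
    counts = Counter()
    for products in orders.values():
        if wanted <= set(products):   # the order contains all wanted products
            counts.update(set(products) - wanted)
    return counts

also_bought = co_purchased(orders, ["P1", "P2", "P3"])
print(also_bought.most_common())
```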

Solution: Current, Next & Future

Solution: Current
History of this project (started by Harish Butani)
• First provided this functionality on top of Hive
• See the GitHub project and Harish Butani's Hadoop Summit talk for details
• Had more functions and features, but not ideal
• So started to fold into Hive in November 2012
• 3 patches for HQL: see Jira HIVE-896
• A separate 'windowing & ptf' Hive branch

Hive Journey
• Available as HiveQL
• Currently part of Hive 0.11
• Equivalent to functionality provided by Postgres
• Differences are documented in Jira HIVE-4197

Solution: Next
Solidify Infrastructure
• Performance improvements
• Dynamic Registration of PTFs.

More Functions
• Candidate Frequent Itemsets: key process in Market Basket Analysis
• TimeLine: another kind of time series analysis, based on a RichRelevance use case.

Solution: Future
Use PTF mechanism to integrate:
• R as R script PTF
• Mahout functions as Mahout PTF
• Groovy script PTF

Query structure:
Select ….
From Rscript(
  'r script'
  on Npath(args…
    On Flights..
  )
)
• Npath identifies interesting incidents
• Use R to make the final decision

[Diagram labels: Reduce Task; Rows; Join; PTF; FileSink; rFn.; rJava; rEngine; Partition; R Data Frame]

Multi-pass PTF Operator:
• Enable iterative algorithms: clustering, market basket analysis, graph traversal, etc.
{rr} richrelevance
is hiring!

Thank You


Contenu connexe

Tendances

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseAsis Mohanty
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingDataWorks Summit
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
 
High-Performance Analytics with Probabilistic Data Structures: the Power of H...
High-Performance Analytics with Probabilistic Data Structures: the Power of H...High-Performance Analytics with Probabilistic Data Structures: the Power of H...
High-Performance Analytics with Probabilistic Data Structures: the Power of H...Databricks
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Spark Summit
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsDataWorks Summit
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityFabian Hueske
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixDataWorks Summit
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLDataWorks Summit
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 

Tendances (20)

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
SQLBits XI - ETL with Hadoop
SQLBits XI - ETL with HadoopSQLBits XI - ETL with Hadoop
SQLBits XI - ETL with Hadoop
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
High-Performance Analytics with Probabilistic Data Structures: the Power of H...
High-Performance Analytics with Probabilistic Data Structures: the Power of H...High-Performance Analytics with Probabilistic Data Structures: the Power of H...
High-Performance Analytics with Probabilistic Data Structures: the Power of H...
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 

En vedette

Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionMurtaza Doctor
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupRemus Rusanu
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetupt3rmin4t0r
 
Really using Oracle analytic SQL functions
Really using Oracle analytic SQL functionsReally using Oracle analytic SQL functions
Really using Oracle analytic SQL functionsKim Berg Hansen
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveDataWorks Summit
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetupt3rmin4t0r
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionCloudera, Inc.
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive
 
Tech M&A Monthly: 10 Ways to Increase Your Company's Value
Tech M&A Monthly: 10 Ways to Increase Your Company's ValueTech M&A Monthly: 10 Ways to Increase Your Company's Value
Tech M&A Monthly: 10 Ways to Increase Your Company's ValueCorum Group
 

En vedette (20)

Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Analytics using Apache Hive with Windowing and Table Functions

  • 1. Analytics using Apache Hive with the power of Windowing and Table functions: Use Cases Murtaza Doctor - murtaza@richrelevance.com Principal Architect, RichRelevance © 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
  • 2. Outline
      • {rr} story
      • What is Clickstream Analytics
      • Hive at {rr}
      • Windowing & PTF Framework
      • Case Study: use cases
      • Current, Next & Future
      • Q&A
  • 3. RichRelevance {rr}
  • 4. RichRelevance DataMesh
      • Data Ingestion (3rd party, realtime, offline data feeds): Clickstream, Catalog, Online sales, In-store sales, Ad impressions, Social profiles, Redemptions…
      • Customer Data Store
      • Analytics & Optimization: 125+ models, Customer models, Product models, A/B and MVT testing, King-of-the-hill optimization
      • Delivery: Real-time Decisioning (65 msec), Event Triggered (minutes), Batch Updates (hours), Reporting (ad hoc, OLAP, Excel)
      • [Client] Innovation Cloud: Custom apps and APIs, Self-Serve Analytics, Personalized Category Sort, Real-time Segmentation, Network Ad Tracking
      • {rr} SaaS & APIs
      • Underlying Technologies: Hadoop, HBase, Hive, Kafka, Avro, Voldemort, Postgres, Pentaho OLAP, R
  • 5. Did You Know?
      • Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time.
      • Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Azkaban, Voldemort and Kafka, to name a few.
      • In the US, we serve 7000 requests per second with an average response time of 50 ms.
      • Someone clicks on a {rr} recommendation every 21 milliseconds.
  • 6. What is Clickstream Analytics?
  • 7. What is Clickstream Analytics?
      • Collect, Combine, Aggregate & Analyze
      • Clickstream: view, click, purchase events
      • It is all about the Session or Visit
      • User properties: userId, location etc.
      • Site Optimization, Sentiment Analysis, Buying Patterns and many more
      • Example: we use click-through rate (clicks/sessions) to measure how well ad placement positions are doing on pages, and can then test them based on engagement to see whether other positions would work better.
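The click-through rate in the example above (clicks/sessions per ad placement) can be sketched in a few lines of Python. This is only an illustration: the event tuple layout `(session_id, placement, event_type)` and the placement names are assumptions, not the {rr} event schema.

```python
from collections import defaultdict

def click_through_rate(events):
    """Return {placement: clicks / sessions} for (session_id, placement, event_type) tuples."""
    sessions = defaultdict(set)  # placement -> distinct sessions that saw it
    clicks = defaultdict(int)    # placement -> number of click events
    for session_id, placement, event_type in events:
        sessions[placement].add(session_id)
        if event_type == "click":
            clicks[placement] += 1
    return {p: clicks[p] / len(s) for p, s in sessions.items()}

# Hypothetical sample events for two placements.
events = [
    ("s1", "top", "view"), ("s1", "top", "click"),
    ("s2", "top", "view"),
    ("s3", "sidebar", "view"),
    ("s4", "sidebar", "view"), ("s4", "sidebar", "click"),
    ("s5", "sidebar", "view"),
]
rates = click_through_rate(events)
```

Comparing `rates` across placements is the kind of engagement test the slide describes.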
  • 8. Getting MAD on Hive
  • 9. MAD skills from Hellerstein’s paper. From a developer's perspective, a data platform should be:
      • Magnetic: attract data. As opposed to: having to justify loading any new data to a DBA; the quality and schema regimes have the adverse effect of repelling data.
      • Agile: data comes in many shapes and forms; enable bringing in data in its native form. As opposed to: forcing a complex ETL process to bring data in. With Hive: pluggable formats, storage handlers and indices.
      • Deep: ability to operate on data directly, using existing algorithms that operate on native formats (SQL + M/C Learning + Graph + …). As opposed to: only SQL. With Hive: SQL + Map Reduce scripts. But can we do better?
  • 10. Problem for App Developer
      • I want to do: Sessionization, Clustering, Collaborative Filtering, Fraud detection, Time Series Analysis, Churn Analysis
      • And I want to combine these analyses with SQL
      • Analytic capabilities are available in most databases as:
          • User Defined Table Functions
          • External Table mechanisms etc.
          • The Aster SQL/MR library provides functions for many of the use cases above
          • Oracle Stored Procedures + Table Functions are used to provide analytic packages
      • Our work: bring the same capability to Hive.
  • 11. Hive at {rr}
      • Real-time data in Hive
      • Getting to 1PB of data in Hive!
      • Hive Tables: Event types, Catalog, Rollups etc.
      • Custom Serde
      • Partitioning scheme: most of the tables are partitioned by event date
  • 12. Architecture
  • 13. Roadblocks to the Solution
      • Too many temporary tables
      • Random sampling
      • R for ranking & aggregate functions
      • R can only handle smaller data sets
      • Lots of self-joins
      • Inefficient queries
  • 14. Welcome to PTFs and Windowing
  • 15. 3 Major SQL Concepts
      • 1. Table Function
          • Enables injecting custom logic into the query data flow
          • The contract for a TF is Table-In/Table-Out
          • So it opens up analysis beyond row calculations and aggregations
          • Example: a Sessionize function that decides which weblog entries belong to a session
          • Syntactically, the function can appear anywhere a table can in SQL
          • (Diagram labels: Select, Join, tableIn, Table Function, tableOut, Project)
      • 2. Partitioned Table Function
          • A scaling mechanism
          • Instead of operating on the entire table, divide the work into Partitions
          • Instances operating on individual Partitions don't communicate
          • Example: divide the weblog by day or week and operate on each part independently
          • Intuitively like MR: PTF processing is done as MR jobs
          • (Diagram labels: Select, Join, Partitions In, Table Function, Partitions Out, Project)
      • 3. Windowing
          • Operate on a set of rows surrounding the current row
          • Windows are defined like '5 preceding and 4 succeeding'
          • On the window, allow aggregations and also navigation: lead, lag, First, Last
      • PTFs and Windows are related
          • You do windowing after everything else: join, group by etc.
          • You define windows on ordered Partitions
          • You then do aggregations and inter-row navigation on these windows
          • If all the Partitions across all Window expressions are the same, then this is a special PTF
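To make the Table-In/Table-Out contract and the partitioning idea concrete, here is a minimal Python sketch, not the Hive implementation. The helper `apply_ptf`, the 30-minute inactivity gap and the `user`/`ts` field names are all assumptions chosen for illustration.

```python
from itertools import groupby

def apply_ptf(rows, partition_key, order_key, table_fn):
    """Partition rows, order each partition, and run table_fn on each partition independently."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, partition in groupby(ordered, key=lambda r: r[partition_key]):
        # Each call sees only its own partition; partitions never communicate.
        out.extend(table_fn(list(partition)))
    return out

def sessionize(partition, gap=1800):
    """Table-in/table-out: tag each row with a session number.

    A new session starts after more than `gap` seconds of inactivity (assumed rule).
    """
    session, last_ts, result = 0, None, []
    for row in partition:
        if last_ts is not None and row["ts"] - last_ts > gap:
            session += 1
        result.append({**row, "session": session})
        last_ts = row["ts"]
    return result

# Hypothetical weblog: user u1 returns after a long pause, user u2 has one hit.
weblog = [
    {"user": "u1", "ts": 0}, {"user": "u1", "ts": 600}, {"user": "u1", "ts": 4000},
    {"user": "u2", "ts": 100},
]
sessions = apply_ptf(weblog, "user", "ts", sessionize)
```

Because `sessionize` only ever sees one ordered partition, the per-partition calls could run in separate reduce tasks, which is the scaling argument the slide makes.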
  • 16. Ordered Partition: the central concept
      • In Functions (for example, select * From Sessionize(weblog….)):
          • Analyze a partition of rows as a unit
          • Output is not a summary of rows
          • Sessionization: relate events to sessions
          • Market Basket: find the most common Product/Page combinations
      • In Windowing:
          • Ranking: Rank, Tiling
          • Trending: Lead/Lag, Cumulative Sum
          • Example: SELECT ViewsData.*, rank() over (DISTRIBUTE BY sessionid SORT BY timestamp DESC) as exit_rank FROM ViewsData
      • (Diagram labels: Partition, Hive Translator, Output Partition)
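The ranking example above can be emulated in plain Python to show what "rank within an ordered partition" means. This is a sketch under assumed field names (`sessionid`, `timestamp`, `page`) and numeric timestamps, not the code Hive generates.

```python
from itertools import groupby

def exit_rank(views):
    """Rank each view within its session by timestamp descending (rank 1 = exit page)."""
    ordered = sorted(views, key=lambda v: (v["sessionid"], -v["timestamp"]))
    out = []
    for _, rows in groupby(ordered, key=lambda v: v["sessionid"]):
        prev_ts, rank = None, 0
        for pos, row in enumerate(rows, start=1):
            if row["timestamp"] != prev_ts:
                rank = pos  # like SQL rank(): ties share a rank, gaps follow ties
            prev_ts = row["timestamp"]
            out.append({**row, "exit_rank": rank})
    return out

# Hypothetical page views for two sessions.
views = [
    {"sessionid": "s1", "timestamp": 1, "page": "home"},
    {"sessionid": "s1", "timestamp": 2, "page": "item"},
    {"sessionid": "s1", "timestamp": 3, "page": "cart"},
    {"sessionid": "s2", "timestamp": 5, "page": "search"},
]
ranked = exit_rank(views)
exit_pages = [r["page"] for r in ranked if r["exit_rank"] == 1]
```

Every input row is kept and gains a column, which is the sense in which the output "is not a summary of rows".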
  • 17. Example: Time Series Analysis
      • Time Series Analysis: identify Flights that have a delay problem.
      • We want to look at all the times a Flight happened and then make a judgment.
      • To do this, one conceivable starting point is to find occurrences where a Flight was late 3 or more times in a row.
      • Use these as a starting point for further analysis.
      • Flights Table:
          Origin      Fl. Num  Year  Month  Day  Arr. Delay
          Boston      1017     2010  10     25   59.37
          Boston      1017     2010  10     26   58.14
          Boston      1017     2010  10     28   30.83
          Boston      1017     2010  10     29   25.67
          Pittsburgh  1058     2010  12     26   82.62
      • Analyze rows by Fl. Number. Look for sequences of Late incidents.
      • Output aggregation statistics about these sequences:
          Origin      FlNum  Year  Month  Day  Avg. Delay  Num Of Delays
          Boston      1017   2010  10     25   59.37       8
          Boston      1017   2010  11     10   41.54       7
          Pittsburgh  1058   2010  12     26   82.62       8
  • 18. Use NPath PTF
      • Use a PTF: NPath
          • Helps you look for patterns in Time
          • User specifies Labels: interesting conditions, e.g. LATE: arr_delay > 15 mins
          • Then specifies Patterns on Labels. Patterns are simple Regexes, e.g. LATE.LATE.LATE+ → look for occurrences where a flight is late 3 or more times.
          • On Occurrences found (Occurrences are a set of rows), specify aggregation calculations, e.g. average delay among late occurrences, number of delays.
      • 1. Query on Flights Table:
          select origin_city_name, fl_num, year, month, day_of_month, sz, tpath
          from NPATH(
            'LATE.LATE+',
            'LATE', arr_delay > 15,
            'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as numDelay, arrAvg(tpath, "arrDelay") as avgDelay'
            on flights
            distribute by fl_num
            sort by year, month, day_of_month
          )
      • 2. distribute by / sort by: looking at data per Flight; order within partition by time.
      • 3. Arguments: Arg. 1 specifies the PATTERN; Arg. 2 specifies conditions as LABELS; Arg. 3 specifies the AGGR. EXPRESSIONS.
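As a rough Python analogue of the LATE.LATE.LATE+ pattern (not the NPath implementation), the sketch below labels each row LATE when `arr_delay` exceeds 15 minutes, finds maximal runs of 3 or more consecutive LATE rows per flight, and aggregates each run. The function name `late_streaks` and the dictionary field names are assumptions; the sample rows are the five shown in the Flights Table slide, so the result differs from the slide's output table, which was computed on the full data set.

```python
from itertools import groupby

def late_streaks(flights, threshold=15.0, min_len=3):
    """Find runs of at least min_len consecutive late arrivals per flight number."""
    ordered = sorted(flights, key=lambda f: (f["fl_num"], f["year"], f["month"], f["day"]))
    out = []
    for fl_num, rows in groupby(ordered, key=lambda f: f["fl_num"]):
        # Label every row LATE / not LATE, then group consecutive rows sharing a label.
        for is_late, run in groupby(rows, key=lambda f: f["arr_delay"] > threshold):
            run = list(run)
            if is_late and len(run) >= min_len:
                delays = [f["arr_delay"] for f in run]
                out.append({
                    "fl_num": fl_num,
                    "start": (run[0]["year"], run[0]["month"], run[0]["day"]),
                    "num_delays": len(run),
                    "avg_delay": sum(delays) / len(delays),
                })
    return out

# The five sample rows from the Flights Table slide.
flights = [
    {"origin": "Boston", "fl_num": 1017, "year": 2010, "month": 10, "day": 25, "arr_delay": 59.37},
    {"origin": "Boston", "fl_num": 1017, "year": 2010, "month": 10, "day": 26, "arr_delay": 58.14},
    {"origin": "Boston", "fl_num": 1017, "year": 2010, "month": 10, "day": 28, "arr_delay": 30.83},
    {"origin": "Boston", "fl_num": 1017, "year": 2010, "month": 10, "day": 29, "arr_delay": 25.67},
    {"origin": "Pittsburgh", "fl_num": 1058, "year": 2010, "month": 12, "day": 26, "arr_delay": 82.62},
]
streaks = late_streaks(flights)
```

Flight 1058 has only one late row in this sample, so only flight 1017 produces a streak.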
  • 19. Runtime: PTF execution
      • (Diagram labels: Hive Translator, Input DataSet, MR Job, Map Splits, Map Task, Table Scan, Select, Join, Reduce Task, Partition, PTF, Partition Function, FileSink; the shuffle is controlled by the partition and order specification.)
      • A PartitionedTableFunction (PTF), given a Partition, computes an output Partition.
      • An invocation of a PTF specifies how the input dataset should be partitioned and ordered.
      • A PTF defines the shape of its Output.
      • A PTF may operate on raw data before it is partitioned and ordered.
  • 20. {rr} Case Study on Windowing
  • 21. Case I: Landing/Exit Page Rate
      • Landing page: the first page the user lands on within a session
      • Exit page: the last page the user views before leaving a session
      • Landing rate: distribution of landing events by page type
      • Exit rate: distribution of exit events by page type
      • Usage: SEO & Advertising
  • 22. Case I: Landing/Exit Page
      SELECT eventdate, landPage, exitPage, COUNT(DISTINCT sessionid)
      FROM (
        SELECT sessionid, eventdate,
               first_value(pageType) over (partition by sessionid) as landPage,
               last_value(pageType) over (partition by sessionid) as exitPage
        FROM (
          SELECT pageType, eventdate, sessionid, timestamp,
                 count(*) over (PARTITION BY sessionid order by timestamp asc) as c,
                 rank() over (PARTITION BY sessionid order by timestamp asc) as r
          FROM views
          WHERE siteid = 1 and eventdate >= '2013-01-01' and eventdate < '2013-01-13'
        ) a
        WHERE r = 1 or r = c
      ) b
      GROUP BY eventdate, landPage, exitPage
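The intent of the landing/exit query, taking the first and last page of each session and counting sessions per combination, can be checked on a small data set with this Python sketch. The column names follow the query; the sample rows and the function name `landing_exit_counts` are hypothetical.

```python
from collections import Counter
from itertools import groupby

def landing_exit_counts(views):
    """Count sessions per (eventdate, landing page, exit page)."""
    ordered = sorted(views, key=lambda v: (v["sessionid"], v["timestamp"]))
    counts = Counter()
    for _, rows in groupby(ordered, key=lambda v: v["sessionid"]):
        rows = list(rows)
        first, last = rows[0], rows[-1]  # first_value / last_value over the ordered session
        counts[(first["eventdate"], first["pageType"], last["pageType"])] += 1
    return counts

day = "2013-01-01"
views = [
    {"sessionid": "s1", "eventdate": day, "timestamp": 1, "pageType": "home"},
    {"sessionid": "s1", "eventdate": day, "timestamp": 2, "pageType": "item"},
    {"sessionid": "s1", "eventdate": day, "timestamp": 3, "pageType": "cart"},
    {"sessionid": "s2", "eventdate": day, "timestamp": 1, "pageType": "search"},
    {"sessionid": "s2", "eventdate": day, "timestamp": 2, "pageType": "item"},
    {"sessionid": "s3", "eventdate": day, "timestamp": 5, "pageType": "home"},
    {"sessionid": "s3", "eventdate": day, "timestamp": 9, "pageType": "cart"},
]
counts = landing_exit_counts(views)
```

Dividing each count by the total number of sessions for the day gives the landing and exit rate distributions described in the previous slide.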
  • 23. Case I: Landing Page Breakdown
  • 24. Case I: Landing Page Time Series
  • 25. Case I: Exit Page Time Series
  • 26. Case II a: Bounce Rate (by Page Type)
      • Single page in session
      • Landing Page is equal to Exit Page
      • Usage: Site engagement metrics report
  • 27. Case II a: Bounce Rate (by Page Type)
      SELECT page_type, eventdate,
             sum(case when c=1 then 1 else 0 end) as bounce_count,
             count(1) as total_sessions
      FROM (
        SELECT page_type, eventdate, sessionid, timestamp,
               count(*) over (PARTITION BY sessionid, eventdate order by timestamp asc) as c,
               rank() over (PARTITION BY sessionid, eventdate order by timestamp asc) as bounce_rank
        FROM views
        WHERE siteid = 1 and eventdate >= '2013-01-01' and eventdate < '2013-01-13'
      ) a
      WHERE bounce_rank = 1
      GROUP BY page_type, eventdate
  • 28. Case II a: Bounce Rate (by Page Type)
  • 29. Case II a: Bounce Rate Time Series
  • 30. Case II b: New versus Repeat Traffic • Comparison of first-time visitors to the site versus visitors who came back more than once • Usage: Insights into audience optimization
  • 31. Case II b: New vs Repeat Traffic
      SELECT userid, siteid, eventdate,
             sum(case when c=1 then 1 else 0 end) new_users,
             sum(case when c>1 then 1 else 0 end) repeat_users
      FROM (
        SELECT userid, siteid, eventdate,
               count(*) over (PARTITION BY userid, siteid order by eventdate) as c,
               rank() over (PARTITION BY userid, siteid order by eventdate) as rank
        FROM views
        WHERE siteid = 1 and eventdate >= '2013-01-01' and eventdate < '2013-01-14'
      ) page_views
      WHERE rank = 1
      GROUP BY userid, siteid, eventdate;
  • 32. Case III: Path to Purchase • Most commonly taken path which leads to a purchase • Example: search page -> item page -> add to cart -> purchase • Usage: Site Optimization, Attribution Models
  • 33. Case III: Path to Purchase
      SELECT sessionid, eventdate, collect_set(page_type) as path_to_purchase
      FROM (
        SELECT sessionid, eventdate, page_type,
               last_value(page_type) over (PARTITION BY sessionid, eventdate order by timestamp) as last_page
        FROM (
          SELECT sessionid, eventdate, timestamp, 'purchase' as page_type
          FROM purchases
          WHERE siteid = 999 and eventdate = '2013-01-01'
          UNION ALL
          SELECT sessionid, eventdate, timestamp, page_type
          FROM views
          WHERE siteid = 1 and eventdate = '2013-01-01'
        ) a
      ) b
      WHERE last_page = 'purchase'
      GROUP BY sessionid, eventdate
  • 34. Case IV: Most Frequent Next Action • The path a user takes says a lot about the user experience • Next most common action • Example: search -> item page • Usage: Site Optimization
  • 35. Case IV: Most Frequent Next Action
      SELECT page_type, next_page_type, count(*) as c
      FROM (
        SELECT sessionid, page_type,
               lead(page_type, 1) OVER (PARTITION BY sessionid order by timestamp asc) as next_page_type,
               count(*) OVER (PARTITION BY sessionid order by timestamp asc) as c,
               rank() OVER (PARTITION BY sessionid order by timestamp asc) as page_view
        FROM views
        WHERE siteid = 1 and eventdate = '2013-01-01'
      ) a
      GROUP BY page_type, next_page_type;
  • 36. Case IV: Most Frequent Next Action
  • 37. Case V: Purchase Co-Occurrence • People who bought X also bought Y • List of products most frequently bought in the same orders as a user-specified list of products • Usage: Provides behavioral insights that would not surface in sales metrics
  • 38. Case V: Purchase Co-Occurrence
      SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber, prods
      FROM (
        SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber,
               prods.productid as productid,
               sum(case when find_in_set(prods.productid, 'P1,P2,P3') > 0 then 1 else 0 end)
                 OVER (PARTITION BY purchase_complete_page.ordernumber
                       rows between unbounded preceding and unbounded following) as matches,
               collect_set(prods.productid)
                 OVER (PARTITION BY purchase_complete_page.ordernumber
                       rows between unbounded preceding and unbounded following) as prods,
               rank()
                 OVER (PARTITION BY purchase_complete_page.ordernumber
                       rows between unbounded preceding and unbounded following) as r
        FROM purchases
        LATERAL VIEW explode(purchase_complete_page.productspurchased) prodTable as prods
        WHERE eventdate >= $P{startdate} and eventdate <= $P{enddate} and siteid = $P{siteid}
      ) a
      WHERE matches >= 3 and r = 1
  • 39. Solution: Current, Next & Future {rr}
  • 40. Solution: Current. History of this project (started by Harish Butani): • First provided this functionality on top of Hive • See Github project for details & Hadoop Summit talk from Harish Butani on this • Had more functions and features, but not ideal • So started to fold into Hive in November 2012 • 3 patches for HQL: see Jira 896 • A separate 'windowing & ptf' Hive branch. Hive journey: • Available as HiveQL • Currently part of Hive 0.11 • Equivalent to functionality provided by Postgres • Differences are documented in Jira 4197
  • 41. Solution: Next. Solidify infrastructure: • Performance improvements • Dynamic registration of PTFs. More functions: • Candidate Frequent Itemsets: key process in Market Basket Analysis • TimeLine: another kind of time series analysis, based on a RichRelevance use case.
  • 42. Solution: Future. Use PTF mechanism to integrate: • R as R script PTF • Mahout functions as Mahout PTF • Groovy script PTF. Query structure: Select …. From Rscript( 'r script' on Npath(args… On Flights.. ) ) rFn. • Npath identifies interesting incidents • Use R to make final decision. [Diagram labels: Reduce Task, Rows, Join, PTF, FileSink, rJava, rEngine, Partition, R Data Frame] Multi-pass PTF Operator: • Enable iterative algorithms: Clustering, Market Basket Analysis, Graph traversal etc.
  • 43. {rr} richrelevance is hiring! Thank You
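
The session-level logic behind Cases I, II a and IV can be sketched outside Hive. What follows is a minimal Python sketch, not the Hive implementation: the row layout, the sample data in VIEWS and all function names are assumptions made for illustration. It partitions page views by session, orders them by timestamp, and reads the landing page, exit page, bounce flag and next action off each ordered partition.

```python
from collections import Counter, defaultdict
from operator import itemgetter

# Hypothetical page-view rows: (sessionid, timestamp, page_type).
VIEWS = [
    ("s1", 1, "search"), ("s1", 2, "item"), ("s1", 3, "cart"),
    ("s2", 1, "home"),
    ("s3", 1, "search"), ("s3", 2, "item"),
]

def sessions(rows):
    """PARTITION BY sessionid ORDER BY timestamp: session id -> ordered page types."""
    parts = defaultdict(list)
    for sessionid, ts, page_type in rows:
        parts[sessionid].append((ts, page_type))
    return {sid: [page for _, page in sorted(events, key=itemgetter(0))]
            for sid, events in parts.items()}

def landing_exit_counts(rows):
    """Case I: number of sessions per (landing page, exit page) pair."""
    return Counter((pages[0], pages[-1]) for pages in sessions(rows).values())

def bounce_counts(rows):
    """Case II a: landing page type -> (bounced sessions, total sessions)."""
    stats = defaultdict(lambda: [0, 0])
    for pages in sessions(rows).values():
        stats[pages[0]][0] += 1 if len(pages) == 1 else 0  # single-page session
        stats[pages[0]][1] += 1
    return {page: tuple(counts) for page, counts in stats.items()}

def next_action_counts(rows):
    """Case IV: count (page, next page) pairs, in the spirit of lead(page_type, 1).

    Unlike lead(), which yields NULL on the last row of a partition, this
    sketch simply drops the last page of each session.
    """
    pairs = Counter()
    for pages in sessions(rows).values():
        pairs.update(zip(pages, pages[1:]))
    return pairs
```

With the sample data, session s2 is the only bounce, and ("search", "item") is the most frequent transition. Computing the per-session total from the whole partition, rather than from a running count, is a deliberate choice in this sketch.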

Editor's notes

  1. Hive: data warehouse system for Hadoop. How Harish & I met and decided to collaborate.
  2. How we plan to go over stuff
  3. Nuggets or data points. 1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail industry perspective.
  4. Site Optimization and others are just a few of the use cases which can be solved by leveraging Clickstream Analytics.
  5. Hive usage at {rr}
  6. So the picture in your mind should be: - The user specifies a Function in SQL anywhere a Table can appear. - Behind the scenes: at runtime the Function is responsible for taking a Partition & returning a Partition. Or: - The user specifies one or more Windowing expressions. - Behind the scenes, the internal Windowing Table Function processes the data, partition by partition. The Windowing and PTF infrastructure is the same.
  7. Npath: get the example from Hive
  8. - One last thing, a quick picture of the runtime. - Here is how PTFs fit into the Hive flow. - A Query is translated into a set of Jobs by the Hive Driver. - Within each task, one or more SQL Operators are executed. - These operate on a stream of rows. - For PTFs, a new PTF Operator gets injected into the reduce side. - It collects the rows of a partition into a Partition object and invokes the PTF Function, whose job is to provide an output Partition; the rows of that output get injected back into the stream of rows.
  9. Fluent way to do things
  10. RANK function. The inner query selects a certain set of fields, partitions the data by sessionId and sorts the views in that session by timestamp, i.e. the order in which they occurred, starting with the first one. The query then selects only the first event of that session, which is the row with rank = 1. The outer query groups the data by page_type and applies the count aggregate function to the sessionId.
  11. Example just does a count. Landing events are pages where the referral id is not NULL. Google -> landing events in a session -> item page - non bounce page. Sessions which have one row, the one where rank() = 1. If you want to compute by a session using a time, you are computing a difference between the first & last: FIRST & LAST value.
  12. Highlighting that the window does not have to be a number range; it can be value based. In a row in a session you want to look ahead: what some one time every activity. Timeline function: Table Functions give a lot more leeway, some kind of pathing just like NPATH.
  13. How is it different from the last one? - Lead function - cannot pivot the value - the fundamental patterns are the same
  14. How about the following: If I understand the schema, the query below should give you the Orders and the products purchased that contain all the listed products. So say the products you are looking for are 'P1,P2,P3', then the sum will give you a count of the products in this Order that match one of the listed products. The having clause will filter out all Orders that don't have at least 3 matches (i.e. matching all the listed products). The r = 1 condition will return 1 row per order. The o/p is of the form: OrderNumber, {products in order as a set}, other details… Can of course return each product in the Order as a separate row if you want to do more aggregation, e.g. count the orders that these products appear in and then rank them or set up a cutoff threshold etc.
  15. Notes: R and SQL. This would bring a different way. Pull data into R? Push R functionality to where the data is? Who is thinking about this future?
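
The runtime flow in note 8 (collect the rows of a partition, invoke the function, re-inject its output rows) can be sketched in a few lines. This is a hedged Python illustration and not Hive's PTF Operator: run_ptf, orders_with_all, the dictionary row layout and the sample PURCHASES data are all hypothetical. The partition function used as the example follows note 14: it keeps only orders that contain every listed product and emits one row per order. It uses set containment rather than the slide's match count, which is equivalent only when each product appears at most once per order.

```python
from collections import defaultdict

def run_ptf(rows, partition_key, order_key, fn):
    """Mimic the reduce-side flow: partition rows, order them, invoke fn, re-emit."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    out = []
    for key in sorted(partitions):
        # The function sees one ordered partition and returns an output partition.
        partition = sorted(partitions[key], key=lambda r: r[order_key])
        out.extend(fn(partition))
    return out

def orders_with_all(wanted):
    """Build a partition function that keeps orders containing every wanted product."""
    wanted = set(wanted)

    def fn(partition):
        products = {row["productid"] for row in partition}
        if wanted <= products:
            # One output row per order, carrying the order's product set.
            return [{"ordernumber": partition[0]["ordernumber"],
                     "prods": sorted(products)}]
        return []  # order filtered out: an empty output partition

    return fn

# Hypothetical exploded purchase rows: one row per product in an order.
PURCHASES = [
    {"ordernumber": "o1", "timestamp": 1, "productid": "P1"},
    {"ordernumber": "o1", "timestamp": 2, "productid": "P2"},
    {"ordernumber": "o1", "timestamp": 3, "productid": "P3"},
    {"ordernumber": "o1", "timestamp": 4, "productid": "P9"},
    {"ordernumber": "o2", "timestamp": 1, "productid": "P1"},
    {"ordernumber": "o2", "timestamp": 2, "productid": "P2"},
    {"ordernumber": "o3", "timestamp": 1, "productid": "P3"},
    {"ordernumber": "o3", "timestamp": 2, "productid": "P2"},
    {"ordernumber": "o3", "timestamp": 3, "productid": "P1"},
]

RESULT = run_ptf(PURCHASES, "ordernumber", "timestamp",
                 orders_with_all(["P1", "P2", "P3"]))
```

Because the function receives a whole partition and may return any number of rows, it defines the shape of its own output, which is the property the slides attribute to PTFs.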