This document discusses how TiVo used a graph database and Spark to gain insights from its log data. TiVo collects log data from its set-top boxes but faced challenges analyzing it to optimize the user experience. Their solution was to build a graph in Neo4j in which UI screens and watch events are nodes and the key presses between them are edges. This allowed them to analyze the paths users take and to understand how users discover and access content. The demo analyzed the most popular paths and apps. The deck presents the graph as more flexible than a relational database for this kind of path analysis.
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data
1. Building a Graph Database in Neo4j with Spark & Spark SQL to Gain New Insights from Log Data
ROBERT HRYNIEWICZ – DATA SCIENTIST/EVANGELIST, HORTONWORKS
RACHEL POULSEN – DATA SCIENCE DIRECTOR, TIVO
2. Overview
Overview of business problem
Introduction to TiVo and TiVo data
Challenges with using data to optimize UI navigation
Solution
Demo
Next Steps
Questions
3. TiVo
Background and context
TiVo is a discovery platform for integrated entertainment
Multiple ways to find content
TiVo Roamio Demo Video
TiVo collects ~2 million logs a day from their boxes
User action events, TiVo action events, inventory events
These events are “memory-less”: each one carries no record of what happened before it
8. Motivation
TiVo Business Initiatives
Help users get to content they want faster
Help users discover new content more easily
Is feature X important to discovering and getting to content? (e.g., is the guide still used to find content?)
Challenge
Measuring a KPI for initiatives in a log stream that is “memory-less”
Identifying events or a pattern of events that impacts the KPIs
Data Objective – “Path Analysis”
Analysis that answers questions around the navigational paths users take to get to or from defined start and/or end points
10. Technical Challenges
Relational Database Challenges
Little flexibility in “click path” definition
Decisions about defining “paths” are made during processing step
Many business assumptions have to be made with little insight
18. What’s captured in the graph?
Node (UI)
Name
Timestamp
Node (Watch)
Type, e.g. Recorded
Genre, e.g. TV Show
Timestamp
Edge
Average Time
Total number of keys pressed
Key sequence
e.g. Home Up/Down Select/Play
Total number of times path taken
Unique number of users taking this path
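The node and edge attributes listed above could be modeled roughly as follows. This is a minimal sketch, not the schema used in the talk: the class and field names are illustrative assumptions, and the sample values echo the deck's TiVo Central example. The direction of the sample edge (UI to Watch) and the "Live"/"Movie" split are also assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A UI screen or a Watch event (field names are hypothetical)."""
    kind: str                          # "UI" or "Watch"
    name: str                          # e.g. "TiVo Central" or "Live Movie"
    timestamp: str                     # the deck shows a date such as "1/1/2016"
    watch_type: Optional[str] = None   # Watch nodes only, e.g. "Recorded"
    genre: Optional[str] = None        # Watch nodes only, e.g. "TV Show"


@dataclass(frozen=True)
class Edge:
    """Aggregated transition between two nodes."""
    source: Node
    target: Node
    avg_time_s: float                  # average time
    keys_pressed: int                  # total number of keys pressed
    key_sequence: Tuple[str, ...]      # e.g. ("HOME", "DOWN", "SELECT")
    times_taken: int                   # total number of times path taken
    unique_users: int                  # unique number of users taking this path


# Assumed direction: from the UI node to the Watch node.
central = Node("UI", "TiVo Central", "1/1/2016")
movie = Node("Watch", "Live Movie", "1/1/2016", watch_type="Live", genre="Movie")
edge = Edge(central, movie, 0.4, 3, ("HOME", "DOWN", "SELECT"), 50, 27)
```

Frozen dataclasses keep nodes hashable, which is convenient if nodes are later used as dictionary keys when grouping transitions.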
19. What’s captured in the graph?
Node (UI): TiVo Central, 1/1/2016
Edge: 0.4s average time, 3 keys pressed, Home-Down-Select, 50 times path used, 27 unique users
Node (Watch): Live Movie, 1/1/2016
20. Raw Log File example
…
1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...
1444809715812909|Key|HOME
1444880816123454|UI|TivoCentralScreen
1444809716234553|Key|DOWN
1444809716354363|Key|SELECT
1444809716518701|Trick3|PLAY|116|1|100|-1
1444809719888072|Key|PLAY
1444809719889072|Trick3|PLAY|119|1|100|-1
1444809726966880|Watch|rec|WFXTDT|SH|508|...
…
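A minimal sketch of reading lines in this format, assuming each line is pipe-delimited with a timestamp first and an event type second. Reading the timestamp as microseconds since the Unix epoch is an inference from its length, not something the deck states. `LogEvent` and `parse_line` are hypothetical names, and the remaining fields are kept uninterpreted because the deck does not document them.

```python
from typing import NamedTuple, Tuple


class LogEvent(NamedTuple):
    timestamp_us: int          # assumed: microseconds since the Unix epoch
    event_type: str            # e.g. "Watch", "Key", "UI", "Trick3"
    fields: Tuple[str, ...]    # remaining pipe-delimited values, uninterpreted


def parse_line(line: str) -> LogEvent:
    """Split one pipe-delimited log line into timestamp, type, and payload."""
    parts = line.strip().split("|")
    if len(parts) < 2:
        raise ValueError(f"not a log line: {line!r}")
    return LogEvent(int(parts[0]), parts[1], tuple(parts[2:]))


# Three complete lines copied from the example above.
sample = [
    "1444809715812909|Key|HOME",
    "1444809716234553|Key|DOWN",
    "1444809716518701|Trick3|PLAY|116|1|100|-1",
]
events = [parse_line(s) for s in sample]
```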
21. Filtered Log
...
Watch: LIVE MOVIE
Key: HOME
UI: TIVO HOME
Key: DOWN
Key: SELECT
Key: PLAY
Watch: REC SHOW
…
Annotations: each Watch or UI line is a node; the Key lines between two nodes form an edge. All events shown fall on the same day.
22. Algorithm Overview (1 of 2)
1. Filter for desired events
• Remove non-Screen, non-Watch, non-Key events
2. Session-ize and order logs to reflect Screen/Watch/Edge events
3. Define display for Key Press events - two formats
• Normal: SELECT & UP x 2 & GUIDE & SELECT
• Compact: TIVO & 9 KEYS & SELECT
4. Generate an Edge if transition < max time set by stakeholders (e.g. 5 min)
• For all logs find the following sequence:
Node X - timestamp x (start time)
key A
key B
key C
Node Y - timestamp y (end time)
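Steps 1 through 4 can be sketched in plain Python as follows. The talk used Spark and Spark SQL, so this is a single-machine stand-in, not the presenters' code. The event layout, the function names, and the choice to process one user's events at a time are assumptions. The deck does not define the compact key format; the reading used here (first key, count of intermediate keys, last key) is a guess.

```python
from itertools import groupby
from typing import Iterable, List, Optional, Tuple

Event = Tuple[int, str, str]   # (timestamp in microseconds, type, value); assumed layout
Transition = Tuple[str, str, Tuple[str, ...], float]  # (from, to, keys, seconds)

NODE_TYPES = {"UI", "Watch"}
KEPT_TYPES = NODE_TYPES | {"Key"}


def normal_format(keys: Iterable[str]) -> str:
    """Collapse runs of the same key, e.g. SELECT & UP x 2 & GUIDE & SELECT."""
    parts = []
    for key, run in groupby(keys):
        n = len(list(run))
        parts.append(key if n == 1 else f"{key} x {n}")
    return " & ".join(parts)


def compact_format(keys: Iterable[str]) -> str:
    """Guessed compact form: first key, number of keys in between, last key."""
    keys = list(keys)
    if len(keys) <= 2:
        return " & ".join(keys)
    return f"{keys[0]} & {len(keys) - 2} KEYS & {keys[-1]}"


def transitions(events: Iterable[Event], max_seconds: float = 300.0) -> List[Transition]:
    """Emit node-to-node transitions for one user's events."""
    out: List[Transition] = []
    start: Optional[Event] = None
    keys: List[str] = []
    # Step 1 (filter) and step 2 (order by timestamp).
    for ts, etype, value in sorted(e for e in events if e[1] in KEPT_TYPES):
        if etype == "Key":
            keys.append(value)
            continue
        # A UI or Watch event closes the pending transition, if there is one.
        if start is not None:
            elapsed = (ts - start[0]) / 1_000_000
            if elapsed < max_seconds:          # step 4: max time, e.g. 5 min
                out.append((start[2], value, tuple(keys), elapsed))
        start, keys = (ts, etype, value), []
    return out


# Hypothetical timestamps; event names follow the filtered-log slide.
demo = [
    (0, "Watch", "LIVE MOVIE"),
    (1_000_000, "Key", "HOME"),
    (1_500_000, "UI", "TIVO HOME"),
    (2_000_000, "Key", "DOWN"),
    (2_200_000, "Key", "SELECT"),
    (2_300_000, "Trick3", "PLAY"),
    (2_400_000, "Key", "PLAY"),
    (4_000_000, "Watch", "REC SHOW"),
]
result = transitions(demo)
```

Keys pressed before the first node are dropped, since there is no start node to attach them to. A transition slower than `max_seconds` is skipped, but its end node still becomes the start of the next candidate transition.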
23. Algorithm Overview (2 of 2)
5. For each unique node-edge-node calculate:
1. Average transition time
2. Number of transitions
3. Number of unique transitions
4. Number of keys pressed
5. Key sequence (normal or compact)
6. Export results to CSV files
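Steps 5 and 6 amount to a group-by followed by a CSV export. The sketch below uses the Python standard library in place of Spark SQL; the row layout, column names, and sample rows are hypothetical. It also reads "number of unique transitions" as the number of distinct users on a path, which is an interpretation based on the edge attributes listed earlier in the deck.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str, Tuple[str, ...], float]  # (user, from, to, keys, seconds)

FIELDS = ["from", "to", "avg_time_s", "transitions",
          "unique_users", "keys_pressed", "key_sequence"]


def aggregate(rows: Iterable[Row]) -> List[Dict[str, object]]:
    """Summarize each unique node-edge-node combination."""
    groups = defaultdict(list)
    for user, src, dst, keys, seconds in rows:
        groups[(src, keys, dst)].append((user, seconds))
    records = []
    for (src, keys, dst), hits in sorted(groups.items()):
        records.append({
            "from": src,
            "to": dst,
            "avg_time_s": sum(s for _, s in hits) / len(hits),
            "transitions": len(hits),
            "unique_users": len({u for u, _ in hits}),
            "keys_pressed": len(keys),
            "key_sequence": "-".join(keys),
        })
    return records


def to_csv(records: List[Dict[str, object]]) -> str:
    """Render the aggregated records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical transitions for two users.
rows = [
    ("u1", "TIVO HOME", "REC SHOW", ("DOWN", "SELECT", "PLAY"), 2.0),
    ("u1", "TIVO HOME", "REC SHOW", ("DOWN", "SELECT", "PLAY"), 4.0),
    ("u2", "TIVO HOME", "REC SHOW", ("DOWN", "SELECT", "PLAY"), 3.0),
    ("u2", "LIVE MOVIE", "TIVO HOME", ("HOME",), 1.0),
]
records = aggregate(rows)
csv_text = to_csv(records)
```

Grouping on the key sequence as well as the two nodes keeps different key paths between the same pair of screens as separate edges.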
24. DEMO
What is the most popular path people take to get to content?
live vs. recorded
What percent of total paths are most popular?
What path is most popular? Overall? Unique?
What app is most popular?
What percent of total paths involve the Guide screen?
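The demo ran these questions as queries against the Neo4j graph. As a rough illustration of what they compute, the sketch below answers two of them over aggregated path records in plain Python. The records, counts, screen names, and function names are invented for the example and are not results from the talk.

```python
from typing import Dict, List

# Hypothetical aggregated paths; counts are illustrative only.
paths: List[Dict[str, object]] = [
    {"from": "TIVO HOME", "to": "REC SHOW", "transitions": 50, "unique_users": 27},
    {"from": "GUIDE", "to": "LIVE SHOW", "transitions": 30, "unique_users": 29},
    {"from": "LIVE SHOW", "to": "GUIDE", "transitions": 20, "unique_users": 10},
]


def most_popular(paths: List[Dict[str, object]], by: str = "transitions") -> Dict[str, object]:
    """Return the path with the highest count: overall or by unique users."""
    return max(paths, key=lambda p: p[by])


def share_involving(paths: List[Dict[str, object]], screen: str) -> float:
    """Percent of all transitions that start or end at the given screen."""
    total = sum(p["transitions"] for p in paths)
    hits = sum(p["transitions"] for p in paths if screen in (p["from"], p["to"]))
    return 100.0 * hits / total


top_overall = most_popular(paths)
top_unique = most_popular(paths, by="unique_users")
guide_share = share_involving(paths, "GUIDE")
```

Ranking by total transitions and by unique users can give different answers, which is why the demo asks for both.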
25. Business Advantages
Measure KPIs for time to content and content discovery
Optimize KPIs (understanding user behavior that impacts the KPIs)
Enhance A/B Testing by helping to answer “why?”
Simplify user experience across products
Increase engagement with new content
Understand how features are used together, not only as mutually exclusive experiences
26. Future Work
Deploy to production -- multi-day queries
Add relationships and nodes for feature usage
Classify paths (“discovery” or “known destination”)
Exploratory analysis
What is “Path Analysis”?
Analysis that answers questions around the navigational paths users take to get to or from defined start and/or end points
Expensive C-code
No flexibility in “click path” definition (make decision on processing, constrain number of “paths,” etc)
Why Neo4j
Team had exposure to the product before
Mature product
Expressive graph query language: Cypher
Analysts instead of full-blown data scientists/statisticians to run queries
What is the most popular path people take to get to content?
live vs. recorded content
What percent of total paths are most popular?
List top 5 paths with average time and percentage
What percent of total users are most popular?
List top 5 paths with average time and percentage
In-degree connectivity.
Evaluate whether consumers view programs differently depending on how they navigate to each program.
Examine viewing/navigation options:
Via program guide
Via Season Pass/recorded programs
By tuning directly to a station from set off or channel change