Building a Graph Database in Neo4j with Spark & Spark SQL to Gain New Insights from Log Data
Robert Hryniewicz – Data Scientist/Evangelist, Hortonworks
Rachel Poulsen – Data Science Director, TiVo
Overview
• Overview of business problem
• Introduction to TiVo and TiVo data
• Challenges with using data to optimize UI navigation
• Solution
• Demo
• Next Steps
• Questions
TiVo
Background and context
• TiVo is a discovery platform for integrated entertainment
• Multiple ways to find content
[Video: TiVo Roamio demo]
• TiVo collects ~2 million logs a day from its boxes
• User action events, TiVo action events, inventory events
• These events are “memory-less”: each one carries no record of what happened before it
Motivation
TiVo Business Initiatives
• Help users get to the content they want faster
• Help users discover new content more easily
• Is feature X important to discovering and getting to content? (e.g., is the guide still used to find content?)
Challenge
• Measuring a KPI for these initiatives in a log stream that is “memory-less”
• Identifying events or patterns of events that impact the KPIs
Data Objective – “Path Analysis”
• Analysis that answers questions about the navigational paths users take to get to or from defined start and/or end points
Architecture Challenges
• Traditional data platform that was sample-based and SQL-based
• Relational databases
Technical Challenges
• Relational database challenges
  • Little flexibility in “click path” definition
  • Decisions about defining “paths” are made during the processing step
  • Many business assumptions have to be made with little insight
Solution
• Graph Database (Neo4j)
• Relationships are first-class citizens
• Simple abstractions
• Enable sophisticated models
• “Path Analysis”
Prototype
One Day Graph Info
• Edges: 57K relationships
• Nodes:
  • 135 UI or “screen” nodes
  • 12 “watch content” nodes
Size reduction (2K times)
• 70 GB log data → 35 MB Neo4j DB (all nodes & edges)
• 1 day → Oct 1, 2015
[Figure: 70+ GB of logs reduced to UI nodes, Watch nodes, and edges in a 35 MB graph database]
Screens, Transitions, and Content
• Screen events
• Remote button press events
• Watch content events
Sample Graph
[Figure: sample graph with nodes “Live Movie”, “My Shows”, and “TiVo Central (Home)” connected by three “Switched to” edges]
Architecture Overview
What’s captured in the graph?
• Node (UI)
  • Name
  • Timestamp
• Node (Watch)
  • Type, e.g. Recorded
  • Genre, e.g. TV Show
  • Timestamp
• Edge
  • Average time
  • Total number of keys pressed
  • Key sequence, e.g. Home → Up/Down → Select/Play
  • Total number of times path taken
  • Number of unique users taking this path
What’s captured in the graph? (example)
[Figure: node “TiVo Central” (1/1/2016) connected to node “Live Movie” (1/1/2016) by an edge with 0.4 s average time, 3 keys pressed, key sequence Home-Down-Select, 50 times path used, 27 unique users]
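The node and edge properties listed above can be sketched as plain data classes. This is only an illustration: the class and field names are hypothetical, the edge direction (TiVo Central to Live Movie) is an assumption read off the slide layout, and the values are the ones shown on the example slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UINode:
    name: str        # screen name, e.g. "TiVo Central"
    timestamp: str   # date shown on the slide


@dataclass(frozen=True)
class WatchNode:
    type: str        # e.g. "Live" or "Recorded"
    genre: str       # e.g. "Movie" or "TV Show"
    timestamp: str


@dataclass(frozen=True)
class Edge:
    source: object       # UINode or WatchNode where the path starts
    target: object       # UINode or WatchNode where the path ends
    avg_time_s: float    # average transition time in seconds
    keys_pressed: int    # total number of keys pressed on this path
    key_sequence: str    # e.g. "Home-Down-Select"
    times_taken: int     # total number of times the path was taken
    unique_users: int    # number of distinct users who took the path


# Values from the example slide; the direction is assumed.
home = UINode(name="TiVo Central", timestamp="1/1/2016")
live_movie = WatchNode(type="Live", genre="Movie", timestamp="1/1/2016")
edge = Edge(home, live_movie, 0.4, 3, "Home-Down-Select", 50, 27)
```

Keeping the aggregates on the edge rather than on the nodes matches the slide's point that relationships are first-class citizens in a graph database.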
Raw Log File example
…
1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...
1444809715812909|Key|HOME
1444880816123454|UI|TivoCentralScreen
1444809716234553|Key|DOWN
1444809716354363|Key|SELECT
1444809716518701|Trick3|PLAY|116|1|100|-1
1444809719888072|Key|PLAY
1444809719889072|Trick3|PLAY|119|1|100|-1
1444809726966880|Watch|rec|WFXTDT|SH|508|...
…
Filtered Log
...
Watch: LIVE MOVIE
Key: HOME
UI: TIVO HOME
Key: DOWN
Key: SELECT
Key: PLAY
Watch: REC SHOW
…
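[Figure annotations: each Watch and UI line is a node; the Key lines between two nodes form an edge. All events occur on the same day.]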
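The step from the raw log to the filtered log can be sketched in plain Python. This is a minimal sketch, not the Spark job used in the talk: the function names are hypothetical, the field layout (timestamp, event type, remaining fields) is inferred from the sample lines, and file order is kept rather than re-sorting by timestamp. Mapping codes such as `MV` or `SH` to display labels is left out because the code table is not shown in the slides.

```python
# Event types kept by the filter; everything else (e.g. Trick3) is dropped.
KEEP = {"Watch", "Key", "UI"}

# Sample lines copied from the raw log slide (truncated fields left as-is).
RAW_LINES = [
    "1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...",
    "1444809715812909|Key|HOME",
    "1444880816123454|UI|TivoCentralScreen",
    "1444809716234553|Key|DOWN",
    "1444809716354363|Key|SELECT",
    "1444809716518701|Trick3|PLAY|116|1|100|-1",
    "1444809719888072|Key|PLAY",
    "1444809719889072|Trick3|PLAY|119|1|100|-1",
    "1444809726966880|Watch|rec|WFXTDT|SH|508|...",
]


def parse_line(line):
    """Split one pipe-delimited log line into timestamp, kind, and arguments."""
    fields = line.strip().split("|")
    return {"ts": int(fields[0]), "kind": fields[1], "args": fields[2:]}


def filter_events(lines):
    """Keep only Watch, Key, and UI events, preserving the input order."""
    events = (parse_line(line) for line in lines if line.strip())
    return [event for event in events if event["kind"] in KEEP]


filtered = filter_events(RAW_LINES)
for event in filtered:
    print(f'{event["kind"]}: {event["args"][0]}')
```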
Algorithm Overview (1 of 2)
1. Filter for desired events
• Remove non-Screen, non-Watch, non-Key events
2. Session-ize and order logs to reflect Screen/Watch/Edge events
3. Define display for Key Press events - two formats
• Normal: SELECT & UP x 2 & GUIDE & SELECT
• Compact: TIVO & 9 KEYS & SELECT
4. Generate an Edge if transition < max time set by stakeholders (e.g. 5 min)
• For all logs find the following sequence:
Node X - timestamp x (start time)
key A
key B
key C
Node Y - timestamp y (end time)
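Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the presenters' code: the function names are hypothetical, timestamps are assumed to be in microseconds, the sample events use made-up timestamps, and the compact format is assumed to mean "first key, count of keys in between, last key".

```python
from itertools import groupby

# Example threshold from the slide: 5 minutes, expressed in microseconds.
MAX_TRANSITION_US = 5 * 60 * 1_000_000


def normal_format(keys):
    """Collapse consecutive repeats, e.g. UP, UP -> 'UP x 2'."""
    parts = []
    for key, run in groupby(keys):
        count = len(list(run))
        parts.append(key if count == 1 else f"{key} x {count}")
    return " & ".join(parts)


def compact_format(keys):
    """Show only the first and last key plus a count of those in between."""
    if len(keys) <= 2:
        return " & ".join(keys)
    return f"{keys[0]} & {len(keys) - 2} KEYS & {keys[-1]}"


def build_edges(events, max_gap=MAX_TRANSITION_US):
    """Emit one edge per Node X -> keys -> Node Y sequence within max_gap."""
    edges = []
    start = None  # (label, timestamp) of the most recent node
    keys = []     # keys pressed since that node
    for kind, value, ts in events:
        if kind == "Key":
            keys.append(value)
            continue
        # Any non-Key event (Watch or UI) is treated as a node.
        if start is not None and ts - start[1] < max_gap:
            edges.append({
                "src": start[0],
                "dst": value,
                "keys": list(keys),
                "time_us": ts - start[1],
            })
        start = (value, ts)
        keys = []
    return edges


# Hypothetical session mirroring the filtered log slide.
events = [
    ("Watch", "LIVE MOVIE", 0),
    ("Key", "HOME", 100_000),
    ("UI", "TIVO HOME", 400_000),
    ("Key", "DOWN", 500_000),
    ("Key", "SELECT", 600_000),
    ("Key", "PLAY", 4_000_000),
    ("Watch", "REC SHOW", 11_000_000),
]
edges = build_edges(events)
for edge in edges:
    print(edge["src"], "->", edge["dst"], "via", normal_format(edge["keys"]))
```

Because the threshold is a parameter, stakeholders can change the maximum transition time without touching the edge-building logic.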
Algorithm Overview (2 of 2)
5. For each unique node-edge-node calculate:
1. Average transition time
2. Number of transitions
3. Number of unique transitions
4. Number of keys pressed
5. Key sequence (normal or compact)
6. Export results to CSV files
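Steps 5 and 6 amount to a group-by over individual transitions followed by a CSV dump. The sketch below uses plain Python in place of Spark SQL; the function names, field names, and sample transitions are hypothetical, and "number of unique transitions" is interpreted here as the number of distinct users, following the earlier "unique number of users taking this path" slide.

```python
import csv
import io
from collections import defaultdict


def aggregate(transitions):
    """Group transitions by (source, key sequence, destination) and summarize."""
    groups = defaultdict(list)
    for t in transitions:
        groups[(t["src"], t["keys"], t["dst"])].append(t)

    rows = []
    for (src, keys, dst), members in sorted(groups.items()):
        rows.append({
            "src": src,
            "dst": dst,
            "key_sequence": "-".join(keys),
            "keys_pressed": len(keys),
            "avg_time_s": round(sum(m["time_s"] for m in members) / len(members), 3),
            "transitions": len(members),
            "unique_users": len({m["user"] for m in members}),
        })
    return rows


def to_csv(rows):
    """Serialize the aggregated rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical per-user transitions (keys stored as tuples so they can be grouped).
transitions = [
    {"user": "u1", "src": "TiVo Central", "dst": "Live Movie",
     "keys": ("HOME", "DOWN", "SELECT"), "time_s": 0.5},
    {"user": "u2", "src": "TiVo Central", "dst": "Live Movie",
     "keys": ("HOME", "DOWN", "SELECT"), "time_s": 0.3},
    {"user": "u1", "src": "TiVo Central", "dst": "Live Movie",
     "keys": ("HOME", "DOWN", "SELECT"), "time_s": 0.4},
    {"user": "u3", "src": "TiVo Central", "dst": "My Shows",
     "keys": ("DOWN", "SELECT"), "time_s": 1.0},
]
rows = aggregate(transitions)
csv_text = to_csv(rows)
print(csv_text)
```

CSV output of this shape, one row per node-edge-node combination, is the kind of file a graph database bulk loader can consume.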
DEMO
• What is the most popular path people take to get to content?
  • Live vs. recorded
• What percent of total paths do the most popular paths account for?
• What path is most popular? Overall? Unique?
• What app is most popular?
• What percent of total paths involve the Guide screen?
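Two of the demo questions can be illustrated over aggregated edge records. The talk's notes name Cypher as the query language for the Neo4j graph; the sketch below uses plain Python instead, with hypothetical function names and entirely made-up edge counts, only to show the shape of the calculation.

```python
# Made-up aggregated edges: source node, destination node, times the path was taken.
edges = [
    {"src": "TiVo Central", "dst": "Guide", "times_taken": 40},
    {"src": "Guide", "dst": "Live Movie", "times_taken": 30},
    {"src": "TiVo Central", "dst": "My Shows", "times_taken": 20},
    {"src": "My Shows", "dst": "Recorded Show", "times_taken": 10},
]


def most_popular_path(edge_rows):
    """Return the edge that was taken most often."""
    return max(edge_rows, key=lambda e: e["times_taken"])


def percent_involving(edge_rows, screen):
    """Percent of all path traversals that start or end at the given screen."""
    total = sum(e["times_taken"] for e in edge_rows)
    hits = sum(e["times_taken"] for e in edge_rows
               if screen in (e["src"], e["dst"]))
    return 100 * hits / total


top = most_popular_path(edges)
guide_share = percent_involving(edges, "Guide")
print(f'Most popular path: {top["src"]} -> {top["dst"]}')
print(f"Paths involving the Guide screen: {guide_share:.1f}%")
```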
Business Advantages
• Measure KPIs for time to content and content discovery
• Optimize KPIs (by understanding the user behavior that impacts them)
• Enhance A/B testing by helping to answer “why?”
• Simplify user experience across products
• Increase engagement with new content
• Understand how features are used together, not only as mutually exclusive experiences
Future Work
• Deploy to production, with multi-day queries
• Add relationships and nodes for feature usage
• Classify paths (“discovery” or “known destination”)
• Exploratory analysis
Thanks!
• @RobHryniewicz
• @Bayesbabe


Editor's Notes

  1. What is “Path Analysis”? Analysis that answers questions about the navigational paths users take to get to or from defined start and/or end points.
  2. Expensive C code. No flexibility in “click path” definition (decisions are made during processing, the number of “paths” is constrained, etc.).
  3. Why Neo4j: the team had prior exposure to the product; it is a mature product; Cypher is an expressive graph query language; analysts, rather than full-blown data scientists or statisticians, can run the queries.
  4. What is the most popular path people take to get to content (live vs. recorded content)? What percent of total paths are most popular? List the top 5 paths with average time and percentage. What percent of total users are most popular? List the top 5 paths with average time and percentage. In-degree connectivity.
  5. Evaluate whether consumers view programs differently depending on how they navigate to each program. Examine viewing/navigation options: via the program guide; via Season Pass/recorded programs; by tuning directly to a station from set off or channel change.