Insights into Customer Behavior
from Clickstream Data
Ronald J. Nowling
Red Hat, Inc.
rnowling@redhat.com
http://rnowling.github.io/
Who Am I?
•  Software Engineer at Red Hat
•  Data Science Team, Emerging Technologies
–  Evaluate solutions in open-source Big Data space
–  Ensure software works for Red Hat customers
–  Promote data science internally through consulting projects
•  Apache Bigtop PMC
  
Clickstream Data
61 million page views
125,000 registered users
500,000 pages
125,000 knowledgebase articles
  
Potential Applications
•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics by content writers
  
Strip Formatting → Clean Words → Vectorize → Cluster
What are the different types of kernel packages in Red Hat
Enterprise Linux?
=============================================================
Issue
------
What are the different types of kernel packages in Red Hat
Enterprise Linux?
Environment
---------------
Red Hat Enterprise Linux
Resolution
------------
Red Hat Enterprise Linux contains the following kernel
packages:
  
What are the different types of kernel packages in Red Hat
Enterprise Linux
Issue
What are the different types of kernel packages in Red Hat
Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contains the following kernel
packages some may not apply to your architecture and not all
are available in all major releases kernel contains the
kernel and following key features
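
A minimal sketch of the Strip Formatting step, assuming the raw articles use Markdown-style underlines and ordinary punctuation as in the example above. The function name `strip_formatting` is hypothetical, and this is one simple way to get comparable output, not necessarily the method used in the talk.

```python
import re


def strip_formatting(text: str) -> str:
    """Keep only runs of letters and digits, joined by single spaces."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return " ".join(tokens)


raw = "Issue\n------\nWhat are the different types of kernel packages?"
print(strip_formatting(raw))
```

Underline rows, question marks, and colons all disappear because they contain no alphanumeric characters; capitalization is left alone at this stage.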
  
What are the different type of kernel package in Red Hat
Enterprise Linux
Issue
What are the different type of kernel package in Red Hat
Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contain the follow kernel
package some may not apply to your architecture and not all
are available in all major release kernel contain the
kernel and follow key feature
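
The word-level normalization shown above ("types" to "type", "following" to "follow") can be imitated with a toy suffix stripper. This is only a sketch: `naive_stem` and its rules are assumptions made for illustration, and a real pipeline would more likely use an established stemmer or lemmatizer.

```python
def naive_stem(word: str) -> str:
    """Toy suffix stripper: drops a trailing 'ing' or a plural/verb 's'."""
    lower = word.lower()
    if lower.endswith("ing") and len(word) > 5:
        return word[:-3]
    # Leave words such as "class", "status", or "analysis" untouched.
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")) and len(word) > 3:
        return word[:-1]
    return word


words = "Red Hat Enterprise Linux contains the following kernel packages".split()
stemmed = [naive_stem(w) for w in words]
print(" ".join(stemmed))
```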
  
different type kernel package Red Hat
Enterprise Linux
Issue
different type kernel package Red Hat
Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contain kernel
package apply architecture
available major release kernel contain
kernel follow key feature
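
Stop-word removal, which produces text like the above, can be sketched as a simple set lookup. The `STOP_WORDS` set here is a small hypothetical list chosen to match the example; the list actually used in the talk is not given.

```python
# Hypothetical stop-word list, just large enough for the example sentence.
STOP_WORDS = {"what", "are", "the", "of", "in", "some", "may", "not", "to", "your", "and", "all"}


def remove_stop_words(words):
    """Drop words whose lowercase form is in STOP_WORDS."""
    return [w for w in words if w.lower() not in STOP_WORDS]


title = "What are the different type of kernel package in Red Hat Enterprise Linux".split()
print(" ".join(remove_stop_words(title)))
```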
  
kernel: 5
red: 4
hat: 4
enterprise: 4
linux: 4
package: 3
contain: 3
different: 2
type: 2
intel: 2
environment: 1
resolution: 1
follow: 1
system: 1
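
Word counts like these are a bag-of-words vector. A minimal sketch with the standard library follows; `vectorize` and `to_vector` are hypothetical helper names, and lowercasing is an assumption based on the lowercase terms in the counts above.

```python
from collections import Counter


def vectorize(words):
    """Count lowercase term occurrences in one cleaned document."""
    return Counter(w.lower() for w in words)


def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order (0 for absent terms)."""
    return [counts[term] for term in vocabulary]


counts = vectorize("kernel package Red Hat kernel".split())
vector = to_vector(counts, ["kernel", "package", "red", "hat", "linux"])
print(vector)
```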
  
kernel: 5
red: 4
hat: 4
enterprise: 4
linux: 4
package: 3
contain: 3
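
The shorter list above keeps only the terms that occur at least three times. A sketch of that kind of filter follows, assuming a simple minimum-count threshold; the threshold of 3 is inferred from the example, and `drop_rare_terms` is a hypothetical name.

```python
def drop_rare_terms(counts, min_count=3):
    """Keep only the terms seen at least `min_count` times."""
    return {term: n for term, n in counts.items() if n >= min_count}


# Counts copied from the example article above.
article_counts = {
    "kernel": 5, "red": 4, "hat": 4, "enterprise": 4, "linux": 4,
    "package": 3, "contain": 3, "different": 2, "type": 2, "intel": 2,
    "environment": 1, "resolution": 1, "follow": 1, "system": 1,
}
print(drop_rare_terms(article_counts))
```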
  
Topics
openshift gear cartridge online
node broker
vm rhev virtualization disk
glusterfs storage volume brick rhs
glusterd node client mount geo
rhel support driver hp hardware
version firmware card intel
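
Word lists like these can be read off cluster centroids by taking the highest-weighted terms. The sketch below assumes the centroids are already available from a clustering step; the vocabulary, centroid values, and helper names are made up for illustration.

```python
def top_terms(centroid, vocabulary, n=3):
    """Return the n terms with the largest weights in a centroid."""
    ranked = sorted(zip(vocabulary, centroid), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


def nearest_topic(vector, centroids):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: squared_distance(vector, centroids[i]))


vocabulary = ["openshift", "gear", "glusterfs", "brick"]
centroids = [[2.5, 2.5, 0.0, 0.0], [0.0, 0.0, 3.5, 1.5]]  # hypothetical values
topics = [top_terms(c, vocabulary, n=2) for c in centroids]
print(topics)
```

`nearest_topic` shows how an individual article vector could then be assigned to one of the topics.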
  
Topic Article Counts
  
Clickstream Processing
[Pipeline diagram. Three parallel branches: Raw Daily Page Views, Parse, Clean & Filter. Remaining stages: Accounts, Aggregate, Project onto Topics, Topic View Counts.]
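
One plausible reading of this pipeline, sketched in plain Python rather than Spark. The record layout (account and URL separated by a tab), the `ARTICLE_TOPIC` mapping, and all function names are assumptions made for illustration; the talk does not give them.

```python
from collections import Counter, defaultdict

# Hypothetical article-to-topic assignments, e.g. from the clustering step.
ARTICLE_TOPIC = {"kb-1": "gluster", "kb-2": "openshift"}


def parse(line):
    """Split one raw page-view line into (account, url)."""
    account, url = line.rstrip("\n").split("\t")
    return account, url


def clean_and_filter(records):
    """Keep only views of pages that map to a known article."""
    for account, url in records:
        article = url.rsplit("/", 1)[-1]
        if article in ARTICLE_TOPIC:
            yield account, article


def topic_view_counts(daily_files):
    """Aggregate views per account and project them onto topics."""
    counts = defaultdict(Counter)
    for lines in daily_files:
        records = (parse(line) for line in lines)
        for account, article in clean_and_filter(records):
            counts[account][ARTICLE_TOPIC[article]] += 1
    return counts


day_1 = ["acme\t/kb/kb-1", "acme\t/home"]
day_2 = ["acme\t/kb/kb-1", "initech\t/kb/kb-2"]
profiles = topic_view_counts([day_1, day_2])
print(dict(profiles))
```

Each daily file is parsed and filtered independently, which is what makes the three branches in the diagram easy to run in parallel.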
  
Customer Profiles
•  Dominant topics
– JBoss
– Red Hat Enterprise Virtualization
– Hardware support
– Gluster
– Booting into rescue mode
– Packages
  
Customer Profiles
•  Supporting topics
– Logging
– LDAP
– Samba
– High resource usage
– File systems / LVM / block devices
– Networking
  
Customer Profiles
•  JBoss and RHEV appear in combination with a number of other products
•  Some products only appear by themselves with supporting topics (logging, networking, filesystems)
– OpenShift
– Gluster
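
A sketch of how such combinations might be counted, assuming each account profile is a mapping from topic to view count. The 25% share threshold, the helper names, and the sample numbers are all hypothetical; they are not results from the talk.

```python
from collections import Counter
from itertools import combinations


def dominant_topics(topic_counts, min_share=0.25):
    """Topics that account for at least `min_share` of an account's views."""
    total = sum(topic_counts.values())
    return sorted(t for t, n in topic_counts.items() if total and n / total >= min_share)


def co_occurrences(profiles, min_share=0.25):
    """Count how often two dominant topics appear in the same account."""
    pairs = Counter()
    for counts in profiles.values():
        pairs.update(combinations(dominant_topics(counts, min_share), 2))
    return pairs


# Made-up view counts, for illustration only.
sample_profiles = {
    "account-a": {"JBoss": 40, "RHEV": 35, "Logging": 5},
    "account-b": {"Gluster": 50, "Networking": 10},
}
print(co_occurrences(sample_profiles))
```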
  
Topic Enrichments
  
Malformed TSV Files
•  Gzip files need to be read sequentially
•  Tab-separated, no quoting (in theory!)
•  Escaped tabs and newlines within records
–  E.g., \n or \t
•  Improperly escaped tabs and newlines
–  E.g., \\t vs \t
•  Extraneous unmatched quote marks
–  E.g., ‘some_user
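
A minimal sketch of reading such files, assuming that real tab characters separate fields and that escaped tabs and newlines appear as a backslash followed by `t` or `n`. The function names are hypothetical. Quote marks are treated as ordinary data, so an unmatched quote cannot swallow the rest of the record.

```python
import gzip

ESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}


def unescape(field):
    """Turn backslash escapes into the characters they stand for."""
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            # Unknown escapes are kept verbatim rather than dropped.
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def split_record(line):
    """Split on literal tabs only, then unescape each field."""
    return [unescape(f) for f in line.rstrip("\n").split("\t")]


def read_records(path):
    """Stream records from a gzip file; gzip can only be read front to back."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield split_record(line)
```

Because the reader is a generator, a large file is processed one record at a time instead of being loaded into memory.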
  
Lessons Learned
•  Consider custom Hadoop input formats for tricky file formats
•  Verify everything – what works in general may not work for you
– Stemming
– Filtering most frequent words
– K-Means vs LDA
  
Lessons Learned
•  K-Means
– Improve accuracy: multiple runs, more iterations
•  Watch out for memory leaks
– Un-persist cached RDDs
– Un-persist broadcast variables
•  Parquet for performance
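
The "multiple runs" advice for K-Means can be sketched in plain Python: run the algorithm from several random starting points and keep the result with the lowest within-cluster cost. This is an illustration only, not the Spark MLlib implementation; all function names and parameters are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans_once(points, k, iterations, seed):
    """One run of Lloyd's algorithm from a seeded random initialization."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def best_of_runs(points, k, runs=5, iterations=20):
    """Repeat K-Means with different seeds and keep the lowest-cost result."""
    candidates = [kmeans_once(points, k, iterations, seed) for seed in range(runs)]
    return min(candidates, key=lambda c: cost(points, c))


points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
best = best_of_runs(points, k=3)
print(cost(points, best))
```

More runs and more iterations both cost time, so the two knobs trade accuracy against runtime.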
  
Potential Applications
•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics for content writers
  
Resources
http://rnowling.github.io/
  
QUESTIONS
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Dernier (20)

(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Insights into Customer Behavior from Clickstream Data by Ronald Nowling

  • 1. Insights into Customer Behavior from Clickstream Data Ronald J. Nowling Red Hat, Inc. rnowling@redhat.com http://rnowling.github.io/
  • 2. Who Am I?
      • Software Engineer at Red Hat
      • Data Science Team, Emerging Technologies
          – Evaluate solutions in open-source Big Data space
          – Ensure software works for Red Hat customers
          – Promote data science internally through consulting projects
      • Apache Bigtop PMC
  • 4-7. Clickstream Data: 61 million page views; 125,000 registered users; 500,000 pages; 125,000 knowledgebase articles
  • 8. Potential Applications
      • Build customer profiles to aid sales teams
      • Recommendation system for knowledgebase
      • Improve customer portal search
      • Guide selection of new knowledgebase topics by content writers
  • 9. Strip Formatting → Clean Words → Vectorize → Cluster
      What are the different types of kernel packages in Red Hat Enterprise Linux?
      =============================================================
      Issue
      ------
      What are the different types of kernel packages in Red Hat Enterprise Linux?
      Environment
      ---------------
      Red Hat Enterprise Linux
      Resolution
      ------------
      Red Hat Enterprise Linux contains the following kernel packages:
  • 10-11. What are the different types of kernel packages in Red Hat Enterprise Linux
      Issue
      What are the different types of kernel packages in Red Hat Enterprise Linux
      Environment
      Red Hat Enterprise Linux
      Resolution
      Red Hat Enterprise Linux contains the following kernel packages some may not apply to your architecture and not all are available in all major releases kernel contains the kernel and following key features
  • 12-13. What are the different type of kernel package in Red Hat Enterprise Linux
      Issue
      What are the different type of kernel package in Red Hat Enterprise Linux
      Environment
      Red Hat Enterprise Linux
      Resolution
      Red Hat Enterprise Linux contain the follow kernel package some may not apply to your architecture and not all are available in all major release kernel contain the kernel and follow key feature
  • 14. different type kernel package Red Hat Enterprise Linux
      Issue
      different type kernel package Red Hat Enterprise Linux
      Environment
      Red Hat Enterprise Linux
      Resolution
      Red Hat Enterprise Linux contain kernel package apply architecture available major release kernel contain kernel follow key feature
  • 15-16. kernel: 5, red: 4, hat: 4, enterprise: 4, linux: 4, package: 3, contain: 3, different: 2, type: 2, intel: 2, environment: 1, resolution: 1, follow: 1, system: 1
  • 17. kernel: 5, red: 4, hat: 4, enterprise: 4, linux: 4, package: 3, contain: 3
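The first three stages shown on slides 9-17 (strip formatting, clean words, vectorize) can be sketched in plain Python. This is a minimal illustration, not the code used in the talk: the function names, the `STOP_WORDS` set, and the `naive_stem` suffix rule are all hypothetical stand-ins (a real pipeline would use a proper stemmer and stop-word list), and the clustering stage is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"what", "are", "the", "of", "in", "some", "may",
              "not", "to", "your", "and", "all"}


def strip_formatting(text):
    """Replace punctuation and markup underlines (=====, -----) with spaces."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    if len(word) > 5 and word.endswith("ing"):
        return word[:-3]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_words(text):
    """Lower-case, stem, then drop stop words (same order as the slides)."""
    words = [naive_stem(w) for w in text.lower().split()]
    return [w for w in words if w not in STOP_WORDS]


def vectorize(words):
    """Bag-of-words vector: a mapping from word to occurrence count."""
    return Counter(words)


article = (
    "What are the different types of kernel packages in Red Hat Enterprise Linux?\n"
    "=====\n"
    "Issue\n"
    "------\n"
    "Red Hat Enterprise Linux contains the following kernel packages:\n"
)
counts = vectorize(clean_words(strip_formatting(article)))
for word, n in counts.most_common(4):
    print(f"{word}: {n}")
```

Dropping low-count entries from `counts`, as slide 17 appears to do, would be one more filtering step before clustering.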
  • 20-24. Topics
      openshift gear cartridge online node broker vm rhev virtualization disk glusterfs storage volume brick rhs glusterd node client mount geo rhel support driver hp hardware version firmware card intel
  • 26-30. Clickstream Processing
      [Diagram: each Raw Daily Page Views file passes through Parse and then Clean & Filter; the remaining diagram labels are Accounts, Aggregate, Project onto Topics, and Topic View Counts]
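One way to read the Clickstream Processing diagram is sketched below in plain Python. Everything concrete here is an assumption made for illustration: the two-column `account<TAB>page` record layout, the `PAGE_TOPIC` lookup table, and the order in which aggregation and topic projection happen are not stated on the slides.

```python
from collections import Counter, defaultdict

# Hypothetical page -> topic assignment, e.g. produced by clustering articles.
PAGE_TOPIC = {"/kb/101": "gluster", "/kb/102": "rhev", "/kb/103": "gluster"}


def parse(line):
    """Split one assumed 'account<TAB>page' record; return None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def clean_and_filter(records):
    """Keep only well-formed records whose page has a known topic."""
    return [r for r in records if r is not None and r[1] in PAGE_TOPIC]


def topic_view_counts(daily_files):
    """Aggregate per account and project page views onto topics."""
    counts = defaultdict(Counter)
    for lines in daily_files:
        for account, page in clean_and_filter(parse(l) for l in lines):
            counts[account][PAGE_TOPIC[page]] += 1
    return counts


day1 = ["acme\t/kb/101\n", "acme\t/kb/102\n", "broken line\n"]
day2 = ["acme\t/kb/103\n", "globex\t/kb/999\n"]
profiles = topic_view_counts([day1, day2])
print(dict(profiles["acme"]))
```

Each daily file is parsed and filtered independently, which is what lets the per-day stages run in parallel before the counts are merged per account.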
  • 31. Customer Profiles
      • Dominant topics
          – JBoss
          – Red Hat Enterprise Virtualization
          – Hardware support
          – Gluster
          – Booting into rescue mode
          – Packages
  • 32. Customer Profiles
      • Supporting topics
          – Logging
          – LDAP
          – Samba
          – High resource usage
          – File systems / LVM / block devices
          – Networking
  • 33. Customer Profiles
      • JBoss and RHEV appear in combination with a number of other products
      • Some products only appear by themselves with supporting topics (logging, networking, filesystems)
          – OpenShift
          – Gluster
  • 35. Malformed TSV Files
      • Gzip files need to be read sequentially
      • Tab-separated, no quoting (in theory!)
      • Escaped tabs and newlines within records
          – E.g., \n or \t
      • Improperly escaped tabs and newlines
          – E.g., \\t vs \t
      • Extraneous unmatched quote marks
          – E.g., ‘some_user
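A minimal sketch of reading such a file with the Python standard library follows. It assumes fields are tab-separated, that quote characters carry no meaning, and that tabs and newlines inside a field are written as the two-character sequences `\t` and `\n`; the helper names `unescape` and `read_tsv_gz` are hypothetical, and records containing raw (unescaped) newlines are not handled.

```python
import csv
import gzip
import io


def unescape(field):
    """Turn literal backslash sequences (\\t, \\n, \\\\) into real characters."""
    mapping = {"t": "\t", "n": "\n", "\\": "\\"}
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            # Unknown escapes are kept verbatim rather than guessed at.
            out.append(mapping.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def read_tsv_gz(source):
    """Yield rows from a gzip-compressed TSV path or binary file object."""
    # Gzip streams are decompressed sequentially, front to back.
    with gzip.open(source, "rt", encoding="utf-8", newline="") as fh:
        # QUOTE_NONE makes stray quote marks ordinary characters.
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield [unescape(f) for f in row]


sample = gzip.compress("a\\tb\t'some_user\tline1\\nline2\n".encode("utf-8"))
rows = list(read_tsv_gz(io.BytesIO(sample)))
print(rows)
```

Turning quoting off is the design choice that matters here: with the default CSV dialect, an unmatched quote mark can swallow the rest of the file into one field.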
  • 36. Lessons Learned
      • Consider custom Hadoop input formats for tricky file formats
      • Verify everything: what works in general may not work for you
          – Stemming
          – Filtering most frequent words
          – K-Means vs LDA
  • 37. Lessons Learned
      • K-Means
          – Improve accuracy: multiple runs, more iterations
      • Watch out for memory leaks
          – Un-persist cached RDDs
          – Un-persist broadcast variables
      • Parquet for performance
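The "multiple runs" advice for K-Means can be illustrated with a small, library-independent sketch: run Lloyd's algorithm from several random initializations and keep the run with the lowest within-cluster sum of squared distances. The function names, the toy data, and the defaults for `runs` and `max_iter` are assumptions made for this example, not settings from the talk.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter, rng):
    """One run of Lloyd's algorithm from a random start; returns (centers, cost)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


def best_of_runs(points, k, runs=10, max_iter=50, seed=0):
    """Repeat K-Means with different starts and keep the lowest-cost result."""
    rng = random.Random(seed)
    results = (kmeans(points, k, max_iter, rng) for _ in range(runs))
    return min(results, key=lambda r: r[1])


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, cost = best_of_runs(points, k=2)
print(sorted(centers), cost)
```

K-Means only finds a local optimum that depends on the starting centers, so comparing the cost across runs is a cheap guard against an unlucky initialization.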
  • 38. Potential Applications
      • Build customer profiles to aid sales teams
      • Recommendation system for knowledgebase
      • Improve customer portal search
      • Guide selection of new knowledgebase topics for content writers