Case Study:
Retail In-Store Analysis with Hadoop
Nils Kübler, YMC
May 13th 2013
CC 2.0 by Franck BLAIS | http://flic.kr/p/cwVnSy
Introduction
What is the status quo? What could be possible?
Status Quo
What is the KPI in retail?
→ Revenue per square meter (revenue/m²)
How to bring in more metrics?
Possible sensors for a real store:
● customer frequency counters at doors
● the cashier system
● free WiFi access points
● video capturing
● temperature
● ...
Many of these sensors need additional hardware and software.
⇒ Let's use the free WiFi access points
What type of Questions could we ask?
● How many people visited the store? → unique visitors?
● How many visits did we have?
● What is the average visit duration?
● How many people are new vs. returning?
● ....
CC 2.0 by Ian Carroll | http://flic.kr/p/6NWoGm
Preparation
How do we answer these questions?
Traditional Data Management Approach
From a high level of abstraction the answer is simple. We need a
data management system with three pieces:
1. ingest
2. store
3. process
Blueprint for a Data Management System
with Hadoop
We take this basic architecture and replace the generic terms
by mapping them onto the Hadoop ecosystem.
With this Hadoop architecture a data scientist should be able to
answer the questions without any programming environment,
and can keep using familiar BI, analysis and reporting tools.
CC 2.0 by Perry French | http://flic.kr/p/8wDMJS
Setup
What do we need?
Ingredients
1. Two WiFi access points to simulate two different stores
2. Flume to move all log messages to HDFS
3. A 4-node CDH4 cluster
4. Pentaho Data Integration's graphical designer for data
transformation, parsing, filtering and loading into the
warehouse
5. Hive as the data warehouse system on top of Hadoop to project
structure onto the data
6. Impala for querying data from Hive in real time
7. MS Excel to visualize the results
Ingest
● Two WiFi routers with OpenWRT installed: one Buffalo and one
Fonera
● Installed 4 days before the hackathon to collect some log data
● Syslogs are collected on a central syslog server
● A Flume node collects the syslogs and stores them on HDFS
without any manual intervention (no transformation, no
filtering)
● (Flume can also be run as the syslog server itself)
Parsing, Transformation, Filtering, Load
● Raw log data needs to be transformed to CSV
● Many open-source BI tools can help with that: Palo, SpagoBI,
Pentaho, Talend
● We used Pentaho
● Design a MapReduce job for distributed transformation of
the log data with
○ a regular expression to match each line and split it into columns
○ a filter for empty lines
○ a UDF to create the CSV output and the Unix timestamp
● From this data we can easily generate a Hive schema and
store the data in our Hive data warehouse.
Sample CSV rows after transformation:
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake completed (RSN)
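The transformation step can be illustrated outside Pentaho with a small Python sketch. It is not the job built at the hackathon. The raw syslog layout is an assumption, since the slides show only the CSV output: an RFC 3339 timestamp, the host, the program name, and a hostapd message of the form "<interface>: STA <mac> <subsystem>: <text>". The function name to_csv is hypothetical. The column order follows the two sample rows above.

```python
import re
from datetime import datetime

# ASSUMPTION: raw line layout (not shown on the slides), e.g.
# 2013-01-21T11:47:47+01:00 buffalo hostapd: wlan0: STA <mac> WPA: <text>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?P<tz>[+-]\d{2}:\d{2}))\s+"
    r"(?P<host>\S+)\s+(?P<prog>[^:\s]+):\s+(?P<iface>\S+):\s+STA\s+"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+"
    r"(?P<subsys>[^:]+):\s+(?P<msg>.*)$"
)


def to_csv(line):
    """Turn one raw log line into a CSV row, or None if it should be dropped."""
    line = line.strip()
    if not line:
        return None  # filter empty lines
    m = LINE_RE.match(line)
    if m is None:
        return None  # line does not match the expected layout
    ts = datetime.fromisoformat(m["ts"])  # timezone-aware datetime
    fields = [
        int(ts.timestamp()),  # Unix timestamp (seconds since epoch, UTC)
        ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second,
        m["tz"], m["host"], m["prog"], m["iface"], m["mac"],
        m["subsys"].strip(), m["msg"].strip(),
    ]
    return ",".join(str(f) for f in fields)


if __name__ == "__main__":
    raw = ("2013-01-21T11:47:47+01:00 buffalo hostapd: wlan0: "
           "STA 10:68:3f:40:20:2d IEEE 802.1X: authorizing port")
    print(to_csv(raw))
```

A real job would also need to quote or escape commas inside the message text, which this sketch does not handle.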
Process
● Data can now be processed by either Hive or Impala
● Create an intermediate table of events such as login/logout,
together with the visit duration.
● We used Impala to query our data ad hoc and answer our
questions:
○ How many people visited the store (unique visitors)?
○ How many visits did we have?
○ What is the average visit duration?
○ How many people are new vs. returning?
● The output was then loaded into Excel to create some
graphs.
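The logic behind these questions can be sketched in plain Python as a stand-in for the Impala queries, which the slides do not show. The assumptions are: each event is a tuple (unix_timestamp, store, mac, kind), kind is either "login" or "logout", and a visit is a login followed by the next logout of the same device in the same store. The function names and the demo events are made up for illustration.

```python
from collections import defaultdict


def build_visits(events):
    """Pair each login with the next logout of the same device and store."""
    open_logins = {}
    visits = []
    for ts, store, mac, kind in sorted(events):
        key = (store, mac)
        if kind == "login":
            # Keep the earliest unmatched login for this device.
            open_logins.setdefault(key, ts)
        elif kind == "logout" and key in open_logins:
            start = open_logins.pop(key)
            visits.append((store, mac, start, ts - start))
    return visits


def summarize(visits):
    """Per store: number of visits, unique visitors, average duration."""
    acc = defaultdict(lambda: {"visits": 0, "macs": set(), "total": 0})
    for store, mac, _start, duration in visits:
        entry = acc[store]
        entry["visits"] += 1
        entry["macs"].add(mac)
        entry["total"] += duration
    return {
        store: {
            "visits": e["visits"],
            "unique_visitors": len(e["macs"]),
            "avg_duration_s": e["total"] / e["visits"],
        }
        for store, e in acc.items()
    }


if __name__ == "__main__":
    # Made-up events, not data from the hackathon.
    demo = [
        (100, "buffalo", "aa", "login"), (400, "buffalo", "aa", "logout"),
        (500, "buffalo", "aa", "login"), (600, "buffalo", "aa", "logout"),
        (200, "fonera", "bb", "login"), (260, "fonera", "bb", "logout"),
    ]
    print(summarize(build_visits(demo)))
```

Counting distinct MAC addresses is only a proxy for unique visitors: one person may carry several devices, and a device that never connects is not seen at all.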
CC 2.0 by Qi Wei Fong | http://flic.kr/p/7w8vfq
Results
Now, what did we get?
Visits for stores Buffalo and Fonera
● About 85% of the visits were detected in the Buffalo store
● About 15% were detected in the Fonera store
● Is the Buffalo store in a better location?
Unique visitors
● 135 visits in the Buffalo store by only 9 unique visitors
● 24 visits in the Fonera store by 5 unique visitors
New vs. returning users
● more returning than new users in both stores
● Fonera didn't see a new visitor over the past four days at all
Visit duration over the past 4 days
● Buffalo has more evenly distributed durations
● Fonera shows some peaks
● Visitors tend to stay much longer in the Buffalo store
Conclusion
● Analysing WiFi router log files could be done with a
traditional RDBMS approach as well.
● Such questions can be answered from WiFi router log files
without writing any software.
● Because a test cluster with a few nodes can be ramped up
quickly, a handful of engineers can solve similar problems
within one day.
● It might be possible to track people's paths through a store
by triangulating WiFi router signals.
CC 2.0 by Aurelien Guichard | http://flic.kr/p/cjg9yw
Blog Series:
http://bitly.com/bundles/nkuebler/1
Thank you
