In-Store Analysis with Hadoop

Case Study:
Retail In-Store
Analysis with Hadoop
Nils Kübler, YMC
May 13th 2013

CC 2.0 by Franck BLAIS | http://flic.kr/p/cwVnSy
What is the Status Quo?
What could be possible?
Introduction

Status Quo
What is the KPI in Retail?
→ Revenue per square meter (revenue/m²)

How to bring in more metrics?
Possible sensors for a real store:
● customer frequency counters at doors
● the cashier system
● free WiFi access points
● video capturing
● temperature
● ...
Many of these sensors need additional hardware and software.
⇒ Let's use the free WiFi access points

What type of Questions could we ask?
● How many people visited the store? → unique visitors?
● How many visits did we have?
● What is the average visit duration?
● How many people are new vs. returning?
● ....

CC 2.0 by Ian Carroll | http://flic.kr/p/6NWoGm
How do we answer these questions?
Preparation

Traditional Data Management Approach
At a high level of abstraction, the answer is simple. We need a
data management system with three parts:
1. ingest
2. store
3. process

Blueprint for a Data Management System with Hadoop
We take this basic architecture and replace the generic terms
by mapping them onto the Hadoop ecosystem.
With this Hadoop architecture, a data scientist should be able to
answer the questions without a programming environment, using
familiar BI, analysis, and reporting tools.

CC 2.0 by Perry French | http://flic.kr/p/8wDMJS
What do we need?
Setup

Ingredients
1. 2 WiFi access points to simulate two different stores
2. Flume to move all log messages to HDFS
3. A 4-node CDH4 cluster
4. Pentaho Data Integration's graphical designer for data
transformation, parsing, filtering, and loading into the
warehouse
5. Hive as data warehouse system on top of Hadoop to project
structure onto data
6. Impala for querying data from Hive in real time
7. MS Excel to visualize results

Ingest
● 2 WiFi routers with OpenWRT installed: one Buffalo and one
Fonera
● Installed 4 days before the hackathon to collect some log data
● Syslogs are collected on a central syslog server
● A Flume node collects the syslogs and stores them on HDFS
without any manual intervention (no transformation, no
filtering)
● (Flume can also run as the syslog server itself)

Parsing, Transformation, Filtering, Load
● Raw log data needs to be transformed to CSV
● Many open-source BI tools can help with that: Palo, SpagoBI,
Pentaho, Talend
● We used Pentaho
● Design a MapReduce job for distributed transformation of
the log data with
○ A regular expression to match each line and split it into columns
○ A filter for empty lines
○ A UDF to create the CSV output and the Unix timestamp
● From this data we can easily generate a Hive schema and
store the data in our Hive data warehouse.
Example of the resulting CSV rows:
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake completed (RSN)
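To make the shape of these rows concrete, here is a minimal Python sketch that reads the two sample rows into named fields. The column names are assumptions inferred from the values; the slides do not show the actual Hive schema, and the original transformation was built in Pentaho, not in Python.

```python
import csv
from datetime import datetime, timezone
from io import StringIO

# Assumed column names, inferred from the sample rows (not from the slides).
COLUMNS = ["ts", "year", "month", "day", "hour", "minute", "second",
           "tz_offset", "host", "process", "interface", "mac",
           "subsystem", "message"]

SAMPLE = """\
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake completed (RSN)
"""

def parse_rows(text):
    """Turn CSV text into a list of dicts keyed by the assumed column names."""
    rows = []
    last = len(COLUMNS) - 1
    for fields in csv.reader(StringIO(text)):
        if len(fields) < len(COLUMNS):
            continue  # skip empty or incomplete lines
        row = dict(zip(COLUMNS[:last], fields[:last]))
        # Re-join the tail in case a message itself contains commas.
        row["message"] = ",".join(fields[last:])
        row["ts"] = int(row["ts"])
        row["utc"] = datetime.fromtimestamp(row["ts"], tz=timezone.utc)
        rows.append(row)
    return rows

rows = parse_rows(SAMPLE)
for row in rows:
    print(row["utc"].isoformat(), row["host"], row["mac"], row["message"])
```

The Unix timestamp and the split date columns carry the same instant: the timestamp is in UTC, while the date columns are local time at the listed offset.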

Process
● Data can now be processed by either Hive or Impala
● Create an intermediate table with messages such as login/logout
and the visit duration
● We used Impala to query our data ad hoc and answer our
questions:
○ How many people visited the store (unique visitors)?
○ How many visits did we have?
○ What is the average visit duration?
○ How many people are new vs. returning?
● The output was then loaded into Excel to create charts.
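The four questions can be sketched in a few lines of Python once login/logout events are available. This is an illustration only: the talk used Impala queries, the event tuples below are made-up values, and the definition of "returning" (more than one visit by the same MAC address in the observed window) is an assumption.

```python
from collections import defaultdict

# Hypothetical events: (store, mac, kind, unix_ts). Values are invented.
EVENTS = [
    ("buffalo", "aa:aa", "login", 1000),
    ("buffalo", "aa:aa", "logout", 1600),
    ("buffalo", "bb:bb", "login", 2000),
    ("buffalo", "bb:bb", "logout", 2300),
    ("buffalo", "aa:aa", "login", 5000),
    ("buffalo", "aa:aa", "logout", 5900),
]

def build_visits(events):
    """Pair each login with the next logout of the same device in the same store."""
    open_logins = {}
    visits = []
    for store, mac, kind, ts in sorted(events, key=lambda e: e[3]):
        key = (store, mac)
        if kind == "login":
            open_logins[key] = ts
        elif kind == "logout" and key in open_logins:
            visits.append((store, mac, ts - open_logins.pop(key)))
    return visits

def store_metrics(visits):
    """Compute visits, unique visitors, average duration, new vs. returning."""
    per_store = defaultdict(list)
    for store, mac, duration in visits:
        per_store[store].append((mac, duration))
    metrics = {}
    for store, items in per_store.items():
        visits_per_mac = defaultdict(int)
        for mac, _ in items:
            visits_per_mac[mac] += 1
        metrics[store] = {
            "visits": len(items),
            "unique_visitors": len(visits_per_mac),
            "avg_duration_s": sum(d for _, d in items) / len(items),
            # Assumption: "returning" means more than one visit in the window.
            "returning_visitors": sum(1 for c in visits_per_mac.values() if c > 1),
            "new_visitors": sum(1 for c in visits_per_mac.values() if c == 1),
        }
    return metrics

metrics = store_metrics(build_visits(EVENTS))
print(metrics["buffalo"])
```

Logouts without a matching login are dropped here; a real pipeline would also need a rule for sessions that never log out.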

CC 2.0 by Qi Wei Fong | http://flic.kr/p/7w8vfq
Now, what did we get?
Results

Visits for stores Buffalo and Fonera
● About 85% of the visits were detected in the Buffalo store
● About 15% were detected in the Fonera store
● Is the Buffalo store in a better location?

Unique visitors
● 135 visits in the Buffalo store by only 9 unique visitors
● 24 visits in the Fonera store by 5 unique visitors
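As a quick check, these counts can be related to the visit shares reported for the two stores. The short Python sketch below only does arithmetic on the numbers from the slides; the variable names are illustrative.

```python
# Visit and unique-visitor counts as reported on the slides.
VISITS = {"buffalo": 135, "fonera": 24}
UNIQUE_VISITORS = {"buffalo": 9, "fonera": 5}

total_visits = sum(VISITS.values())

# Share of all visits per store, in percent.
share = {store: 100 * count / total_visits for store, count in VISITS.items()}

# Average number of visits per unique visitor.
visits_per_visitor = {store: VISITS[store] / UNIQUE_VISITORS[store]
                      for store in VISITS}

for store in VISITS:
    print(f"{store}: {share[store]:.1f}% of visits, "
          f"{visits_per_visitor[store]:.1f} visits per unique visitor")
```

The shares come out at roughly 85% and 15%, which matches the split reported above, and each Buffalo visitor accounts for about three times as many visits as a Fonera visitor.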

New vs. returning users
● More returning than new users in both stores
● Fonera didn't see a single new visitor over the past four days

Visit duration over the past 4 days
● Buffalo has more evenly distributed durations
● Fonera shows some peaks
● Visitors tend to stay in the Buffalo store much longer

Conclusion
● Analysing WiFi router log files could be done with a
traditional RDBMS approach as well.
● Answering such questions based on WiFi router log files can
be done without writing code.
● Because a test cluster with a few nodes can be ramped up
quickly, similar problems can be solved within one
day by a handful of engineers.
● It could be possible to track people's paths based on WiFi
router signals using triangulation.

CC 2.0 by Aurelien Guichard | http://flic.kr/p/cjg9yw
Blog Series:
http://bitly.com/bundles/nkuebler/1
Thank you
