4. Problem: Product Recommendation
Type of Data | Source of Data
Product Information | Product Catalogue
Customer Information | Customer Data (demographic)
Customer Purchase History | Transactional Data (RDBMS and HDFS)
User Activity Tracking Data | Third-party hybrid cloud data: heat map, click data, demographic (heat map tools: Crazy Egg, Google Analytics, etc.); Real Time Bidding (RTB) ad inventory data (e.g. Deltax); on-premise user activity logs, e.g. browser cookies, local storage data
Social Network Activity (analytics data about products, likes, usage, shares, no. of participants, etc.) | Social Networks (Facebook, Twitter, Google+, Instagram)
5. Is it a Big Data Problem: Product Recommendation
Characteristic | Applies | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of data generation and analysis for transactional and social network activities; speed of capturing browser cookie data; speed of structured/unstructured data generation and analysis
Variety | Yes | Use of incompatible and non-integrated data from heterogeneous sources such as customer purchase data, activity logs, social network
6. ML Approach: Product Recommendation
Machine Learning Problem | Reasoning
Unsupervised | The problem requires deriving product recommendations from observed similarity among customer data.
Clustering | We will cluster customers by similar attributes (browsing patterns, purchase history, demographic information, and behavioral data) based on the observed data set. [K-means clustering]
Recommendation | Within a cluster, we will use user-based collaborative filtering, since recommendations are driven by customer attributes. [User-based Collaborative Filtering]
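The two steps in this table (cluster customers, then apply user-based collaborative filtering inside a cluster) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the deck's implementation: the customer names, the feature vectors, the purchase sets, and the function names `kmeans`, `jaccard`, and `recommend` are all hypothetical, and the k-means seeding is deliberately simplified.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster index per point."""
    # Simplification: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def jaccard(a, b):
    """Similarity between two customers' purchase sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, peers, top_n=3):
    """User-based CF: score items that similar peers bought and the target did not."""
    scores = {}
    for peer in peers:
        if peer == target:
            continue
        similarity = jaccard(purchases[target], purchases[peer])
        for item in purchases[peer] - purchases[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical data: (age, site visits per week) and purchase history.
customers = ["alice", "bob", "carol", "dave", "erin"]
features = [(25, 2.0), (27, 2.5), (26, 1.5), (60, 9.0), (62, 8.5)]
purchases = {
    "alice": {"phone", "case"},
    "bob": {"phone", "case", "charger"},
    "carol": {"phone", "earbuds"},
    "dave": {"tv", "soundbar"},
    "erin": {"tv"},
}

labels = kmeans(features, k=2)
cluster_of = dict(zip(customers, labels))
peers = [c for c in customers if cluster_of[c] == cluster_of["alice"]]
suggestions = recommend("alice", purchases, peers)
print(suggestions)
```

In practice the features would be scaled before clustering so that no single attribute dominates the distance, and the purchase similarity could be replaced by any other user-to-user measure.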
7. Big Data Components: Product Recommendation
Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (activity logs, purchase history, user information, social analytics data, etc.) in raw format.
Sqoop | Transfers transactional data into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of user activity log data.
Chukwa | Moves the log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a key-value structured format.
Mahout/R-Hadoop | For product recommendation we can use Mahout's core algorithms for clustering, classification, and batch-based collaborative filtering.
ZooKeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
9. Problem: Demand Analysis and Forecasting for existing product line
Type of Data | Source of Data
Product Information | Product Catalogue
Customer Information | Customer Data (demographic)
Product purchase information, inventory lifetime, wish list, product sales volume | Transaction Data (RDBMS and HDFS)
Social Network Activity (analytics data about products, likes, usage, shares, no. of participants, etc.) | Social Networks (Facebook, Twitter, Google+, Instagram)
10. Is it a Big Data Problem: Demand Analysis and Forecasting for existing product line
Characteristic | Applies | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of data generation and analysis for transactional and social network activities; speed of capturing product inventory lifetime, point of sale (POS), and sales volume; speed of structured/unstructured data generation and analysis
Variety | Yes | Use of incompatible and non-integrated data from heterogeneous sources such as customer purchase data, activity logs, social network
11. ML Approach: Demand Analysis and Forecasting for existing product line
Machine Learning Problem | Reasoning
Supervised | Our target is to determine the future demand for merchandise.
Prediction | We are predicting the future demand for merchandise.
Regression | Based on the observed data set, we predict future demand by establishing the correlation between the data set and the outcome. [Linear Regression Tree]
Time Series | We establish a pattern of merchandise demand over continuous time intervals, based on the correlation between demand and the observed data set. [ARIMA parametric time series modeling]
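The Regression and Time Series rows can be illustrated with two tiny hand-rolled fits. This is a sketch under stated assumptions, not the deck's method: `fit_line` is ordinary least squares with one predictor, and `forecast_next` fits an AR(1) model to first differences by least squares, which corresponds to ARIMA(1,1,0) without a constant term. The function names and the sample numbers are hypothetical; a real project would use a full ARIMA implementation from a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def forecast_next(series):
    """One-step forecast from an AR(1) model on first differences."""
    # Differencing removes the level, leaving period-to-period changes.
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Least-squares estimate of phi in d[t] = phi * d[t-1] + noise.
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev ** 2 for prev in diffs[:-1])
    phi = numerator / denominator
    # Undo the differencing: next level = last level + predicted change.
    return series[-1] + phi * diffs[-1]


# Hypothetical data: promotion spend vs. units sold, and a demand history.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
next_demand = forecast_next([10, 12, 16, 24])
print(intercept, slope, next_demand)
```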
12. Big Data Components: Demand Analysis and Forecasting for existing product line
Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (purchase history, product inventory lifetime, wish list, user information, social analytics data, etc.) in raw format.
Sqoop | Transfers transactional data, product inventory lifetime, POS, wish list, etc. into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of product activity log and purchase log information.
Chukwa | Moves the log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a key-value structured format.
Mahout/R-Hadoop | Time series data consists of four components: trend, season, cycle, and noise. We need to estimate the trend and seasonal components (e.g. day of week or month of year) for a specific region or location from the data and use them to forecast the future. Used together, the ML packages allow quick and effective forecasting.
ZooKeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
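The trend and seasonal estimation described in the Mahout/R-Hadoop row can be sketched as a simple additive decomposition. The sketch below is an assumption-laden illustration, not what Mahout or R does internally: it estimates a linear trend from the change in per-cycle averages, takes the average detrended value at each position in the cycle as the seasonal offset, and ignores the cycle and noise components. The names `fit_additive` and `forecast` and the sample series are hypothetical; for day-of-week seasonality the period would be 7.

```python
def fit_additive(series, period):
    """Estimate a linear trend plus additive seasonal offsets.

    Assumes the series covers at least two complete cycles.
    """
    n = len(series)
    cycles = n // period
    if cycles < 2 or n % period:
        raise ValueError("need at least two complete cycles")
    # Averaging over a full cycle cancels the seasonal offsets,
    # so the change between cycle averages reflects the trend only.
    cycle_means = [
        sum(series[c * period:(c + 1) * period]) / period for c in range(cycles)
    ]
    slope = (cycle_means[-1] - cycle_means[0]) / (period * (cycles - 1))
    intercept = sum(series) / n - slope * (n - 1) / 2
    # Seasonal offset: mean detrended value at each position in the cycle.
    seasonal = []
    for pos in range(period):
        residuals = [series[t] - (intercept + slope * t) for t in range(pos, n, period)]
        seasonal.append(sum(residuals) / len(residuals))
    return intercept, slope, seasonal


def forecast(model, t, period):
    """Forecast for time index t: trend value plus seasonal offset."""
    intercept, slope, seasonal = model
    return intercept + slope * t + seasonal[t % period]


# Hypothetical demand series with a period of 3 observations.
demand = [105, 97, 104, 111, 103, 110, 117, 109, 116]
model = fit_additive(demand, period=3)
next_value = forecast(model, t=len(demand), period=3)
print(model, next_value)
```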
14. Problem: Customer Churn
Type of Data | Source of Data
Customer Purchase History | Transaction Database
Customer Complaints (rating, sentiment score, etc.) | Complaint data (NoSQL, e.g. MongoDB)
User Activity (page navigation, product catalogue visits) | Heat map, click data, navigation data, demographic (heat map tools: Crazy Egg, Google Analytics, etc.); Real Time Bidding (RTB) ad inventory data (e.g. Deltax)
User Activity (e.g. wish list, abandoned cart) | User Activity Logs
Comparative Product Analysis (reviews, price, product description, etc.) | Third-party vendor data, e.g. Compareraja.in, compare.buy.hatke.com
Customer Sentiment Score | Aggregated data from different social networks (Facebook, Twitter, Google+, Instagram)
Customer Loyalty | Transaction Database, User Activity Logs
15. Is it a Big Data Problem: Customer Churn?
Characteristic | Applies | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of data generation and analysis for transactional, sentiment, and social network activities
Variety | Yes | Use of incompatible and non-integrated data from heterogeneous sources such as customer purchase data, activity logs, social network
16. ML Approach: Customer Churn
Machine Learning Model | Reasoning
Supervised | Our target is to determine whether a customer will churn or not.
Classification | The problem asks whether a customer will churn or not, which is a categorical outcome.
Binary | The outcome has exactly two classes: the customer churns or does not. [Decision Tree]
Unbiased | The initial probability of churn is taken to be equally positive and negative, so the problem falls under an unbiased model. [C5.0]
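To make the decision tree rows concrete, here is a minimal sketch of how a tree picks its first split using entropy-based information gain, the kind of criterion used by the ID3/C4.5 family from which C5.0 descends. It is not C5.0 itself, which adds pruning, boosting, and other refinements. The feature names `complained` and `abandoned_cart`, the sample rows, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    labels = [label for _, label in rows]
    gain = entropy(labels)
    for value in {features[feature] for features, _ in rows}:
        subset = [label for features, label in rows if features[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


def best_split(rows):
    """Feature with the highest information gain: the root of the tree."""
    return max(rows[0][0], key=lambda feature: information_gain(rows, feature))


# Hypothetical training rows: (customer features, churned?).
rows = [
    ({"complained": True, "abandoned_cart": True}, True),
    ({"complained": True, "abandoned_cart": False}, True),
    ({"complained": False, "abandoned_cart": True}, False),
    ({"complained": False, "abandoned_cart": False}, False),
    ({"complained": True, "abandoned_cart": True}, True),
    ({"complained": False, "abandoned_cart": True}, True),
]

root = best_split(rows)
print(root, round(information_gain(rows, root), 3))
```

A full tree would apply the same selection recursively to each subset until the labels in a node are pure or no features remain.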
17. Big Data Components: Customer Churn
Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (purchase history, activity logs, competitive analysis data, aggregated social data, RTB data, etc.) in raw format.
Sqoop | Transfers transactional data, real-time wish list, and cart information into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of product activity log and purchase log information.
Chukwa | Moves the log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a key-value structured format.
Mahout/R-Hadoop | To predict customer churn we can use the Decision Tree / C5.0 algorithm.
NLP Toolkit (nltk.org)/IBM Watson | Can be used to parse customer feedback and comments about products to derive sentiment scores and insight analysis data, then feed the output to Hadoop.
ZooKeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
18. [Architecture diagram]
Data sources: Product & Service Offerings; Customer Profile; Customer Feedback/Social Media; Account Transactions; Customer Service Logs & Surveys; Marketing Campaigns
Components: Hadoop cluster; HDFS; Big Data Infrastructure; Visualization; Analytics Systems; NLP Data Processing
19. Assumptions
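Type | Assumption | Value (M = millions, MB = megabytes) | Reference
Baseline Assumptions | No. of Online Customers | 100 M | http://goo.gl/hHb66n
Baseline Assumptions | Our Market Share | 25 M | Assume 25% share
Baseline Assumptions | No. of Products | 12 M | -
Problem Space Assumptions | Customers' Growth Rate | 40% | http://goo.gl/pm9ydJ
Problem Space Assumptions | Growth Rate of Products | 15% | Avg
Problem Space Assumptions | Avg Monthly Transactions | 9 M | http://tinyurl.com/gw9dm43
Problem Space Assumptions | Avg Monthly Complaints | 0.12 M | Assume 0.01%
Data/Infrastructure | Avg Customer Info Size | 1 MB | -
Data/Infrastructure | Avg Complaint Info Size | 0.5 MB | -
Data/Infrastructure | Avg Data Node RAM Size | 8 GB | -
Data/Infrastructure | Replication Factor | 3 | -
Data/Infrastructure | Data Block Size | 128 MB | -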
20. Capacity Planning
Problem | Metric | Value
Product Recommendation | No. of Data Nodes | 23713
Product Recommendation | RAM Capacity | 2145 GB
Demand Forecasting | No. of Data Nodes | 23713
Demand Forecasting | RAM Capacity | 2145 GB
Customer Churn | No. of Data Nodes | 23724
Customer Churn | RAM Capacity | 2147 GB
Detailed Planning: Microsoft Excel Worksheet
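The worksheet's formulas are not reproduced in this deck, so the sketch below does not attempt to re-derive the node and RAM figures above. It only shows the kind of arithmetic that HDFS capacity planning usually starts from, using the stated 5932 TB total, replication factor of 3, and 128 MB block size. The binary unit conversion and the `usable_tb_per_node` parameter (with its example value of 4 TB) are assumptions introduced for illustration, as are the function names.

```python
import math

TOTAL_DATA_TB = 5932      # from the deck: data generated over 7 years
REPLICATION_FACTOR = 3    # from the Assumptions slide
BLOCK_SIZE_MB = 128       # from the Assumptions slide
MB_PER_TB = 1024 * 1024   # assumption: binary units


def raw_storage_tb(data_tb, replication):
    """Disk space needed once every block is stored `replication` times."""
    return data_tb * replication


def block_count(data_tb, block_mb):
    """Number of HDFS blocks before replication."""
    return math.ceil(data_tb * MB_PER_TB / block_mb)


def data_nodes_needed(data_tb, replication, usable_tb_per_node):
    """Node count for a hypothetical usable disk capacity per node."""
    return math.ceil(raw_storage_tb(data_tb, replication) / usable_tb_per_node)


raw_tb = raw_storage_tb(TOTAL_DATA_TB, REPLICATION_FACTOR)
blocks = block_count(TOTAL_DATA_TB, BLOCK_SIZE_MB)
nodes = data_nodes_needed(TOTAL_DATA_TB, REPLICATION_FACTOR, usable_tb_per_node=4)
print(raw_tb, blocks, nodes)
```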