Northeastern University, Class 7250: Big Data Architecture and Governance (assignment).
A big data project proposal based on a Starbucks case study.
2. “Your brand is what people say about you when you’re not in the room.”
(Jeff Bezos)
3. Project Goal and Objectives
Goal: Better understand customer sentiment toward the Starbucks brand and
services by applying AI technologies to social media and customer-service
email data.
Objectives:
• Improve products and service delivery based on customer feedback
• Measure the effectiveness of marketing campaigns
4. Value Proposition
• Cost of NOT implementing:
• Roughly $52.5 million per quarter in forgone revenue (1% of quarterly revenue)
• Strategic Initiative:
• Aligns with a core company value: best customer experience
• Helps the company adopt the AI revolution
• Improves marketing-campaign effectiveness through feedback analytics
5. What are the obstacles to getting there?
• The company does not own the technology and resources required to
capture social media data and process it for business value.
• The company does not have talent with the technical expertise required
for such a project.
6. Who are the key stakeholders and what are
their roles?
• Enterprise Data Governance Office (Data Owner)
• Data Governance Council (Policy Design and Decision Body)
• Chief Data Officer (execution and management; ensures data compliance)
• Data Steward
• Project Manager (Responsible for timely delivery)
• HR Office
• Talent Management (allocates resources and hires talent)
• Sales Office
• Provides access to customer complaint email data
• Marketing Office
• Project Service Consumer
• CEO Office
• Project Service Consumer
8. Starbucks Email Addresses for Customer Requests/Complaints
• Online Email Forms
• Company Information
• Starbucks in the Grocery Aisle
• Nutritional Information
• Starbucks.com Web Site
• Mobile Applications
• In Our Stores
• Starbucks Rewards
• Starbucks Cards
• Security Video Request
• security@starbucks.com
9. Data Volume
Data Source   Velocity (per month)      Avg. Record Size   Total Size
Twitter       40,000,000 tweets         80 bytes           3.2 GB
Facebook      48,000,000 engagements    200 bytes          9.6 GB
Instagram     40,000,000 engagements    150 bytes          6 GB
Google+       1,000,000 engagements     90 bytes           0.09 GB
YouTube       158,000 engagements       130 bytes          0.02 GB
Emails        100,000 emails            2,000 bytes        0.2 GB
Total monthly data volume: 19.11 GB / month
(Velocity estimates are based on a Pareto 80-20 rule assumption.)
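Each per-source total above is simply record count times average record size; a quick sketch verifying the monthly total from the table's own figures (using 1 GB = 10^9 bytes):

```python
# Monthly volume per source = record count x average record size (bytes).
# Counts and sizes are taken directly from the table above.
sources = {
    "Twitter":   (40_000_000, 80),
    "Facebook":  (48_000_000, 200),
    "Instagram": (40_000_000, 150),
    "Google+":   (1_000_000, 90),
    "YouTube":   (158_000, 130),
    "Emails":    (100_000, 2000),
}

def monthly_gb(count, avg_bytes):
    """Monthly volume for one source, in gigabytes (1 GB = 10^9 bytes)."""
    return count * avg_bytes / 1e9

total_gb = sum(monthly_gb(c, b) for c, b in sources.values())
print(round(total_gb, 2))  # 19.11
```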
14. Vendor Selection Strategy
• Reduce the cost of ownership (Available as cloud services)
• Reliable and industry proven
• Enterprise support availability
• Easy to scale
15. Big Data Technology Stack
Security
Management
Framework
Database
Data Access
Visualization
16. Data Ingestion Queue: Kafka
• Kafka is a distributed streaming platform that is scalable and fault-tolerant.
• It is one of the most widely used open-source queuing platforms for
delivering streaming data via a publish-subscribe model.
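The publish side can be sketched as follows. This is a minimal sketch, assuming the `kafka-python` client; the broker address, topic name (`social-raw`), and record fields are illustrative, not part of the proposal:

```python
import json
from datetime import datetime, timezone

def to_kafka_payload(source, post_id, text):
    """Serialize one social-media record to the JSON bytes we would publish."""
    record = {
        "source": source,            # e.g. "Twitter"
        "id": post_id,
        "text": text,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record).encode("utf-8")

def publish(payloads, broker="localhost:9092", topic="social-raw"):
    """Publish payloads to Kafka. Requires a running broker; not run here."""
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=broker)
    for payload in payloads:
        producer.send(topic, payload)
    producer.flush()
```

The pub-sub model lets the cleaning job (a separate consumer group) read the same topic independently of any other downstream consumer.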
17. Data Cleaning and NLP Processing: Spark
• Apache Spark is a fast, distributed data-processing platform.
• It provides built-in support and libraries for data processing and
machine-learning applications.
• It is the industry's de facto tool for big data processing.
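The per-record validation and normalization that the Spark streaming job would apply (for example, inside a map over the Kafka stream) can be sketched in plain Python; the field names here are illustrative assumptions, not a fixed schema:

```python
import json

# Fields a record must carry to be considered valid (an assumption).
REQUIRED = ("source", "id", "text")

def clean_record(raw_bytes):
    """Validate and normalize one raw record; return a dict, or None to drop it."""
    try:
        record = json.loads(raw_bytes)
    except (ValueError, UnicodeDecodeError):
        return None  # drop malformed JSON
    if not all(record.get(key) for key in REQUIRED):
        return None  # drop incomplete records
    record["text"] = " ".join(record["text"].split())  # collapse whitespace
    record["source"] = record["source"].lower()        # normalize source name
    return record
```

Records that survive this step would then be written to HDFS for the NLP stage.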
18. Data Landing and Storage: Hadoop
• Hadoop is a distributed storage and computation framework that has
grown from a minimal platform into a large ecosystem of tools.
• It is ideal for big data storage thanks to its distributed file system,
HDFS.
19. Data Archiving: Amazon S3
• Amazon Web Services provides an object storage service called "Simple
Storage Service" (S3).
• S3 supports storage of very large datasets and is a de facto industry
standard for archiving historical data.
• S3 provides a strong management interface and data-access security.
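A date-partitioned key scheme keeps archived objects easy to list and to expire by retention period. A minimal sketch; the bucket prefix and file naming are assumptions, and the actual upload would use an S3 client such as boto3 (not shown):

```python
from datetime import date

def archive_key(source, day, prefix="archive"):
    """Build a date-partitioned S3 object key for one day's data from a source.

    Example shape: archive/twitter/2018/01/31/part-0000.json.gz
    """
    return f"{prefix}/{source}/{day:%Y/%m/%d}/part-0000.json.gz"

print(archive_key("twitter", date(2018, 1, 31)))
# archive/twitter/2018/01/31/part-0000.json.gz
```

Partitioning by source and day means the retention policies in the archival plan (different periods for emails and social data) can be applied by deleting whole prefixes.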
20. Database Engine: Cassandra
• Apache Cassandra is a distributed NoSQL database optimized for very
fast analytical queries.
• DataStax provides an enterprise management platform for Apache
Cassandra.
• Cassandra does not support ad-hoc queries or joins, but it is extremely
fast at simple query processing and very easy to scale.
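Because Cassandra has no joins and queries must follow the partition key, the table would be designed around the read path (sentiment by source and day). A sketch of such a schema; the keyspace, table, and column names are illustrative assumptions:

```sql
-- Illustrative CQL schema; names and replication settings are assumptions.
CREATE KEYSPACE IF NOT EXISTS feedback
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS feedback.sentiment_by_source (
  source    text,     -- twitter, facebook, email, ...
  day       date,
  post_id   text,
  sentiment double,   -- score appended by the Spark-ML app
  topic     text,     -- topic categorization
  PRIMARY KEY ((source, day), post_id)
);
```

With `(source, day)` as the partition key, a dashboard query for one source on one day hits a single partition, which is the access pattern Cassandra is fastest at.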
21. Visualization: QlikView
• Qlik is one of the market-leading visualization tools for analytics
and real-time reporting.
22. Monitoring: ELK Stack
• The ELK stack is a de facto industry standard for log monitoring.
• Elasticsearch, a document database built on the Lucene search engine,
provides easy keyword search over logs.
• Kibana is a visualization and dashboard tool for monitoring application
and infrastructure logs visually in real time.
23. Management: Oozie & Yarn
• Hadoop YARN launches Spark jobs across a multi-node cluster and is key
to managing distributed data-processing tasks on the Spark cluster.
• Oozie is a workflow management tool for orchestrating the different
applications into a single workflow.
24. Security: Apache Ranger
• Apache Ranger provides comprehensive security across the Apache
Hadoop ecosystem.
• It supports multiple authorization models on Hadoop, including
role-based and attribute-based access control.
• It also centralizes auditing of user access and security-related
administrative actions across all Hadoop components.
26. What the System Does
• Ingestion:
• A Java streaming app will consume the streaming APIs of Facebook, Twitter,
Google+, Instagram, and YouTube and publish the data to a Kafka queue.
• A Java batch app will import each day's emails every night and publish
them to Kafka.
• Data Cleaning:
• A Spark streaming app will consume data from the Kafka queue, validate
records, and normalize their format. The cleaned data will be written to a
Hadoop HDFS cluster.
• NLP Processing:
• A Spark-ML app will read the cleaned data from HDFS, append a sentiment
score and topic categorization, and save the results to Cassandra.
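The sentiment step above would use a trained Spark-ML model; as a stand-in to illustrate the shape of the output (a score appended to each record), here is a minimal lexicon-based sketch, with word lists that are purely illustrative:

```python
# Illustrative stand-in for the Spark-ML sentiment model, NOT the real
# pipeline: counts positive vs. negative words from tiny example lexicons.
POSITIVE = {"love", "great", "good", "awesome", "delicious"}
NEGATIVE = {"hate", "bad", "awful", "slow", "rude"}

def sentiment_score(text):
    """Score in [-1, 1]: (positive - negative) / total sentiment-bearing words."""
    words = text.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits

print(sentiment_score("love the great coffee but rude staff"))  # 1/3
```

In the real pipeline, this function's role is played by a model scoring each cleaned record before the row (with its sentiment and topic) is written to Cassandra.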
27. What the System Does
• Data Access
• Cassandra provides JDBC and REST APIs for querying sentiment scores and
category analytics.
• Existing visualization products can access the data through either the
JDBC or REST API for custom business questions and analytics.
• Data Backup and Archiving
• Amazon S3 will be used for periodic backup and archiving of data.
• Scalability, Monitoring, Availability and Security
• Apache Ranger provides role-based and group-based authorization for
data access.
• The ELK stack is used to monitor logs.
• All platform vendors were selected for their scalability and
availability.
28. Capacity Planning for 6 Months
Storage:
• Data size: 120 GB
• With a replication factor of 3: 3 × 120 GB = 360 GB
• Amazon S3 Standard storage (under the 50 TB cap)
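The 120 GB figure follows from the earlier volume estimate; a quick check, assuming the ~19.11 GB/month figure is rounded up to 20 GB/month for planning headroom:

```python
import math

# Six-month storage estimate from the ~19.11 GB/month figure, rounded up
# to a whole number of GB per month; replication factor 3 as in the plan.
monthly_gb = 19.11
months = 6
replication = 3

raw_gb = math.ceil(monthly_gb) * months   # 20 * 6 = 120 GB
replicated_gb = replication * raw_gb      # 360 GB
print(raw_gb, replicated_gb)  # 120 360
```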
30. Data Backup and Data Archival Strategy
• Data Archival
• Dataset Selection for Archival (we are not going to archive everything)
• Data Archiving Stage: 2 years (data moves to the archive after 2 years)
• Email Record Retention: 7 years
• Social Media Data Retention: 5 years
• Data Backup
• Data Backup Interval: 24 hr
• Total Data Backup History Retention: 5 (keep last 5 backups)
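The keep-last-5 rule can be sketched as a simple pruning function over dated backup names; the naming convention (`backup-YYYY-MM-DD`) is an illustrative assumption under which lexical order equals chronological order:

```python
def backups_to_delete(backup_names, keep=5):
    """Return the oldest backups beyond the newest `keep`, for deletion."""
    ordered = sorted(backup_names)  # oldest first (lexical == chronological)
    return ordered[:-keep] if len(ordered) > keep else []

# Seven daily backups -> the two oldest are pruned.
names = [f"backup-2018-04-{d:02d}" for d in range(1, 8)]
print(backups_to_delete(names))  # ['backup-2018-04-01', 'backup-2018-04-02']
```

Run once per 24-hour backup interval, this keeps the retention history bounded at 5 copies.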
31. GDPR Compliance Plan
• Data Control
• Data Security
• Right to Erasure
• Risk Mitigation and Due Diligence
• Breach Notification
Responsibility for meeting GDPR compliance rests with the Chief Data Officer.