Donghui Zhang
dzhang@BigAnalyticsPlatform.com
2017-5-4
Host: NECINA DIG
Co-Host: MIT CSSA
Your Background
• Familiar with big-data analytics?
  - Value = show you what’s “under the hood”.
• Familiar with big-data platform?
  - Mostly review; Value = think about my opinions.
• Just curious?
  - Value = general awareness.
• Not interested in big data?
  - You are in the wrong room.
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Disclaimer
• The opinions expressed on this site are mine and do not necessarily represent those of my employer.
• BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook.
Content
• Why
• What
• History
• Technical How-Tos
• Career Advice
• Conclusions
Why Big Data? Data Grows Fast
• Data in the world:
  - 10 billion TB
  - 90% was produced in the last 2 years!
Source: Mikal Khoso. “How Much Data is Produced Every Day?”
http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day
Why Big-Data Platform?
• Platform can be a competitive advantage.
• Enable junior developers to quickly create robust applications.
• Google thinks of itself as a systems engineering company.
Quote source: Todd Hoff. “Google Architecture”.
http://highscalability.com/google-architecture
All biggies have big data platforms
Chart: market cap (billion $) by company and year founded. Data source: Yahoo Finance on 1/3/2017.

Company    Founded  Market cap (billion $)
IBM        1911     159
Samsung    1938     208
Intel      1968     174
SAP        1972     106
Microsoft  1975     504
Apple      1976     616
Oracle     1977     156
Amazon     1994     357
Google     1998     547
Tencent    1998     234
Alibaba    1999     222
Facebook   2004     338
Same chart, with callout: top 3 cloud service providers
Same chart, with callout: Larry Ellison: “Amazon’s lead is over”
Same chart, with callout: Apple “Pie”
Same chart, with callout: Samsung bought Joyent
Same chart, with callout:
Alibaba 2015: 377 sec (3,377 nodes Apsara)
Tencent 2016: 134 sec (512 nodes OpenPower)
Gray sort. See http://sortbenchmark.org
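To put the two Gray sort results side by side, the following is a small arithmetic sketch. It assumes both runs sorted the benchmark's 100 TB minimum input (the slide itself only gives seconds and node counts); the function names are illustrative.

```python
# Gray sort figures quoted on the slide: elapsed seconds and node count.
# INPUT_TB = 100 is an assumption (the benchmark's minimum input size);
# the slide does not state the data volume.
INPUT_TB = 100

results = {
    "Alibaba 2015 (Apsara)": {"seconds": 377, "nodes": 3377},
    "Tencent 2016 (OpenPower)": {"seconds": 134, "nodes": 512},
}


def sort_rate_tb_per_min(seconds, input_tb=INPUT_TB):
    """Cluster-wide sort rate in TB per minute."""
    return input_tb / seconds * 60


def per_node_mb_per_sec(seconds, nodes, input_tb=INPUT_TB):
    """Average data sorted per node per second, in MB (1 TB = 1e6 MB)."""
    return input_tb * 1e6 / seconds / nodes


for name, r in results.items():
    rate = sort_rate_tb_per_min(r["seconds"])
    node_rate = per_node_mb_per_sec(r["seconds"], r["nodes"])
    print(f"{name}: {rate:.1f} TB/min overall, {node_rate:.0f} MB/s per node")
```

Under that assumption, the later run is faster overall while using far fewer nodes, so the per-node rate differs much more than the headline times do.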
What is Big Data?
• Big data sets
  - e.g. “This year our users uploaded 10X more videos; we have big data now.”
  - big volume, big variety, or big velocity
  - exceed existing data processing capabilities
• Big data analytics
  - e.g. “We use big data to predict stock trends.”
• Big data stack
  - software
  - platform
  - infrastructure
The Big Data Stack (top to bottom)
• Analytics: Reports, docs, ad hoc scripts...
• Analytics Software: Think SaaS such as Microsoft Office 365. Software that Data Scientists can use.
• Platform: Think PaaS such as Google App Engine. A platform for developing software.
• Infrastructure: Think IaaS such as AWS EC2. Networked VMs.
Google Stack
• Products: search, advertising, gmail, docs, maps, youtube, cloud platform, …
• Platform: GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby, Borg/Omega
• Infrastructure: Custom-built machines; RedHat Linux
Sample Open-Source Stack
• Analytics: Python scripts
• Analytics Software: Tableau, scikit-learn
• Platform: Spark on YARN with Hive
• Infrastructure: VMs
5 V’s of Big Data
• Volume
• Variety
• Velocity
• Veracity
• Value
5V’s source: Jason Williamson. “The 4 V’s of Big Data”.
http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data
5 V’s of Big Data: Volume
“Your small data can be my big data!”
5 V’s of Big Data: Variety
Lessons
• A key feature missing in RDBMS is variety.
RDBMS guru: “Put your data in a database!”
Scientist: “My data is not relational.”
RDBMS guru: “Make your data relational!”
Scientist: “But it is not relational!”
5 V’s of Big Data: Velocity
Streaming.
ETL → ELT: Load first, transform later.
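The "load first, transform later" idea can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for a data lake or warehouse; the table name, columns, and sample rows are made up for the example.

```python
import sqlite3

# Raw events arrive as untyped strings, including a dirty value.
raw_events = [
    ("2017-05-01", "alice", "12.50"),
    ("2017-05-01", "bob", "n/a"),  # kept as-is at load time
    ("2017-05-02", "alice", "7.25"),
]

conn = sqlite3.connect(":memory:")

# Load: store everything as text, with no cleaning or schema decisions yet.
conn.execute("CREATE TABLE raw_events (day TEXT, user TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)


def daily_totals(connection):
    """Transform: cast and filter inside the store, only when a question arrives."""
    return connection.execute(
        """
        SELECT day, SUM(CAST(amount AS REAL)) AS total
        FROM raw_events
        WHERE amount GLOB '[0-9]*'   -- skip rows whose amount is not numeric
        GROUP BY day
        ORDER BY day
        """
    ).fetchall()


print(daily_totals(conn))
```

Because the raw rows are preserved, a different transformation (for example, counting the dirty rows) can be written later without reloading the data.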
5 V’s of Big Data: Value
Lessons
• Do big data for increasing business value, not for tech.
• Read a book on building a startup.
“If you are going to use a big data system for yourself, see if it is faster than your laptop.” Frank McSherry
Source: Frank McSherry. “Scalability! But at what COST?”
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
5 V’s of Big Data: Veracity
Lessons
• Use Data Lakes, not Data Swamps.
• Read Russom’s “Best Practices for Data Lake Management”.
Data scientist: “My analysis suggested this billion-dollar action.”
Manager: “Where was the data from?”
Source: Philip Russom. “Best Practices for Data Lake Management”.
https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
Big Data History
“What goes around comes around.” Mike Stonebraker
“Everything has prior art.” David DeWitt
• 1969: relational model (Edgar F. Codd*)
• 1976: System R by IBM (Jim Gray*; transactions)
• 1986: Postgres (Mike Stonebraker*; ADT)
• 1990: Gamma (David DeWitt; shared nothing)
• 2004: MapReduce (Jeff Dean; flexibility)
• 2005: “One size doesn’t fit all” (Mike Stonebraker)
• 2006: Hadoop (Doug Cutting)
• 2011: Spark (Matei Zaharia)
• 2017: Death of shared nothing (David DeWitt)
* Turing Award Winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
Lessons
• Don’t reinvent the wheel.
• Read the editors’ intro for “the red book”.
• Read "Architecture of a Database System".
• Study favorite posts on HighScalability.
The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed.
http://www.redbook.io
HighScalability: http://highscalability.com/all-time-favorites
How to Scale to Many Servers?
• When your data is small
  [Diagram: clients, server]
• Use a load balancer
  [Diagram: clients, LB, servers]
• Round-Robin DNS, Point of Presence, multi-level LB.
  [Diagram: clients, POPs, LB, servers]
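As a rough illustration of the load-balancer step, here is a minimal round-robin sketch. The class, server names, and request labels are hypothetical; a real load balancer would also handle health checks, weights, and connection state.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation, one per incoming request."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("need at least one server")
        self._rotation = cycle(servers)

    def pick(self):
        return next(self._rotation)


def route(requests, balancer):
    """Pair each request with the server chosen for it."""
    return [(request, balancer.pick()) for request in requests]


lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = route([f"req-{i}" for i in range(6)], lb)
load = Counter(server for _, server in assignments)
print(load)
```

Round-robin DNS applies the same rotation one level up: the name resolves to a rotating list of addresses (for example, of POPs or load balancers), each of which can run its own balancer, giving the multi-level arrangement named on the slide.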
Google Cluster at the Beginning
Image source: Abhijeet Desai. "Google Cluster Architecture".
http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
Google Belgium Data Center
Image source: Malte Schwarzkopf. "What does it take to make Google work at scale".
https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
Google Data Centers
• About 40 data centers
• About 2 million machines
• Machines are organized in containers each having 1,160 machines
  - 30 racks of 40 machines
  - Sometimes double stacked
Data sources:
James Pearn, “How many servers does Google have?”
https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY
“Learn How Google Works: in Gory Detail”.
http://www.ppcblog.com/how-google-works
Google Data Size
• Data too large
  - 130 trillion pages
  - Index 100 PB (stacking 2TB drives up: 0.8 mile)
• Demand too much
  - 3 billion searches per day (or 35K per second)
Data sources:
https://www.google.com/insidesearch/howsearchworks/thestory
http://www.seobook.com/learn-seo/infographics/how-search-works.php
http://www.ppcblog.com/how-google-works
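The two parenthetical figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 PB = 1,000 TB) and a drive height of 26.1 mm, a typical 3.5-inch drive; neither assumption is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60
METERS_PER_MILE = 1609.344

# Demand: 3 billion searches per day, expressed per second.
searches_per_day = 3_000_000_000
searches_per_second = searches_per_day / SECONDS_PER_DAY

# Data: a 100 PB index stored on 2 TB drives, stacked flat on top of each other.
index_pb = 100
drive_tb = 2
drive_height_mm = 26.1  # assumption: typical 3.5-inch drive height

drive_count = index_pb * 1000 // drive_tb  # assumption: 1 PB = 1,000 TB
stack_height_miles = drive_count * drive_height_mm / 1000 / METERS_PER_MILE

print(f"{searches_per_second:,.0f} searches per second")
print(f"{drive_count:,} drives, stack about {stack_height_miles:.1f} mile high")
```

Both results land close to the slide's rounded numbers (35K per second and 0.8 mile).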
How to Evaluate a Distributed System
• Well-known goals
  - Useful (solve your business need)
  - Performant (high throughput, low latency)
  - Elastic (you may add/remove nodes)
  - Scalable (adding nodes improves performance)
  - Fault tolerant (deal with failures)
• In addition, I’d advocate
  - Flexible (scaling, model, interface, architecture)
Shared Nothing → Shared Storage
For 30 years, DW were shared nothing:
Gamma, Teradata, Netezza, Vertica, DB2/PE, SQL Server PDW, Greenplum, Asterdata, SciDB
Now they are all shared storage:
Redshift Spectrum, Snowflake, Microsoft SQL DW, Google BigQuery
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
Why Shared Storage? Flexible Scaling!
[Figure callout: in minutes]
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
Case Study: Snowflake (flexible scaling)
[Architecture diagram, three layers]
• Cloud services: authentication & access control, query optimizer, transaction manager, infrastructure manager, security, metadata storage
• Compute layer: virtual warehouses, each a cluster of EC2 instances with a data cache (shown with 4, 2, and 8 nodes). These disks are strictly used as caches.
• Data storage: S3. Database tables stored here.
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
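The separation of compute from storage described in this case study can be sketched as a toy model. This is only an illustration of the shared-storage idea, not Snowflake's implementation; the class names, table name, and node counts are hypothetical.

```python
class SharedStorage:
    """Stand-in for the S3 data-storage layer: the only durable copy of tables."""

    def __init__(self):
        self._tables = {}
        self.reads = 0  # counts fetches from storage, to make cache hits visible

    def put(self, table, rows):
        self._tables[table] = list(rows)

    def get(self, table):
        self.reads += 1
        return list(self._tables[table])


class VirtualWarehouse:
    """Hypothetical compute cluster: nodes plus a local cache, no durable data."""

    def __init__(self, name, nodes, storage):
        self.name = name
        self.nodes = nodes
        self._storage = storage
        self._cache = {}

    def scan(self, table):
        # Local disks act strictly as a cache in front of shared storage.
        if table not in self._cache:
            self._cache[table] = self._storage.get(table)
        return self._cache[table]

    def resize(self, nodes):
        # Scaling touches compute only; no table data is repartitioned.
        self.nodes = nodes


storage = SharedStorage()
storage.put("orders", [10, 20, 30])

etl = VirtualWarehouse("etl", nodes=2, storage=storage)
reporting = VirtualWarehouse("reporting", nodes=8, storage=storage)

total_etl = sum(etl.scan("orders"))
total_reporting = sum(reporting.scan("orders"))
etl.scan("orders")  # second scan is served from the local cache
etl.resize(4)       # scale up without moving any table data
del reporting       # dropping a warehouse loses no data

adhoc = VirtualWarehouse("adhoc", nodes=1, storage=storage)
total_adhoc = sum(adhoc.scan("orders"))
```

In a shared-nothing design, by contrast, each node owns a partition of the tables, so adding or removing nodes implies redistributing data.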
Case Study: Spark
[Stack diagram]
SparkSQL | ML | Streaming | GraphX
Spark Core
RDD API | DataFrame API
Standalone | YARN | MESOS | Local
Java/Scala/Python/R shell/script
Case Study: Spark (Flexible Model)
• Not only SQL, but also ML, streaming, graph.
Case Study: Spark (Flexible Interface)
• You could access Spark using traditional JDBC.
• Also, interactive session (in multiple languages).
• Also, submit a script as a task.
Case Study: Spark (Flexible Architecture)
• May deploy on top of existing YARN or MESOS.
• Could also be standalone.
• Possible to add components.
How to Evaluate a Distributed System
• In addition to well-known goals
  - Useful, Performant, Elastic, Scalable, Fault tolerant
• I’d advocate
  - Flexible (scaling, model, interface, architecture)
Lessons
• Flexibility is an important metric.
• Spark is a flexible system.
• Cloud DW: shared storage.
Growing Need for Big Data Jobs
[Job-trend chart callout: 10X in 5 years]
Source: https://www.indeed.com/jobtrends
Big Data Roles
• Chief Data Officer
• Data Scientist
• Data Engineer
• Solutions Architect
• Big Data Strategist
• ... at least 15 more
Source: “Top 20 Big Data jobs and their responsibilities”.
http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
If You Want to Do Analytics
• Python
  - NumPy, Jupyter Notebook
• Machine Learning
  - scikit-learn
• Practice at http://DrivenData.org
If You Want to Do Big Data Platform
• Only for senior engineers
• Practice at http://LeetCode.com
• Embrace open source
• Assemble a solution; don’t build from scratch
• Consulting business: target medium-sized companies
If You Want to Build A Startup
• Read some books about building a startup
• Don’t assume you know users’ pain points
• Throw away prototype code
• Three key people must have a good working relationship: What-To-Do, How-To-Do, and When-To-Do
• When in doubt, keep it simple
• Strive for a clean API (external and internal)
• Do one thing really well first
Stonebraker’s Startup Loop
while (true)
{
1. Talk with users to find their pain;
2. Brainstorm with professors;
3. Recruit students to build a prototype;
4. Draw a quadrant; // e.g. axes Small/Big and Simple/Complex
5. Co-found a VC-backed startup; // e.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …
6. Play banjo; write papers; give talks; receive awards; // e.g. received ACM Turing Award 2014
}
Conclusions
• All “biggies” have a big-data platform
• Shared nothing → shared storage
• Leverage open source: pick/compose/expand
• Flexibility is a key metric for distributed systems

Big Data Platform Landscape by 2017

  • 2. Your Background
      ▪ Familiar with big-data analytics? Value = show you what’s “under the hood”.
      ▪ Familiar with big-data platform? Mostly review; Value = think about my opinions.
      ▪ Just curious? Value = general awareness.
      ▪ Not interested in big data? You are in the wrong room.
    http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
  • 3. Disclaimer
      ▪ The opinions expressed on this site are mine and do not necessarily represent those of my employer.
      ▪ BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook.
  • 4. Content: Why; What; History; Technical How-Tos; Career Advice; Conclusions
  • 5. Why Big Data? Data Grows Fast
      ▪ Data in the world: 10 billion TB
      ▪ 90% was produced in the last 2 years!
    Source: Mikal Khoso. “How Much Data is Produced Every Day?” http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day
  • 6. Why Big-Data Platform?
      ▪ Platform can be a competitive advantage.
      ▪ Enable junior developers to quickly create robust applications.
      ▪ Google thinks of itself as a systems engineering company.
    Quote source: Todd Hoff. “Google Architecture”. http://highscalability.com/google-architecture
  • 7. All biggies have big data platforms
    [Bar chart: market cap in billion $, by company and year founded. Data source: Yahoo Finance on 1/3/2017.]
    IBM (1911): 159; Samsung (1938): 208; Intel (1968): 174; SAP (1972): 106; Microsoft (1975): 504; Apple (1976): 616; Oracle (1977): 156; Amazon (1994): 357; Google (1998): 547; Tencent (1998): 234; Alibaba (1999): 222; Facebook (2004): 338
  • 8. Same chart, annotated: top 3 cloud service providers
  • 9. Same chart, annotated: Larry Ellison: “Amazon’s lead is over”
  • 10. Same chart, annotated: Apple “Pie”
  • 11. Same chart, annotated: Samsung bought Joyent
  • 12. Same chart, annotated with Gray sort results (see http://sortbenchmark.org):
      ▪ Alibaba 2015: 377 sec (3,377 nodes, Apsara)
      ▪ Tencent 2016: 134 sec (512 nodes, OpenPower)
  • 13. Content: Why; What; History; Technical How-Tos; Career Advice; Conclusions
  • 14. What is Big Data?
      ▪ Big data sets
          ▪ e.g. “This year our users uploaded 10X more videos; we have big data now.”
          ▪ big volume, big variety, or big velocity
          ▪ exceed existing data processing capabilities
      ▪ Big data analytics
          ▪ e.g. “We use big data to predict stock trends.”
      ▪ Big data stack
          ▪ software
          ▪ platform
          ▪ infrastructure
  • 15. The Big Data Stack
      ▪ Infrastructure: Think IaaS such as AWS EC2. Networked VMs.
      ▪ Platform: Think PaaS such as Google App Engine. A platform for developing software.
      ▪ Analytics Software: Think SaaS such as Microsoft Office 365. Software that Data Scientists can use.
      ▪ Analytics: Reports, docs, ad hoc scripts...
  • 16. Google Stack
      ▪ Infrastructure: Custom-built machines; RedHat Linux
      ▪ Platform: GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby, Borg/Omega
      ▪ Products: search, advertising, gmail, docs, maps, youtube, cloud platform, …
  • 17. Sample Open-Source Stack
      ▪ Infrastructure: VMs
      ▪ Platform: Spark on YARN with Hive
      ▪ Analytics Software: Tableau, scikit-learn
      ▪ Analytics: Python scripts
  • 18. 5 V’s of Big Data
      ▪ Volume
      ▪ Variety
      ▪ Velocity
      ▪ Veracity
      ▪ Value
    5 V’s source: Jason Williamson. “The 4 V’s of Big Data”. http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data
  • 19. 5 V’s of Big Data: Volume
      ▪ “Your small data can be my big data!”
  • 20. 5 V’s of Big Data: Variety
      ▪ RDBMS guru: “Put your data in a database!”
      ▪ Scientist: “My data is not relational.”
      ▪ RDBMS guru: “Make your data relational!”
      ▪ Scientist: “But it is not relational!”
    Lesson: A key feature missing in RDBMS is variety.
  • 21. 5 V’s of Big Data: Velocity
      ▪ Streaming.
      ▪ ETL → ELT: Load first, transform later.
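The "load first, transform later" idea can be sketched in a few lines. This is a minimal illustration and not taken from the talk: the event records, the `raw_events` table, and both function names are hypothetical, and an in-memory SQLite database stands in for whatever store a real platform would use. The point is only that raw records are stored untouched, and parsing happens when a question is asked.

```python
import json
import sqlite3

# Hypothetical raw events; in practice these might arrive from a stream or log files.
RAW_EVENTS = [
    '{"user": "alice", "action": "upload", "bytes": 1200}',
    '{"user": "bob", "action": "view"}',
    '{"user": "alice", "action": "upload", "bytes": 800}',
]


def load_raw(conn, lines):
    """ELT step 1 (Load): store each record as-is, with no upfront schema."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?)", [(line,) for line in lines])


def upload_bytes_per_user(conn):
    """ELT step 2 (Transform): parse and aggregate only when a question is asked."""
    totals = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("action") == "upload":
            user = event["user"]
            totals[user] = totals.get(user, 0) + event.get("bytes", 0)
    return totals


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_EVENTS)
totals = upload_bytes_per_user(conn)
print(totals)
```

Because the raw payloads are kept, a later question (say, counting views per user) needs only a new transform function, not a reload.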
  • 22. 5 V’s of Big Data: Value
      ▪ “If you are going to use a big data system for yourself, see if it is faster than your laptop.” (Frank McSherry)
    Lessons
      ▪ Do big data to increase business value, not for the sake of tech.
      ▪ Read a book on building a startup.
    Source: Frank McSherry. “Scalability! But at what COST?” http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
  • 23. 5 V’s of Big Data: Veracity
      ▪ Data scientist: “My analysis suggested this billion-dollar action.”
      ▪ Manager: “Where was the data from?”
    Lessons
      ▪ Use Data Lakes, not Data Swamps.
      ▪ Read Russom’s “Best Practices for Data Lake Management”.
    Source: Philip Russom. “Best Practices for Data Lake Management”. https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
  • 24. Content: Why; What; History; Technical How-Tos; Career Advice; Conclusions
  • 25. Big Data History
      ▪ “What goes around comes around.” (Mike Stonebraker)
      ▪ “Everything has prior art.” (David DeWitt)
  • 26. Big Data History
      ▪ 1969: relational model (Edgar F. Codd*)
      ▪ 1976: System R by IBM (Jim Gray*; transactions)
      ▪ 1986: Postgres (Mike Stonebraker*; ADT)
      ▪ 1990: Gamma (David DeWitt; shared nothing)
      ▪ 2004: MapReduce (Jeff Dean; flexibility)
      ▪ 2005: “One size doesn’t fit all” (Mike Stonebraker)
      ▪ 2006: Hadoop (Doug Cutting)
      ▪ 2011: Spark (Matei Zaharia)
      ▪ 2017: Death of shared nothing (David DeWitt)
    * Turing Award winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
  • 27. Big Data History
    Lessons
      ▪ Don’t reinvent the wheel.
      ▪ Read the editors’ intro for “the red book”.
      ▪ Read "Architecture of a Database System".
      ▪ Study favorite posts on HighScalability.
    The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed. http://www.redbook.io
    HighScalability: http://highscalability.com/all-time-favorites
  • 28. Content: Why; What; History; Technical How-Tos; Career Advice; Conclusions
  • 29. How to Scale to Many Servers?
      ▪ When your data is small
    [Diagram: clients, server]
  • 30. How to Scale to Many Servers?
      ▪ Use a load balancer
    [Diagram: clients, LB, servers]
  • 31. How to Scale to Many Servers?
      ▪ Round-Robin DNS, Point of Presence, multi-level LB.
    [Diagram: clients, POPs, LB, servers]
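The round-robin and multi-level ideas on these slides can be sketched as follows. This is a toy model, not how any particular load balancer or DNS service is implemented: the `RoundRobin` class and the server names are hypothetical, and a real system would also handle health checks, weights, and client locality.

```python
from itertools import cycle


class RoundRobin:
    """Hands out backends in rotation; a backend may itself be a RoundRobin (multi-level)."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("need at least one backend")
        self._ring = cycle(backends)

    def pick(self):
        backend = next(self._ring)
        # Delegate to the next level if the chosen backend is another balancer.
        return backend.pick() if isinstance(backend, RoundRobin) else backend


# Hypothetical two-level setup: a front tier rotating over two second-level balancers.
pop_east = RoundRobin(["east-1", "east-2"])
pop_west = RoundRobin(["west-1"])
front = RoundRobin([pop_east, pop_west])

assigned = [front.pick() for _ in range(4)]
print(assigned)
```

Nesting balancers this way keeps each level small: the front tier only needs to know about the second-level balancers, not about every server.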
  • 32. Google Cluster at the Beginning
    Image source: Abhijeet Desai. "Google Cluster Architecture". http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
  • 33. Google Belgium Data Center
    Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
  • 34. Google Belgium Data Center
    Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
  • 35. Google Data Centers
      ▪ About 40 data centers
      ▪ About 2 million machines
      ▪ Machines are organized in containers each having 1,160 machines
          ▪ 30 racks of 40 machines
          ▪ Sometimes double stacked
    Data sources: James Pearn, “How many servers does Google have?” https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY; “Learn How Google Works: in Gory Detail”. http://www.ppcblog.com/how-google-works
  • 36. Google Data Size
      ▪ Data too large
          ▪ 130 trillion pages
          ▪ Index 100 PB (stacking 2TB drives up: 0.8 mile)
      ▪ Demand too much
          ▪ 3 billion searches per day (or 35K per second)
    Data sources: https://www.google.com/insidesearch/howsearchworks/thestory; http://www.seobook.com/learn-seo/infographics/how-search-works.php; http://www.ppcblog.com/how-google-works
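The two back-of-the-envelope figures on this slide can be checked with a short calculation. The searches-per-day and index-size inputs come from the slide; the drive height (26.1 mm, a common height for a 3.5-inch drive) and the use of decimal units (1 PB = 10^15 bytes) are assumptions made here for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Demand: 3 billion searches per day, expressed per second.
searches_per_day = 3_000_000_000
searches_per_second = searches_per_day / SECONDS_PER_DAY

# Size: how many 2 TB drives hold a 100 PB index (decimal units assumed).
index_bytes = 100 * 10**15
drive_bytes = 2 * 10**12
drive_count = index_bytes // drive_bytes

# Assumption: each drive is 26.1 mm tall when laid flat and stacked.
DRIVE_HEIGHT_M = 0.0261
METERS_PER_MILE = 1609.344
stack_miles = drive_count * DRIVE_HEIGHT_M / METERS_PER_MILE

print(f"{searches_per_second:,.0f} searches per second")
print(f"{drive_count:,} drives, stacked about {stack_miles:.2f} miles high")
```

Under these assumptions the results agree with the slide's rounded figures of 35K searches per second and a 0.8-mile stack.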
  • 37. How to Evaluate a Distributed System
      ▪ Well-known goals
          ▪ Useful (solve your business need)
          ▪ Performant (high throughput, low latency)
          ▪ Elastic (you may add/remove nodes)
          ▪ Scalable (adding nodes improves performance)
          ▪ Fault tolerant (deal with failures)
      ▪ In addition, I’d advocate
          ▪ Flexible (scaling, model, interface, architecture)
  • 38. Shared Nothing → Shared Storage
      ▪ For 30 years, data warehouses (DW) were shared nothing. Now they are all shared storage.
      ▪ Systems named on the slide: Gamma Teradata Netezza Vertica DB2/PE SQL Server PDW Greenplum Asterdata SciDB Redshift Spectrum Snowflake Microsoft SQL DW Google BigQuery
    Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
  • 39. Why Shared Storage? Flexible Scaling!
    [Figure annotation: in minutes]
    Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
  • 40. Case Study: Snowflake (flexible scaling)
      ▪ Cloud services: authentication & access control, query optimizer, transaction manager, infrastructure manager, security, metadata storage
      ▪ Compute layer: virtual warehouses, each a cluster of EC2 instances with a data cache (shown with 4, 2, and 8 nodes). These disks are strictly used as caches.
      ▪ Data storage: S3. Database tables stored here.
    Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
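A toy model may help show why separating storage from compute makes scaling flexible. This is only a sketch of the idea on the slide, not Snowflake's implementation: the `SharedStorage` and `VirtualWarehouse` classes, the `sales` table, and the way work is split across nodes are all hypothetical.

```python
class SharedStorage:
    """Stands in for S3: the single place where table data lives."""

    def __init__(self):
        self._tables = {}

    def write(self, name, rows):
        self._tables[name] = list(rows)

    def read(self, name):
        return list(self._tables[name])


class VirtualWarehouse:
    """A compute cluster that owns no data; its local copy is only a cache."""

    def __init__(self, storage, nodes):
        self.storage = storage
        self.nodes = nodes
        self.cache = {}

    def resize(self, nodes):
        # Only the node count changes; no table data has to be repartitioned.
        self.nodes = nodes

    def scan_sum(self, table):
        if table not in self.cache:
            self.cache[table] = self.storage.read(table)
        rows = self.cache[table]
        # Each node sums its own slice; the partial sums are then combined.
        partials = [sum(rows[i::self.nodes]) for i in range(self.nodes)]
        return sum(partials)


storage = SharedStorage()
storage.write("sales", range(1, 101))

small = VirtualWarehouse(storage, nodes=2)
large = VirtualWarehouse(storage, nodes=8)
results = (small.scan_sum("sales"), large.scan_sum("sales"))

small.resize(4)
after_resize = small.scan_sum("sales")
print(results, after_resize)
```

In this model, adding a warehouse or resizing one touches only compute objects. In a shared-nothing design the table rows themselves would live on the nodes, so changing the node count would mean moving data.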
  • 41. Case Study: Spark
    [Diagram components]
      ▪ Access: Java/Scala/Python/R shell/script
      ▪ Libraries: SparkSQL, ML, Streaming, GraphX
      ▪ Spark Core: RDD API, DataFrame API
      ▪ Deployment: Standalone, YARN, MESOS, Local
  • 42. Case Study: Spark (Flexible Model)
      ▪ Not only SQL, but also ML, streaming, graph.
  • 43. Case Study: Spark (Flexible Interface)
      ▪ You could access Spark using traditional JDBC.
      ▪ Also, interactive session (in multiple languages).
      ▪ Also, submit a script as a task.
  • 44. Case Study: Spark (Flexible Architecture)
      ▪ May deploy on top of existing YARN or MESOS.
      ▪ Could also be standalone.
      ▪ Possible to add components.
  • 45. How to Evaluate a Distributed System
      ▪ In addition to well-known goals
          ▪ Useful, Performant, Elastic, Scalable, Fault tolerant
      ▪ I’d advocate
          ▪ Flexible (scaling, model, interface, architecture)
    Lessons
      ▪ Flexibility is an important metric.
      ▪ Spark is a flexible system.
      ▪ Cloud DW: shared storage.
  • 46. Content: Why; What; History; Technical How-Tos; Career Advice; Conclusions
  • 47. Growing Need for Big Data Jobs
      ▪ 10X in 5 years
    Source: https://www.indeed.com/jobtrends
  • 48. Big Data Roles
      ▪ Chief Data Officer
      ▪ Data Scientist
      ▪ Data Engineer
      ▪ Solutions Architect
      ▪ Big Data Strategist
      ▪ ...... at least 15 more
    Source: “Top 20 Big Data jobs and their responsibilities”. http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
  • 49. If You Want to Do Analytics
      ▪ Python
          ▪ Numpy, Jupyter Notebook
      ▪ Machine Learning
          ▪ Scikit-learn
      ▪ Practice at http://DrivenData.org
  • 50. If You Want to Do Big Data Platform
      ▪ Only for senior engineers
      ▪ Practice at http://LeetCode.com
      ▪ Embrace open source
      ▪ Assemble a solution; don’t build from scratch
      ▪ Consulting business: target medium-sized companies
  • 51. If You Want to Build A Startup
      ▪ Read some books about building a startup
      ▪ Don’t assume you know users’ pain points
      ▪ Throw away prototype code
      ▪ Three key people must have a good working relationship: What-To-Do, How-To-Do, and When-To-Do
      ▪ When in doubt, keep it simple
      ▪ Strive for a clean API (external and internal)
      ▪ Do one thing really well first
  • 52. Stonebraker’s Startup Loop
    while (true) {
      1. Talk with users to find their pain;
      2. Brainstorm with professors;
      3. Recruit students to build a prototype;
      4. Draw a quadrant; (e.g. axes Small/Big and Simple/Complex)
      5. Co-found a VC-backed startup; (e.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …)
      6. Play banjo; write papers; give talks; receive awards; (e.g. received ACM Turing Award 2014)
    }
  • 53. Content: Why; What; History; Technical How-Tos; Career Advice; Conclusions
  • 54. Conclusions
      ▪ All “biggies” have big-data platforms
      ▪ Shared nothing → shared storage
      ▪ Leverage open source: pick/compose/expand
      ▪ Flexibility is a key metric for distributed systems