Big data is one of the most popular terms in the IT industry during the past decade. The word is vague and broad enough that essentially every one of us is living in a big-data world. Every time you do a google search, like a post in Facebook, write something in WeChat or view some item on Amazon, you both use and contribute to someone's big data system. Managing so much data across many computers introduce unique challenges. In this talk, we review the landscape of big data platforms and discuss some lessons we learned from building them.
2. Your Background
Familiar with big-data analytics?
Value = show you what’s “under the hood”.
Familiar with big-data platform?
Mostly review; Value = think about my opinions.
Just curious?
Value = general awareness.
Not interested in big data?
You are in the wrong room.
http://BigAnalyticsPlatform.com 2(C) 2017 Donghui Zhang
3. Disclaimer
The opinions expressed on this site are mine and
do not necessarily represent those of my
employer.
BigAnalyticsPlatform.com is my personal blogging
site. I currently work at Facebook.
3http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
4. Content
Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 4(C) 2017 Donghui Zhang
5. Why Big Data? Data Grows Fast
Data in the world:
10 billion TB
90% was produced in
the last 2 years!
5
Source: Mikal Khoso. “How Much Data is Produced Every Day?”
http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
6. Why Big-Data Platform?
Platform can be a competitive advantage.
Enable junior developers to quickly create robust
applications.
Google thinks of itself as a systems engineering
company.
6
Quote source: Todd Hoff. “Google Architecture”.
http://highscalability.com/google-architecture
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
7. 7
Data source: Yahoo Finance on 1/3/2017.
159
208 174
106
504
616
156
357
547
234 222
338
0
100
200
300
400
500
600
700
IBM
Samsung
Intel
SAP
Microsoft
Apple
Oracle
Amazon
Google
Tencent
Alibaba
Facebook
1911 193819681972 19751976197719941998199819992004
Marketcap(billion$)
Company + year founded
All biggies have big data platforms
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
8. 8
Data source: Yahoo Finance on 1/3/2017.
159
208 174
106
504
616
156
357
547
234 222
338
0
100
200
300
400
500
600
700
IBM
Samsung
Intel
SAP
Microsoft
Apple
Oracle
Amazon
Google
Tencent
Alibaba
Facebook
1911 193819681972 19751976197719941998199819992004
Marketcap(billion$)
Company + year founded
All biggies have big data platforms
top 3 cloud service
providers
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
9. 9
Data source: Yahoo Finance on 1/3/2017.
159
208 174
106
504
616
156
357
547
234 222
338
0
100
200
300
400
500
600
700
IBM
Samsung
Intel
SAP
Microsoft
Apple
Oracle
Amazon
Google
Tencent
Alibaba
Facebook
1911 193819681972 19751976197719941998199819992004
Marketcap(billion$)
Company + year founded
All biggies have big data platforms
Larry Ellison:
“Amazon’s lead is
over”
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
10. 10
Data source: Yahoo Finance on 1/3/2017.
159
208 174
106
504
616
156
357
547
234 222
338
0
100
200
300
400
500
600
700
IBM
Samsung
Intel
SAP
Microsoft
Apple
Oracle
Amazon
Google
Tencent
Alibaba
Facebook
1911 193819681972 19751976197719941998199819992004
Marketcap(billion$)
Company + year founded
All biggies have big data platforms
Apple “Pie”
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
11. 11
Data source: Yahoo Finance on 1/3/2017.
159
208 174
106
504
616
156
357
547
234 222
338
0
100
200
300
400
500
600
700
IBM
Samsung
Intel
SAP
Microsoft
Apple
Oracle
Amazon
Google
Tencent
Alibaba
Facebook
1911 193819681972 19751976197719941998199819992004
Marketcap(billion$)
Company + year founded
All biggies have big data platforms
Samsung bought
Joyant
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
12. 12
Data source: Yahoo Finance on 1/3/2017.
159
208 174
106
504
616
156
357
547
234 222
338
0
100
200
300
400
500
600
700
IBM
Samsung
Intel
SAP
Microsoft
Apple
Oracle
Amazon
Google
Tencent
Alibaba
Facebook
1911 193819681972 19751976197719941998199819992004
Marketcap(billion$)
Company + year founded
All biggies have big data platforms
Alibaba 2015: 377 sec (3,377 nodes Apsara)
Tencent 2016: 134 sec (512 nodes OpenPower)
Gray sort. See http://sortbenchmark.org
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
13. Content
Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 13(C) 2017 Donghui Zhang
14. What is Big Data?
Big data sets
e.g. “This year our users uploaded 10X more videos; we
have big data now.”
big volume, big variety, or big velocity
exceed existing data processing capabilities
Big data analytics
e.g. “We use big data to predict stock trends.”
Big data stack
software
platform
infrastructure
14http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
15. The Big Data Stack
15
Analytics
Infrastructure Think IaaS such as AWS EC2.
Networked VMs.
Platform Think PaaS such as Google App Engine.
A platform for developing software.
Analytics Software
Think SaaS such as Microsoft Office 365.
Software that Data Scientists can use.
Reports, docs, ad hoc scripts...
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
18. 5 V’s of Big Data
Volume
Variety
Velocity
Veracity
Value
18
5V’s source: Jason Williamson. “The 4 V’s of Big Data”.
http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
19. 5 V’s of Big Data
Volume
Variety
Velocity
Value
Veracity
19
“Your small data can be my big data!”
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
20. 5 V’s of Big Data
Volume
Variety
Velocity
Value
Veracity
20
Lessons
• A key feature missing in RDBMS is
variety.
RDBMS guru: “Put you data in a database!”
Scientist: “My data is not relational.”
RDBMS guru: “Make your data relational!”
Scientist: “But it is not relational!”
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
21. 5 V’s of Big Data
Volume
Variety
Velocity
Value
Veracity
21
Streaming.
ETL ELT: Load first, transform later.
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
22. 5 V’s of Big Data
Volume
Variety
Velocity
Value
Veracity
22
Lessons
• Do big data for increasing business
value, not for tech.
• Read a book on building a startup.
http://BigAnalyticsPlatform.com
Source: Frank McSherry. “Scalability! But at what COST?”
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
If you are going to use a big
data system for yourself,
see if it is faster than your
laptop.
Frank McSherry
(C) 2017 Donghui Zhang
23. 5 V’s of Big Data
Volume
Variety
Velocity
Value
Veracity
23
Source: Philip Russom. “Best Practices for Data Lake Management”.
https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
Lessons
• Use Data Lakes, not Data Swamps.
• Read Russom’s “Best Practices for Data
lake Management”.
Data scientist: “My analysis suggested
this billion-dollar action.”
Manager: “Where was the data from?”
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
24. Content
Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 24(C) 2017 Donghui Zhang
25. Big Data History
25
What goes around
comes around.
Mike Stonebraker
Everything has prior art.
David DeWitt
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
26. Big Data History
1969: relational model (Edgar F. Codd*)
1976: System R by IBM (Jim Gray*; transactions)
1986: Postgres (Mike Stonebraker*; ADT)
1990: Gamma (David DeWitt; shared nothing)
2004: MapReduce (Jeff Dean; flexibility)
2005: “One size doesn’t fit all” (Mike Stonebraker)
2006: Hadoop (Doug Cutting)
2011: Spark (Matei Zaharia)
2017: Death of shared nothing (David DeWitt)
26
* Turing Award Winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
27. Big Data History
27
Lessons
• Don’t reinvent the wheels.
• Read the editors’ intro for “the red book”.
• Read "Architecture of a Database System".
• Study favorite posts on HighScalability.
The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed.
http://www.redbook.io
HighScalability: http://highscalability.com/all-time-favorites
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
28. Content
Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 28(C) 2017 Donghui Zhang
29. How to Scale to Many Servers?
29
When your data is small
http://BigAnalyticsPlatform.com
clients
server
(C) 2017 Donghui Zhang
30. How to Scale to Many Servers?
30
Use a load balancer
http://BigAnalyticsPlatform.com
clients
LB
servers
(C) 2017 Donghui Zhang
31. How to Scale to Many Servers?
Round-Robin DNS, Point of Presence, multi-level LB.
http://BigAnalyticsPlatform.com 31
LB
clients
servers
POP
POP
POP
POP
POP
(C) 2017 Donghui Zhang
32. Image source: Abhijeet Desai. "Google Cluster Architecture".
http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
Google Cluster at the Beginning
32http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
33. 33
Google Belgium Data Center
Image source: Malte Schwarzkopf. "What does it take to make Google work at scale".
https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
34. 34
Image source: Malte Schwarzkopf. "What does it take to make Google work at scale".
https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
Google Belgium Data Center
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
35. Google Data Centers
About 40 data centers
About 2 million machines
Machines are organized in containers each having
1,160 machines
30 racks of 40 machines
Sometimes double stacked
35
Data sources:
James Pearn, “How many servers does Google have?”
https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY
“Learn How Google Works: in Gory Detail”.
http://www.ppcblog.com/how-google-works
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
36. Google Data Size
Data too large
130 trillion pages
Index 100 PB (stacking 2TB drives up: 0.8 mile)
Demand too much
3 billion searches per day (or 35K per second)
36
Data sources:
https://www.google.com/insidesearch/howsearchworks/thestory
http://www.seobook.com/learn-seo/infographics/how-search-works.php
http://www.ppcblog.com/how-google-works
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
37. How to Evaluate a Distributed System
Well-known goals
Useful (solve your business need)
Performant (high throughput, low latency)
Elastic (you may add/remove nodes)
Scalable (adding nodes improves performance)
Fault tolerant (deal with failures)
In addition, I’d advocate
Flexible (scaling, model, interface, architecture)
http://BigAnalyticsPlatform.com 37(C) 2017 Donghui Zhang
38. Shared Nothing Shared Storage
http://BigAnalyticsPlatform.com 38
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
For 30 years, DW were
shared nothing.
Now they are all
shared storage.
Gamma
Teradata
Netezza
Vertica
DB2/PE
SQL Server PDW
Greenplum
Asterdata
SciDB
Redshift Spectrum
Snowflake
Microsoft SQL DW
Google BigQuery
(C) 2017 Donghui Zhang
39. Why Shared Storage? Flexible Scaling!
http://BigAnalyticsPlatform.com 39
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
in minutes
(C) 2017 Donghui Zhang
40. Case Study: Snowflake (flexible scaling)
S3 DATA
STORAGE
COMPUTE
LAYER
VIRTUAL
WAREHOUSE
N
1
N
2
N
3
N
4
CLUSTER OF EC2 INSTANCES
DATA CACHE
VIRTUAL
WAREHOUSE
N
1
N
2
VIRTUAL
WAREHOUSE
N
1
N
2
N
3
N
4
N
5
N
6
N
7
N
8
CLOUD
SERVICES
AUTHENTICATION & ACCESS CONTROL
QUERY
OPTIMIZER
TRANSACTION
MANAGER
INFRASTRUCTURE
MANAGER
SECURITY
METADATA
STORAGE
Database tables stored here
These disks are strictly used as
caches
40
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
41. Case Study: Spark
http://BigAnalyticsPlatform.com 41
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
(C) 2017 Donghui Zhang
42. Case Study: Spark (Flexible Model)
http://BigAnalyticsPlatform.com 42
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
Not only SQL, but also ML, streaming, graph.
(C) 2017 Donghui Zhang
43. Case Study: Spark (Flexible Interface)
http://BigAnalyticsPlatform.com 43
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
You could access Spark using traditional JDBC.
Also, interactive session (in multiple languages).
Also, submit a script as a task.
(C) 2017 Donghui Zhang
44. Case Study: Spark (Flexible Architecture)
http://BigAnalyticsPlatform.com 44
SparkSQL ML Streaming GraphX
Spark Core
RDD API DataFrame API
Standalone YARN MESOS Local
Java/Scala/Python/R shell/script
May deploy on top of existing YARN or MESOS.
Could also be standalone.
Possible to add components.
(C) 2017 Donghui Zhang
45. How to Evaluate a Distributed System
http://BigAnalyticsPlatform.com 45
Lessons
• Flexibility is an important metric.
• Spark is a flexible system.
• Cloud DW: shared storage.
(C) 2017 Donghui Zhang
In addition to well-known goals
Useful, Performant, Elastic, Scalable, Fault tolerant
I’d advocate
Flexible (scaling, model, interface, architecture)
46. Content
Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 46(C) 2017 Donghui Zhang
47. Growing Need for Big Data Jobs
47
Source: https://www.indeed.com/jobtrends
10X in 5
years
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
48. Big Data Roles
Chief Data Officer
Data Scientist
Data Engineer
Solutions Architect
Big Data Strategist
...... at least 15 more
48
Source: “Top 20 Big Data jobs and their responsibilities”.
http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
49. If You Want to Do Analytics
Python
Numpy, Jupyter Notebook
Machine Learning
Scikit-learn
Practice at http://DrivenData.org
http://BigAnalyticsPlatform.com 49(C) 2017 Donghui Zhang
50. If You Want to Do Big Data Platform
Only for senior engineers
Practice at http://LeetCode.com
Embrace open source
Assemble a solution; don’t build from scratch
Consulting business: target medium-sized companies
http://BigAnalyticsPlatform.com 50(C) 2017 Donghui Zhang
51. If You Want to Build A Startup
Read some books about building a startup
Don’t assume you know users’ pain point
Throw away prototype code
Three key people must have good working relationship:
What-To-Do, How-To-Do, and When-To-Do
When in doubt, keep it simple
Strive for a clean API (external and internal)
Do one thing really well first
http://BigAnalyticsPlatform.com 51(C) 2017 Donghui Zhang
52. Stonebraker’s Startup Loop
while (true)
{
1. Talk with users to find their pain;
2. Brainstorm with professors;
3. Recruit students to build a prototype;
4. Draw a quadrant; E.g.
5. Co-found a VC-backed startup;
6. Play banjo; write papers; give talks; receive awards;
}
E.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …
E.g. Received ACM Turing Award 2014
52
Small Big
Simple
Complex
http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
53. Content
Why
What
History
Technical How-Tos
Career Advice
Conclusions
http://BigAnalyticsPlatform.com 53(C) 2017 Donghui Zhang
54. Conclusions
All “biggies” have big-data platform
Shared nothing shared storage
Leverage on open source: pick/compose/expand
Flexibility is a key metric for distributed systems
http://BigAnalyticsPlatform.com 54(C) 2017 Donghui Zhang