2. About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1. Big Data Solution Types
2. Data Pipelines
3. ETL and Visualization
4. Bonus…(if time allows)
4. “What is the ACTUAL Cost of…”
✘ Saving all Data
✘ Using newer technologies
✘ Going beyond Relational
5. About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus…(if time allows)
7. Pattern 1
✘ Which type(s) of Big Data work best?
-- when to use Hadoop
-- when to use NoSQL, and which type (key-value, document, graph, etc.)
-- when to use Big Relational, and what type of workload for hot, warm or cold data
15. Hadoop is your LAST CHOICE
✘ Volume
✘ 10 TB or greater to start
✘ Growth of 25% YOY
✘ Where FROM
✘ Where TO
✘ Velocity and Variety
✘ Spark over HIVE
✘ Kafka and Samza
✘ Veracity
✘ Pay, train and hire team
✘ Top $$$ for talent
✘ IF you can find it
✘ WATCH OUT for Cloud Vendors who promise ‘easy access’
✘ Complexity of ecosystem
✘ Cloudera knows best
16. “When do I use…?”
✘ Hadoop
✘ NoSQL
✘ Big Relational
20. Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
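The Volume question above comes down to compound-growth arithmetic. A minimal sketch, assuming the 10 TB starting point and 25% year-over-year growth quoted on the Hadoop slide; the function name and the three-year horizon are hypothetical choices for illustration.

```python
def project_volume_tb(start_tb: float, yoy_growth: float, years: int) -> list[float]:
    """Projected volume in TB for year 0 through `years`, compounding once per year."""
    return [start_tb * (1 + yoy_growth) ** year for year in range(years + 1)]


# Assumed inputs: 10 TB today, growing 25% year over year.
projection = project_volume_tb(start_tb=10.0, yoy_growth=0.25, years=3)
for year, volume in enumerate(projection):
    print(f"year {year}: {volume:.1f} TB")
```

Running the same projection with your own starting volume and growth rate gives a quick check on whether a workload is likely to outgrow a relational store within the planning horizon.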
21. Open Source: Free vs. Not Free
✘ Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org
✘ Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com
NoSQL Example
23. NoSQL Applied
Log Files
• ???
Product Catalogs
• ???
Social Games
• ???
Social aggregators
• ???
Line-of-Business
• ???
24. NoSQL Applied
Log Files
• Columnstore
• HBase
Product Catalogs
• Key/Value
• Redis
Social Games
• Document
• MongoDB
Social aggregators
• Graph
• Neo4j
Line-of-Business
• RDBMS
• SQL Server
25. More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
27. How Best to Store your Data?
         Complexity   Scalability   Developer Cost
RDBMS    easy         medium        low
NoSQL    medium       big           high
Hadoop   hard         huge          very high
28. Real World Big Data -- When do I use what?
RDBMS    65%
NoSQL    30%
Hadoop   5%
29. “Do the Cloud Vendors Understand Big Data Realities?”
30. Cloud Big Data Vendors - Storage
AWS
✘ 5-10X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: Query as a Service
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: On-premise integration
42. Reasons to use Big Relational Cloud Services
Developers DevOps Cloud Vendors – AWS
Developers DevOps Cloud Vendors – GCP
43. Reasons to use Big Relational Cloud Services
Developers
Most know RDBMS query patterns
Many know basic administration
DevOps
Most know RDBMS administration
Many know basic RDBMS queries
Many know query optimization
Cloud Vendors - AWS
Aurora – RDBMS up to 64 TB
Redshift - $1k USD per TB per year
Rich partner ecosystem – ETL
Integration with AWS products
Developers
Most know coding language patterns to interact with RDBMS systems
DevOps
Familiar RDBMS security patterns
Familiar auditing
Partner tooling integration
Cloud Vendors - GCP
BigQuery – familiar SQL queries
No hassle streaming ingest
No hassle pay-as-you-go
Zero administration
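The per-terabyte rate quoted for Redshift makes a rough budget easy to sketch. The example below is illustrative only: it takes the slide's $1k USD per TB per year at face value (not current pricing) and reuses the 10 TB and 25% growth figures from the Hadoop slide; the function name is hypothetical.

```python
USD_PER_TB_YEAR = 1_000  # rate quoted on the slide; illustrative, not current pricing


def yearly_costs_usd(start_tb: float, yoy_growth: float, years: int,
                     usd_per_tb_year: float = USD_PER_TB_YEAR) -> list[float]:
    """Cost for each of `years` years, assuming volume compounds once per year."""
    return [start_tb * (1 + yoy_growth) ** year * usd_per_tb_year
            for year in range(years)]


# Assumed inputs: 10 TB in the first year, growing 25% year over year.
costs = yearly_costs_usd(start_tb=10, yoy_growth=0.25, years=3)
print([round(cost) for cost in costs], "total:", round(sum(costs)))
```

This covers the quoted rate only; ETL tooling, query charges and staff time are separate lines in a real budget.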
45. ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
48. Pattern 2
✘ How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds
49. Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
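The Veracity question, verification on ingest, can be illustrated with a small filter that rejects incomplete or duplicate records before they are loaded. This is a minimal sketch under stated assumptions: records arrive as Python dictionaries, and the field names (`id`, `sku`, `price`) and the function name are hypothetical.

```python
def verify_on_ingest(records, required_fields, key_field):
    """Split records into (accepted, rejected).

    A record is rejected when a required field is missing or empty,
    or when its key has already been seen in this batch.
    """
    seen_keys = set()
    accepted, rejected = [], []
    for record in records:
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        key = record.get(key_field)
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif key in seen_keys:
            rejected.append((record, "duplicate key"))
        else:
            seen_keys.add(key)
            accepted.append(record)
    return accepted, rejected


# Hypothetical sample batch.
sample = [
    {"id": 1, "sku": "A-100", "price": 9.99},
    {"id": 1, "sku": "A-100", "price": 9.99},  # duplicate id
    {"id": 2, "sku": "", "price": 4.50},       # empty sku
]
accepted, rejected = verify_on_ingest(sample, required_fields=("id", "sku"), key_field="id")
print(len(accepted), "accepted;", [reason for _, reason in rejected])
```

Keeping the rejected records together with a reason, rather than dropping them silently, supports the audit and monitoring concerns raised later in the pipeline phases.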
52. Pipeline Phases
Phase 0
Eval Current Data - Quality & Quantity
Phase 1
Get New Data - Free or Premium
Phase 2
Build MVP & Forecast volume and growth
Phase 3
Load test at scale
Phase 4
Deploy – secure, audit and monitor
53. Cloud Big Data Vendors - ETL
AWS
✘ 5X market share of next competitor
✘ Notable: Many, strong ETL Partners
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: DataFlow requires Java or Python developers
Azure
✘ Difficulty with scale
✘ Best tooling integration
✘ Notable: Nothing
54. How Best to Ingest and ETL your Data?
         Complexity   Scalability   Developer Cost
RDBMS    medium       medium        low
NoSQL    medium       big           high
Hadoop   hard         huge          very high
59. Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
60. Cloud Big Data Vendors - Streaming
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis Firehose
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Stream Analytics integration with other products
64. Cloud Offerings – Data and Pipelines
                              AWS                            Google                              Microsoft
Managed RDBMS                 RDS, Aurora                    Cloud SQL                           Azure SQL
Data Warehouse                Redshift                       BigQuery                            Azure SQL Data Warehouse
NoSQL buckets                 S3, Glacier                    Cloud Storage, Nearline             Azure Blobs, StorSimple
NoSQL Key-Value, Wide Column  DynamoDB                       Big Table, Cloud Datastore          Azure Tables
Streaming or ML               Kinesis, AWS Machine Learning  DataFlow, Google Machine Learning   StreamInsight, Azure ML
NoSQL Document, Graph         MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE        DocumentDB, Neo4j on Azure
Hadoop                        Elastic MapReduce              DataProc                            Data Lake, HDInsight
Cloud ETL                     Data Pipelines                 DataFlow                            Azure Data Pipeline
65. How Best to Stream your Data?
            Complexity       Scalability   Developer Cost
Batches     easy             medium        low
Windows     difficult        big           high
Real-time   very difficult   huge          high
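To make the "Windows" row concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) window count. It is a plain in-memory illustration, not any vendor's streaming API: the event tuples and the function name are hypothetical, and late or out-of-order data, which accounts for much of the difficulty in real systems, is not handled.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, payload) tuples.
    Returns a dict mapping each window's start time to its event count.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical events: (seconds since start, payload).
events = [(0, "a"), (12, "b"), (59, "c"), (60, "d"), (125, "e")]
window_counts = tumbling_window_counts(events, window_seconds=60)
print(window_counts)
```

A batch job would run the same aggregation once over a complete file, while a real-time system would emit results per event; windows sit between the two, which matches their middle position in the table.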
67. Designing Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
70. Pattern 3
✘ How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- How best to present data to clients: partner visualization products or roll your own
75. Cloud Big Data Vendors - Query
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful machine learning
Azure
✘ WATCH OUT – Cost!
✘ Notable: Developer Tooling
76. Query Languages
SQL
Everyone knows it
But how well do they know it?
NoSQL Vendor Language
Too many to list
How will you learn it?
Cypher
Query language for graph databases
The future?
ORM
Good, bad or horrible?
Again, how well do they know it?
HIVE
Shown in too many vendor demos
Really hard to make performant
Machine Learning Queries
SciPy, NumPy or Python
R Language
Julia Language
Many more…
78. How Best to Query your Data?
         Business Analytics   Predictive Analytics   Developer Cost
RDBMS
NoSQL
Hadoop
79. How Best to Query your Data?
         Business Analytics   Predictive Analytics   Developer Cost
RDBMS    easy                 medium                 low
NoSQL    hard                 very hard              very high
Hadoop   hard                 hard                   very high
80. Machine Learning aka Predictive Analytics
AWS
ML for developers
GUI-based
GCP
3 Flavors of ML
Python-based languages
Azure
ML for Data Scientists
R Language
82. Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories
Innovation in Data Visualization
Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data
85. Cloud Big Data Vendors - Visualization
AWS
✘ Most complete offering
✘ Notable: Partners & QuickSight
GCP
✘ BigQuery Partners
✘ Notable: New Dashboards
Azure
✘ Integrated
✘ Notable: Power BI
92. Cloud Big Data Vendors - IoT
AWS
✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: AWS IoT Rules
GCP
✘ Still in Beta
✘ Fastest player
✘ Requires top developers
✘ Notable: Weave
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Device Mgmt.