This document discusses using data lakes on AWS for storing and analyzing large amounts of data. It describes how AWS services like S3, Athena, and QuickSight can be used to create a scalable, cost-effective data lake that solves problems of on-premise data warehouses like limited storage, slow hardware, and high costs. The document provides demonstrations of building a simple data lake and visualizing data, and argues that AWS data lakes allow organizations to focus on outcomes rather than infrastructure management.
3. Photo: Frank Kovalchek, http://www.flickr.com/people/72213316@N00
WHAT IS A DATA LAKE?
▸ A network file share full of spreadsheets is a (bad) data lake
▸ Focused on making it easy to collect large amounts of data
▸ A place to store data in its natural format for future analysis
▸ Replaces Big Design Up Front (BDUF): governance shifts right,
to later in the process, to remove barriers and empower users.
▸ Accepts in principle that slightly inefficient compute costs
are worth paying to make data scientists more productive.
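One way to picture "store data in its natural format" is an object key that leaves the file untouched and records only its source and arrival date. The sketch below is a minimal illustration, not the presenter's layout: the `raw/` prefix, the `year=/month=/day=` folder naming, and the function name `raw_object_key` are all assumptions made for this example.

```python
from datetime import date
from pathlib import PurePosixPath


def raw_object_key(source: str, filename: str, received: date) -> str:
    """Build a storage key for an unmodified incoming file.

    The file keeps its original name and format; only the folder path
    records which system it came from and when it arrived.
    (The prefix and folder naming are illustrative assumptions.)
    """
    return str(
        PurePosixPath("raw")
        / source
        / f"year={received.year:04d}"
        / f"month={received.month:02d}"
        / f"day={received.day:02d}"
        / filename
    )


# Hypothetical source system and file name, used only for illustration.
print(raw_object_key("crm", "customers.csv", date(2017, 3, 9)))
```

Nothing about the file's schema is decided at load time here, which is the point of deferring design until someone actually needs to query the data.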
4. 2 MINUTE DATA LAKE
DEMONSTRATION
Photo: Tim Evanson, https://www.flickr.com/photos/timevanson/
5. DATA WAREHOUSE PROBLEMS SOLVED BY S3
▸ Dropbox’s distributed storage system
on IEEE Software Engineering Radio
(Masses of people, enormous capital, long timeframe)
▸ Running out of space, capacity planning
▸ Slow hardware, unable to drink from the firehose
▸ Significant developer cost and delay
before data can be analyzed to determine if it is valuable.
6. DATA LAKE PROBLEMS SOLVED BY ATHENA
▸ Elastic scalability
▸ High Availability
▸ Coupling storage to compute (HDFS)
▸ Hosting and admin cost of running EMR clusters
▸ No need to run your own data dictionary (Hive metastore)
and keep it highly available across cluster outages.
▸ No need to run your own security (Apache Ranger)
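To make the "no data dictionary to run" point concrete, here is a rough sketch of the kind of table definition Athena accepts over files already sitting in S3. It only builds the DDL text; it does not call AWS. The helper name `create_external_table_ddl`, the table and column names, and the bucket `example-data-lake` are hypothetical, and the sketch assumes comma-delimited files with a single header row.

```python
from typing import Iterable, Tuple


def create_external_table_ddl(
    table: str,
    columns: Iterable[Tuple[str, str]],
    location: str,
    delimiter: str = ",",
) -> str:
    """Return a CREATE EXTERNAL TABLE statement for delimited text files.

    The statement describes files that already exist at `location`;
    running it registers the schema and does not move or copy any data.
    """
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'\n"
        # Assumes each file starts with one header row to skip.
        f"TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical table, columns, and bucket, for illustration only.
ddl = create_external_table_ddl(
    table="crm_customers",
    columns=[("id", "int"), ("name", "string"), ("signup_date", "string")],
    location="s3://example-data-lake/raw/crm/",
)
print(ddl)
```

The resulting statement could be pasted into the Athena query editor; the schema lives in the managed catalog rather than in a metastore you host yourself.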
7. BUSINESS INTELLIGENCE PROBLEMS SOLVED BY QUICKSIGHT
▸ Performance at scale
▸ High Availability
▸ Hosting and admin cost of running servers
8. COMPETITORS
▸ Azure has similar offerings
▸ PowerBI is good
▸ Azure Data Lake Analytics differences:
▸ Not elastic
▸ No optimized storage: ORC or Parquet
▸ Uses HDFS service, not Blob store
10. UNEVEN COMPARISONS
▸ On premise performance will start slower and scale
poorly
▸ AWS Enterprise support vs ticket logging
▸ High availability, Disaster recovery, backup costs
included
▸ On premise costs escalate rapidly with scale.
~$1,000,000,000 per petabyte every year
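For the cloud side of the comparison, annual object storage cost is simple arithmetic on a per-gigabyte monthly rate. The sketch below is illustrative only: the rate of 0.023 USD per GB-month is an assumed placeholder rather than a quoted price, the function name is made up for this example, and request, transfer, and query charges are ignored.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def annual_storage_cost(petabytes: float, usd_per_gb_month: float) -> float:
    """Annual storage cost in USD for a flat per-GB monthly rate.

    Ignores tiered pricing, request fees, and data transfer charges.
    """
    return petabytes * GB_PER_PB * usd_per_gb_month * 12


# Assumed placeholder rate; check current pricing before relying on it.
ASSUMED_USD_PER_GB_MONTH = 0.023

cost = annual_storage_cost(1, ASSUMED_USD_PER_GB_MONTH)
print(f"${cost:,.0f} per petabyte per year")
```

Under that assumed rate, one petabyte comes to roughly $276,000 per year; substituting your own rate and volume gives a figure to set against on-premises estimates.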
11. TRUE COSTS OF SERVERS
▸ Servers aren’t being patched
▸ Servers aren’t natively Highly Available
▸ Server backups need to be configured, and can be
misconfigured
▸ Server configuration slows down development
▸ Server performance suffers before scaling
Photo: Micheal Filion, https://www.flickr.com/photos/mike9alive/
20. NO ONE WANTS A DRILL
▸ This presentation is about tools, but people want outcomes.
▸ Knowing your tools is good,
making them the focus of your work is wrong.
▸ Providing value with a data lake is about asking the important
questions, and answering those questions accurately.
▸ I strongly recommend asking the correct question over using
the correct tool.
▸ Thinking with Data by Max Shron
Photo: United States Marine Corps.
21. STEVEN ENSSLEN - AUTOMATION FOR BUSINESS INTELLIGENCE
▸ AWS Certified Solutions Architect - Professional
▸ Big data and business intelligence consulting
▸ http://stevenensslen.com
▸ steven@stevenensslen.com