DATA LAKES ON AWS: GOOD, FAST, AND INEXPENSIVE


This document discusses using data lakes on AWS for storing and analyzing large amounts of data. It describes how AWS services like S3, Athena, and QuickSight can be used to create a scalable, cost-effective data lake that solves problems of on-premise data warehouses like limited storage, slow hardware, and high costs. The document provides demonstrations of building a simple data lake and visualizing data, and argues that AWS data lakes allow organizations to focus on outcomes rather than infrastructure management.

WELLINGTON AWS USER GROUP
Photo: Frank Kovalchek, http://www.flickr.com/people/72213316@N00

WHAT IS A DATA LAKE?
▸ A network file share full of spreadsheets is a (bad) data lake
▸ Focused on making it easy to collect large amounts of data
▸ A place to store data in its natural format for future analysis
▸ Instead of Big Design Up Front (BDUF), shifts governance
right in order to remove barriers and empower users.
▸ Accepts in principle that slightly inefficient compute costs
are worth paying to make data scientists more productive.

2 MINUTE DATA LAKE
DEMONSTRATION
Photo: Tim Evanson, https://www.flickr.com/photos/timevanson/

DATA WAREHOUSE PROBLEMS SOLVED BY S3
▸ Dropbox’s distributed storage system,
as discussed on IEEE Software Engineering Radio
(masses of people, enormous capital, long timeframe)
▸ Running out of space, capacity planning
▸ Slow hardware, unable to drink from the firehose
▸ Significant developer cost and delay
before data can be analyzed to determine if it is valuable.

DATA LAKE PROBLEMS SOLVED BY ATHENA
▸ Elastic scalability
▸ High Availability
▸ Coupling storage to compute (HDFS)
▸ Hosting and admin cost of running EMR clusters
▸ No need to run your own data dictionary (Hive metastore)
and keep it highly available across cluster outages.
▸ No need to run your own security (Apache Ranger)
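
To illustrate the points about not running your own cluster or metastore: in Athena, a table is a schema declared over files that already sit in S3. The sketch below, in Python, only builds the DDL text and does not call AWS. The database, table, column names, and bucket are hypothetical, and the CSV layout (comma-delimited, one header row) is an assumption made for the example.

```python
def athena_csv_table_ddl(database, table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for CSV files stored in S3.

    `columns` is a list of (name, sql_type) pairs. All names passed in are
    illustrative; nothing here is taken from the original demonstration.
    """
    # Athena expects LOCATION to point at a folder-like prefix ending in "/".
    if not s3_location.endswith("/"):
        s3_location += "/"

    column_lines = ",\n".join(
        f"  `{name}` {sql_type}" for name, sql_type in columns
    )

    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        # Assumption: each CSV file starts with a single header row.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


if __name__ == "__main__":
    print(
        athena_csv_table_ddl(
            "demo_lake",                      # hypothetical database
            "sales",                          # hypothetical table
            [("order_id", "string"), ("amount", "double")],
            "s3://example-bucket/raw/sales",  # hypothetical bucket and prefix
        )
    )
```

The printed statement can be pasted into the Athena query editor; the data itself stays in S3 in its original format, and the table can be dropped and redefined without touching the files.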

BUSINESS INTELLIGENCE PROBLEMS SOLVED BY QUICKSIGHT
▸ Performance at scale
▸ High Availability
▸ Hosting and admin cost of running servers

COMPETITORS
▸ Azure has similar offerings
▸ Power BI is good
▸ Azure Data Lake Analytics differences:
  ▸ Not elastic
  ▸ No optimized storage formats: ORC or Parquet
  ▸ Uses an HDFS service, not Blob storage

VISUALISATION
DEMONSTRATION
Photo: Geo Swan, https://commons.wikimedia.org/wiki/User:Geo_Swan

UNEVEN COMPARISONS
VS
▸ On-premises performance will start slower and scale
poorly
▸ AWS Enterprise support vs ticket logging
▸ High availability, disaster recovery, and backup costs
included
▸ On-premises costs escalate rapidly with scale.
~$1,000,000,000 per petabyte every year
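
As a rough way to put a per-petabyte yearly figure in context, the sketch below works out what a year of object storage costs at a given price per GB-month. The price used (0.023 USD per GB-month) is an assumption for illustration, not a quote; check current S3 pricing for your region and storage class. The function and constant names are hypothetical, and the calculation covers storage only (no request, transfer, or query charges).

```python
GB_PER_PB = 1_000_000  # decimal petabyte: 1 PB = 1,000,000 GB


def annual_storage_cost_usd(petabytes, usd_per_gb_month):
    """Yearly storage cost for a constant volume at a flat GB-month price."""
    return petabytes * GB_PER_PB * usd_per_gb_month * 12


if __name__ == "__main__":
    assumed_price = 0.023  # assumption: illustrative USD per GB-month
    cost = annual_storage_cost_usd(1, assumed_price)
    print(f"1 PB for a year at ${assumed_price}/GB-month: ${cost:,.0f}")
```

Keeping the price as a parameter makes it easy to rerun the same arithmetic with tiered or infrequent-access pricing, or with an on-premises cost estimate expressed in the same unit.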

TRUE COSTS OF SERVERS
▸ Servers aren’t being patched
▸ Servers aren’t natively Highly Available
▸ Server backups need to be configured, and can be
misconfigured
▸ Server configuration slows down development
▸ Server performance suffers before scaling
Photo: Micheal Filion, https://www.flickr.com/photos/mike9alive/

THE “PROJECT MANAGEMENT TRIANGLE”
Photo: Kevin Lim, https://www.flickr.com/photos/inju/

CULTURE, AUTOMATION, LEAN, MEASUREMENT
© BrokenSphere / Wikimedia Commons
▸ Not tool specialists - can focus elsewhere
▸ Tool “automates” the hard part of the task
▸ Tool only does the part of the job that has value
▸ Transparency - everyone can see the results

NO ONE WANTS A DRILL
▸ This presentation is about tools, but people want outcomes.
▸ Knowing your tools is good;
making them the focus of your work is wrong.
▸ Providing value with a data lake is about asking the important
questions, and answering those questions accurately.
▸ I strongly recommend asking the correct question over using
the correct tool.
▸ Thinking with Data by Max Shron
Photo: United States Marine Corps.

STEVEN ENSSLEN - AUTOMATION FOR BUSINESS INTELLIGENCE
▸ AWS Certified Solutions Architect - Professional
▸ Big data and business intelligence consulting
▸ http://stevenensslen.com
▸ steven@stevenensslen.com
