© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Américo de Paula
Solutions Architecture Manager - LATAM
Big Data and Analytics:
Innovating at the Speed of Light
https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
The Diminishing Value of Data
Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
• If you have the means to combine them
Traditional Data Warehousing
Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are
central repositories of integrated data from one or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could
range from annual and quarterly comparisons and trends to detailed daily sales analysis.
Velocity Volume
Variety
Change in nature of data
Big Data
When your data sets become so large and complex
you have to start innovating around how to
collect, store, process, analyze, and share them.
Analytics and Big Data
Technologies and techniques for
working productively with massive
amounts of data at any scale in
either batch or real-time.
The Industry Problem
Growth in Data
(mostly Unstructured)
& Analytics
Average Growth in
Traditional DW
Data
Average IT Budget
The Battle
VS.
Why Big Data?
Security threat detection
User Behavior Analysis
Smart Application (Machine Learning)
Business Intelligence
Fraud detection
Financial Modeling and Forecasting
Spending optimization
Real-time alerting
Get answers faster and ask questions that are not possible today.
The Cloud Was Built for Big Data
Big Data was Meant for the Cloud
Big Data | Cloud Computing
Variety, volume, and velocity requiring new tools | Variety of compute, storage, and networking options
Potentially massive datasets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of IT infrastructure deployment and usage
Frequently non-steady-state workloads with peaks and valleys | At its most efficient with highly variable workloads
Absolute performance not as critical as “time to results”; shared resources are a bottleneck | Parallel compute projects allow each workgroup to have more autonomy and get faster results
Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= the Cloud removes constraints
The AWS Approach
• Flexible - Use the best tool for the job
• Data structure, latency, throughput, access patterns
• Low Cost - Big data ≠ big cost
• Scalable – Data should be immutable (append-only)
• Batch/speed/serving layer
• Minimize Admin Overhead - Leverage AWS managed services
• No or very low admin
• Be Agile – Fail fast, test more, optimize Big Data at a lower cost
Starting small is powerful,
when you can scale up fast
Scaling up your analytics systems | With AWS | Traditional IT *
Get a new BI server | 20 minutes | Weeks to months
Upgrade your analytics server to the newest Intel processors and add 16GB memory | 15 minutes | Weeks
Add 500TB of storage | Instant | Weeks to months
Grow a DWH cluster from 8GB to 1PB | 1 hour | Several months
Build a 1024-node Hadoop cluster | 30 minutes | Unlikely
Roll out multi-region production environment | Hours | Months
* actual provisioning times in a well-organized IT division
AWS Big Data Platform
EMR EC2
Glacier
S3
Import Export
Kinesis
Direct Connect
Machine Learning
Redshift
DynamoDB
AWS Database
Migration Service
Collect Orchestrate Store Analyze
AWS Lambda
AWS IoT
AWS Data Pipeline
Amazon Kinesis
Analytics
Amazon
SNS
AWS Snowball
Amazon
SWF
Amazon Athena
Amazon
QuickSight
Amazon Aurora
AWS Glue
Optimal Combinations of Interoperable Services
Amazon Redshift Amazon Elastic
MapReduce
Data Warehouse Semi-structured
Amazon
Glacier
Amazon Simple
Storage Service
Data Storage Archive
Amazon
DynamoDB
Amazon
Machine
Learning
Amazon Kinesis
NoSQL Predictive Models Other Apps Streaming
Sample Reference Architecture: Data Lake
Kinesis Firehose
Athena
Query Service
Processing & Analytics
Real-time Batch
AI & Predictive
BI & Data Visualization
Transactional &
RDBMS
AWS Lambda
Apache Storm
on EMR
Apache Flink
on EMR
Spark Streaming
on EMR
Elasticsearch
Service
Kinesis Analytics,
Kinesis Streams
DynamoDB
NoSQL DB Relational Database
Aurora
EMR
Hadoop, Spark,
Presto
Redshift
Data Warehouse
Athena
Query Service
Amazon Lex
Speech
recognition
Amazon
Rekognition
Amazon Polly
Text to speech
Machine Learning
Predictive analytics
Kinesis Streams
& Firehose
Data Lakes with new tools
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
OLTP ERP CRM LOB
Data Warehouse
Business
Intelligence
Data Lake
Devices Web Sensors
Social
Catalog
Machine
Learning
DW
Queries
Big data
processing
Interactive Real-time
Who consumes Analytics
Business Users
For making strategic decisions
e.g. reports like YoY growth of
sales
Data Scientists
To identify models for futuristic
analytics of data
e.g. typically long running ad-
hoc queries on data
Developers
To clean and process application
data using Big Data Jobs
e.g. Processing incoming click
stream data
Consumers
Real time actions and
intelligence on
consumption. Ex. Spend
patterns in banks or
telecom
New
1.2TB/Day logs
30TB /Day data
250 Hadoop Jobs
75Billion transactions/Day
5 Petabytes of Data
A few AWS customers using Big Data / Data Lakes
25 PB Data Warehouse
on Amazon S3
> 1PB read each day
Netflix Uses S3 to Back its Various Clusters
S3
Why Amazon S3 for Big Data?
• Scalable
• Virtually Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Cost-Effective:
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Tiered storage (Standard, IA, Amazon Glacier) via life-cycle policy
• Flexible Access
• Direct access by big data frameworks (Spark, Hive, Presto)
• Shared access: Multiple (Spark, Hive, Presto) clusters can use the same data
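The tiered-storage bullet above can be illustrated with a small sketch. It builds a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts, moving objects to Standard-IA and then to Amazon Glacier. The function name, the `logs/` prefix, and the 30/365-day thresholds are assumptions chosen for illustration; they do not come from the slides.

```python
from typing import Any, Dict


def build_lifecycle_configuration(
    prefix: str, ia_after_days: int, glacier_after_days: int
) -> Dict[str, Any]:
    """Build a lifecycle configuration that tiers objects under `prefix`.

    Hypothetical helper: objects move to Standard-IA first, then to Glacier.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    rule_id = "tier-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Illustrative values only: tier log data after 30 days, archive after a year.
config = build_lifecycle_configuration("logs/", ia_after_days=30, glacier_after_days=365)
print(config["Rules"][0]["ID"])
```

The dictionary would be passed as the `LifecycleConfiguration` argument when applying it to a bucket; the sketch stops short of calling AWS so that it runs without credentials.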
NASDAQ lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands.
More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes.
Nasdaq technology is used to power more than 100 marketplaces in 50 countries.
Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds.
We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies.
• Nasdaq implements an S3 data lake + Redshift data warehouse
architecture
• Most recent two years of data is kept in the Redshift data warehouse
and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used to ad-hoc query data in S3
• Transitioned from an on-premises data warehouse to Amazon Redshift &
S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set (~1100
tables)
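The retention split described above (most recent two years in Redshift, two to five years in S3) can be sketched as a simple routing rule. This is a minimal illustration, not Nasdaq's implementation: the function name, the use of 365-day years, and the `"unspecified"` result for data older than five years (which the slides do not address) are all assumptions.

```python
from datetime import date


def storage_tier(record_date: date, today: date) -> str:
    """Return the tier a record would live in under the two-year / five-year split."""
    age_days = (today - record_date).days
    if age_days < 0:
        raise ValueError("record_date is in the future")
    if age_days <= 2 * 365:
        return "redshift"  # recent data: kept in the data warehouse
    if age_days <= 5 * 365:
        return "s3"  # older data: kept in the data lake, queried ad hoc
    return "unspecified"  # assumption: the slides do not cover data past five years


today = date(2017, 6, 1)
print(storage_tier(date(2017, 1, 1), today))
print(storage_tier(date(2014, 6, 1), today))
```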
Our vision is to be earth’s most customer-centric company;
to build a place where people can come to find and discover
anything they might want to buy online.
Amazon
Amazon Data Warehouse
Helps to Run the Amazon Business
• Most Comprehensive Set of Cleansed and Curated Business Data
• Feeds Many Downstream Systems and Processes
• Batch Processing, Reporting and Ad Hoc
• 500k+ Data Loads/Transformations Each Day
• 200k+ Queries/Extracts Each Day
• 20k+ Active Tables
• 10B++ Rows Loaded Daily
Our Data is Big!
• Core Data Set: 5+PB of Compressed Data (primarily limited by Legacy Technology)
• Total Storage (Multiple Systems): 35+ PB compressed
• Quote from Executive at Legacy DW Vendor:
• ~1000x Larger than any other DW Customer (from that Vendor)
Significant and Increasing Use of Redshift and EMR
• 1000’s of Redshift and EMR Systems, Range in size from:
• Individual Contributor - Project Based, to
• Running Multi-Billion Dollar Business inside Amazon
The Amazon Enterprise Data Warehouse
The Good!
• Who are we?
• Analytics on the “Marketplace”
• Analytics Spokes: Pricing, B2B, Seller Support, Lending …
• Business Scale:
• 235MM monthly CPU Minutes on Legacy ODW
• 2K upstream tables
• Users:
• Supports 170 teams
• 1000 users with 9527 profiles (Parameterized Queries)
• 20K unique job runs per month
• 2800 (800 TB) datasets
• BI Tool Users:
• 3000+ Users, 650 non-tech
• 600+ ”Dashboards”
• 100k’s of queries each month
Example of an Amazon DW “Customer” Team
“Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License
To Provide an analytic ecosystem that Scales with the
Amazon Business
To Leverage AWS Technologies and to help Improve these
technologies for all Amazon Customers
To Provide Choice and Options in New Analytic
Technologies
• Provide an SQL based solution
• Increasingly Focus on Enabling new analytic approaches
including Machine Learning and Programmatic Data
Analysis
• Enable both “Bring Your Own Cluster” and “Bring your
Own Query” Approaches
What is the Goal?
“Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)
Amazon EMR
(running Hive, Pig,
Spark, Presto, etc…)
Amazon DynamoDB
Amazon
Machine Learning
Amazon QuickSight
Amazon RDS
Amazon Elasticsearch
Service
Amazon Redshift Amazon Athena
Amazon SQS
Amazon Kinesis
Analytics
Amazon Kinesis
Firehose
Amazon S3
Amazon Kinesis
Open-source tools
(e.g. for ML, data science)
Commercial tools
Moving Forward - AWS
Amazon
Redshift
S3 / EDX - Separate
Storage from Compute by
leveraging a parallel file
system as a global data
exchange
• Redshift - Preferred
platform for SQL-based
analysis and traditional
data warehouse data
• Focus is “Business Users”
• EMR – Scalable “Do
Everything” Platform - Enable
Teams who have chosen EMR
by providing Curated Data
• Focus is “Programmatic Access”
The Amazon “Data Lake” – Project Name “Andes”
The Goal: ”THE” Place for Data at Amazon
• Source teams (Data Producers) put their Public Data there to give access to Analytic
teams (Data Consumers) and to share private data within their team
• EMR Can Directly Access the Data in Parallel from Andes
• Redshift can load the data in Parallel from Andes, or it Can Directly Access the Data in
Parallel with Spectrum
“Datamarts”
Number of Teams using the DW: ~2300
Number of Tables Used per Team:
• Max: 598
• Min 1
• Average: 49
Ad-hoc access (any data, any time) can be achieved via:
• EMR, which can access the data in Andes directly
• Redshift, which can load data into the Redshift file
system or use the Spectrum feature to
directly access the data in Andes
An Architecture that Scales with the Business
Amazon Internal Team (132 Tables)
Putting The Pieces Together
The Analytic Architecture of the Future
Source
Systems
The Data Lake
“Andes”
Big Data Systems
Data Warehouses
“Bring Your Own Cluster” and
“Bring Your Own Query”
Services and Users
PostgreSQL
instance
Amazon
Redshift
Amazon
Redshift
Amazon
Redshift
Amazon
Kinesis
AWS Glue Amazon
QuickSight
Amazon
Athena
Amazon Machine
Learning
The Battle for the Future
The Data Lake becomes the
common source for all data:
The DW becomes the
compute engine for
traditional structured data
(Redshift)
EMR becomes the compute
engine for programmatic
access, like machine
learning and many emerging
use cases
Both become a form of a
Dependent data mart with
the data coming from the
Data Lake
Vs.
AND
Purchase
Contract
seller buyer
Table Subscriptions - The Vision
Subscription
“Big Data Technologies” Team
producer consumer
Andes – Current State
• We have the data!
• 20k+ Tables maintained in Andes – All Active Tables have
been Sourced from the Enterprise Data Warehouse
• Many teams are adding new data sets!
• Have Onboarded 900+ Redshift and EMR systems to Subscriptions
• 20,000+ tables being synchronized
• Usage off the Legacy DW
• Three years (2014-2016) to grow from 0 to 100k Jobs each
Day
• In 2017, usage has grown from 100k to 300k Jobs each Day
Amazon.com Big Data Technologies
FINRA’s Managed Data Lake
“For our market
surveillance systems, we
are looking at about 40%
[savings with AWS], but
the real benefits are the
business benefits: We
can do things that we
physically weren’t able to
do before, and that is
priceless.”
- Steve Randich, CIO
Case Study: Re-architecting Compliance
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75
billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic
clusters (Hadoop, Hive, and HBase), Amazon EMR,
and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20 million annually by using AWS
Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion
trading events per day and securely store over 5 petabytes of data,
attaining savings of $10-20 million per year.
Validation
Prepare for
Analytics
(ETL)
Run Automated
Detection
Models
Interactive
Analytics
Regulatory
Analyst
Explore
Investigate
Regulatory
Follow-up
BDs Exchanges Reference
Data Providers
Trade Execution Records
Market Reference Data
Data
Scientist
Develop
Models
Market Regulation - Analytics Pipeline
Keeping track of 40M+ tables can be a challenge…
What data do we have?
Where is the data used?
What is the source of this data?
How many versions of this data exist?
What is the retention policy?
Data availability and analytics is complex
Business
Analysts
Data
Scientists Data
Analysts
Data
Engineers
What data do we have?
What format is it in?
Where do I get it?
Get this data for them…
Not on disk—pull from tape
Prepare & Format
Oops, I need more data … Repeat!
I need data in different format …
Repeat!
etc…, etc…
Infrastructure can be limiting & costly
Does not scale well as volumes and workloads
increase
Duplication of effort in data management (data
lifecycle, retention, versioning)
Data sync issues—manual effort to keep data in
sync
Challenges to run analytics across fragmented
data
Costly system maintenance and upgrades
Key principles of our big data architecture
• Separate storage and compute
• Register and track all data in our data catalog
• Keep all versions of each data set
• Protect the data—encrypt at rest and in transit
• Partition data for extra performance
• Backup to another region for business continuity
• Optimize storage and processing costs
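Two of the principles above, keeping every version of a data set and partitioning data for performance, can be sketched as an object-key convention. The layout below (Hive-style `name=value` path segments), the function name, and the example data set are assumptions for illustration; the slides do not show FINRA's actual key scheme.

```python
from datetime import date


def partition_key(dataset: str, version: int, trade_date: date, filename: str) -> str:
    """Build a hypothetical object key that encodes data set version and partition."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    return (
        f"{dataset}/version={version}/"
        f"trade_date={trade_date.isoformat()}/{filename}"
    )


# Two versions of the same partition coexist under different prefixes,
# so a reprocessing run never overwrites the earlier output.
v1 = partition_key("orders", 1, date(2016, 8, 9), "part-0000.orc")
v2 = partition_key("orders", 2, date(2016, 8, 9), "part-0000.orc")
print(v1)
print(v2)
```

Because the partition value is part of the key, a query engine that understands this layout can skip every prefix whose `trade_date` does not match the filter.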
Catalog for centralized data management
finraos.github.io/herd
Unified catalog
• Schemas
• Versions
• Encryption type
• Storage policies
Lineage and Usage
• Track publishers and consumers
• Easily identify jobs and derived data sets
Shared Metastore
• Common definition of tables and partitions
• Use with Spark, Presto, Hive, and so on
• Faster instantiation of clusters
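To make the catalog idea above concrete, here is a minimal in-memory sketch that records schema versions and tracks publishers and consumers. It is not the herd API; every class, method, and data set name here is hypothetical and exists only to illustrate the concepts listed on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetEntry:
    """One catalog entry: who publishes it, its schema history, who reads it."""
    name: str
    publisher: str
    versions: List[Dict[str, str]] = field(default_factory=list)
    consumers: Set[str] = field(default_factory=set)


class Catalog:
    """Hypothetical unified catalog keyed by data set name."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register_version(self, name: str, publisher: str, schema: Dict[str, str]) -> int:
        """Append a schema version and return its 1-based version number."""
        entry = self._entries.setdefault(name, DatasetEntry(name, publisher))
        entry.versions.append(dict(schema))
        return len(entry.versions)

    def record_consumer(self, name: str, job: str) -> None:
        """Record that `job` reads the data set (usage and lineage tracking)."""
        self._entries[name].consumers.add(job)

    def latest_schema(self, name: str) -> Dict[str, str]:
        return self._entries[name].versions[-1]

    def consumers_of(self, name: str) -> List[str]:
        return sorted(self._entries[name].consumers)


catalog = Catalog()
first = catalog.register_version("trades", "intake", {"trade_id": "bigint"})
second = catalog.register_version(
    "trades", "intake", {"trade_id": "bigint", "trade_date": "date"}
)
catalog.record_consumer("trades", "surveillance-model")
print(first, second, catalog.consumers_of("trades"))
```

Earlier versions stay in the list rather than being overwritten, which mirrors the "keep all versions of each data set" principle.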
FINRA’s AWS Architecture
INTAKE MANAGEMENT ANALYTICS
Validation
Normalization
Linkage
Amazon Glacier
Amazon S3
Machine Learning
Amazon EMR
Amazon Redshift
• Structured & Unstructured Data
• Millions of documents
• 25K data checks daily
• Normalization
• 33,000 Servers Daily
• Centralized Data
• Normalized Data
• Integrated Data
• Discoverable
• Direct Data Query
• ML/AI Platforms
• Applications/Visualizations
Exchange Data
• 12 Equities Markets
• 4 Options Markets
SIP Data
• SIP trades
• SIP NBBO
• OPRA
Broker Dealer data
• 4000 plus firms
Third Party Data
• Bloomberg
• Thomson Reuters
• DTCC
• OCC
Machine Learning
Amazon EMR
Amazon Redshift
Amazon Glacier
Amazon S3
KMS
IAM
RDS
Leverage the Data: Apps, Query, Machine Learning
Data Lake
Audit Trail
Market
Surveillance
Ad-Hoc
Lifecycle
Viewer
App: Powerful UI;
billions of rows of
market tx data
Pattern detection
models and
execution
Investigation
and data profiling
through SQL
Retrieve market
events to render
order lifecycle
Data Science
Best of breed
tools, machine
learning
Enabling Data Science
Data
Scientist
Ad-hoc
Logical ‘Database’
EMR Cluster
Still one copy
of data!
Spark Cluster
DS-in-a-box
AuthN
Data
Scientist
Notebook
Data
Scientist
Catalog
IDE
Universal Data Science Platform (UDSP)
• Environment (EC2) for each
Data Scientist
• Simple provisioning interface
• Right instance (memory or
GPU) for job
• Access to all the data in
Data Lake
• Shut off when not using for
savings
• Secure (LDAP AuthN/Z +
Encryption)
Data
Scientist
UDSP – Inventory – not just R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages
• R: 300+ Python: 100+
• Tools for Building Packages
• gcc, gfortran, make, java, maven, ant…
• IDEs
• Jupyter, RStudio Server
• Deep Learning
• CUDA, CuDNN (if GPU present)
• Theano, Caffe, Torch
• TensorFlow
FINRA Usage Statistics on AWS
• 33k+ Amazon EC2 nodes per day
• 93%+ of EC2 usage is EMR based (mostly Spot)
• 20PB+ storage (Amazon S3, Amazon Glacier)
[Chart: Node Distribution for May 6-12 (~33k/day); series: Hadoop/Spark; Web, App & RDS; Redshift]
Achieve Dynamic processing
[Chart: Daily Order Volume (Billions), 11/1 to 11/29]
[Chart: Compute Nodes by Hour of Day]
AWS EMR compute on EC2
• 20k – 25k EC2 nodes per day
• 93% of EC2 is on EMR
• Avg EC2 node: 3 cores
• Avg EC2 uptime: 3 hours
• 96% of EC2 nodes live < 24 hrs
• Over 50k nodes on peak day
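As a back-of-the-envelope reading of the figures above, the sketch below multiplies nodes per day by average cores and average uptime to estimate daily core-hours, then asks how many always-on nodes of the same size would deliver the same amount. Treating the quoted averages as exact is an assumption, and the function name is hypothetical.

```python
def daily_core_hours(nodes_per_day: int, avg_cores: float, avg_uptime_hours: float) -> float:
    """Estimate core-hours consumed per day from the averages on the slide."""
    return nodes_per_day * avg_cores * avg_uptime_hours


AVG_CORES = 3          # "Avg EC2 node: 3 cores"
AVG_UPTIME_HOURS = 3   # "Avg EC2 uptime: 3 hours"

low = daily_core_hours(20_000, AVG_CORES, AVG_UPTIME_HOURS)
high = daily_core_hours(25_000, AVG_CORES, AVG_UPTIME_HOURS)

# Equivalent number of same-size nodes if they ran 24 hours a day instead.
always_on_equivalent = high / (AVG_CORES * 24)

print(low, high, always_on_equivalent)
```

Under these assumptions the transient fleet does the work of a few thousand permanently running nodes, which only covers the average case and not the 50k-node peak day.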
Query | Table size (rows) | Output size (rows) | ORC | TXT/BZ2
select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) | 2469171608 | 1 | 4s | 1m56s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 12 | 3s | 1m51s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 8364 | 5s | 2m5s
select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 | 2469171608 | 760 | 10s | 2m3s
Test Config:
• Presto 0.167.0.6t (Teradata) on EMR
• Data on S3 (external tables)
• Cluster size: 60 worker nodes x r4.4xlarge
Key points:
• Use ORC (or Parquet) for performant queries
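The timings reported above can be turned into speedup factors with a few lines of arithmetic. The sketch parses the duration strings as they appear in the table and divides the TXT/BZ2 time by the ORC time for each query; the helper name is hypothetical.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '4s' or '1m56s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


# (ORC, TXT/BZ2) timings for the four queries, in table order.
timings = [("4s", "1m56s"), ("3s", "1m51s"), ("5s", "2m5s"), ("10s", "2m3s")]

speedups = [to_seconds(txt) / to_seconds(orc) for orc, txt in timings]
for (orc, txt), factor in zip(timings, speedups):
    print(f"ORC {orc} vs TXT/BZ2 {txt}: {factor:.1f}x faster")
```

The smallest gain is on the `select *` query, where the ORC run takes 10 seconds against roughly two minutes for the compressed text files.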
Achieving Interactive Query Speed
Benefits We’ve Seen
Analytics
• Analysts can now interactively analyze 1000x more market events (billions vs. millions of rows)
• Querying order route detail went from 10s of minutes to seconds
• Quicker turnaround to provide data for
• Machine Learning model development is easier
Agility
• Easily reprocess data: finding capacity used to take weeks; now it can be done in a day or days
• Cloud makes it very easy to share (even large) data sets with third parties in Cloud
• Can perform model (pattern) reruns in days, not weeks
Resiliency
• Market volume changes are no longer disruptive events
• Improved system uptime vs. in-house
At a TCO 30% less expensive than with our data center
Analytics
Analytics On 450k
Subscribers Using Amazon
Redshift
Ad Campaign Effectiveness
Analysis Platform
Financial Simulations
Platform
Trading History
Clickstream Data
From 300 Websites
DNA Sequencing
Américo de Paula
Solutions Architecture Manager
americop@amazon.com
Worldwide | N. America | LATAM | UK/IR | EMEA | APAC | Japan | China

FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Analyzing Billions of Data Rows with Alteryx, Amazon Redshift, and Tableau
Analyzing Billions of Data Rows with Alteryx, Amazon Redshift, and TableauAnalyzing Billions of Data Rows with Alteryx, Amazon Redshift, and Tableau
Analyzing Billions of Data Rows with Alteryx, Amazon Redshift, and Tableau
 
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
 
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
 
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution Overview
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 

More from Amazon Web Services LATAM

AWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAmazon Web Services LATAM
 
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAmazon Web Services LATAM
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.Amazon Web Services LATAM
 
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAmazon Web Services LATAM
 
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAmazon Web Services LATAM
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.Amazon Web Services LATAM
 
Automatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWSAutomatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWSAmazon Web Services LATAM
 
Automatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWSAutomatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWSAmazon Web Services LATAM
 
Ransomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWSRansomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWSAmazon Web Services LATAM
 
Ransomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWSRansomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWSAmazon Web Services LATAM
 
Aprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWSAprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWSAmazon Web Services LATAM
 
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWSAprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWSAmazon Web Services LATAM
 
Cómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administradosCómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administradosAmazon Web Services LATAM
 
Os benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWSOs benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWSAmazon Web Services LATAM
 

More from Amazon Web Services LATAM (20)

AWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
 
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
 
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
 
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
 
Automatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWSAutomatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWS
 
Automatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWSAutomatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWS
 
Cómo empezar con Amazon EKS
Cómo empezar con Amazon EKSCómo empezar con Amazon EKS
Cómo empezar con Amazon EKS
 
Como começar com Amazon EKS
Como começar com Amazon EKSComo começar com Amazon EKS
Como começar com Amazon EKS
 
Ransomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWSRansomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWS
 
Ransomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWSRansomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWS
 
Ransomware: Estratégias de Mitigação
Ransomware: Estratégias de MitigaçãoRansomware: Estratégias de Mitigação
Ransomware: Estratégias de Mitigação
 
Ransomware: Estratégias de Mitigación
Ransomware: Estratégias de MitigaciónRansomware: Estratégias de Mitigación
Ransomware: Estratégias de Mitigación
 
Aprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWSAprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWS
 
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWSAprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWS
 
Cómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administradosCómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administrados
 
Simplifique su BI con AWS
Simplifique su BI con AWSSimplifique su BI con AWS
Simplifique su BI con AWS
 
Simplifique o seu BI com a AWS
Simplifique o seu BI com a AWSSimplifique o seu BI com a AWS
Simplifique o seu BI com a AWS
 
Os benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWSOs benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWS
 

Recently uploaded

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataSafe Software
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 

Recently uploaded (20)

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 

Big Data & Analytics - Innovating at the Speed of Light

  • 1. Big Data and Analytics: Innovating at the Speed of Light. Américo de Paula, Solutions Architecture Manager - LATAM. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 3. The Diminishing Value of Data
      - Recent data is highly valuable, if you act on it in time (Perishable Insights, M. Gualtieri, Forrester)
      - Old + recent data is more valuable, if you have the means to combine them
  • 4. Traditional Data Warehousing Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.
  • 7. Big Data When your data sets become so large and complex you have to start innovating around how to collect, store, process, analyze, and share them.
  • 8. Analytics and Big Data Technologies and techniques for working productively with massive amounts of data at any scale in either batch or real-time.
  • 9. The Industry Problem Growth in Data (mostly Unstructured) & Analytics Average Growth in Traditional DW Data Average IT Budget
  • 11. Why Big Data? Get answers faster and be able to ask questions that are not possible today.
      - Security threat detection
      - User behavior analysis
      - Smart applications (machine learning)
      - Business intelligence
      - Fraud detection
      - Financial modeling and forecasting
      - Spending optimization
      - Real-time alerting
  • 12. The Cloud Was Built for Big Data
  • 13. Big Data was Meant for the Cloud
      Big Data | Cloud Computing
      Variety, volume, and velocity requiring new tools | Variety of compute, storage, and networking options
      Potentially massive datasets | Massive, virtually unlimited capacity
      Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of IT infrastructure deployment and usage
      Frequently non-steady-state workloads with peaks and valleys | At its most efficient with highly variable workloads
      Absolute performance not as critical as “time to results”; shared resources are a bottleneck | Parallel compute projects allow each workgroup to have more autonomy and get faster results
  • 14. Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = The Cloud removes constraints
  • 15. The AWS Approach
      - Flexible: use the best tool for the job (data structure, latency, throughput, access patterns)
      - Low cost: big data ≠ big cost
      - Scalable: data should be immutable (append-only); batch/speed/serving layer
      - Minimize admin overhead: leverage AWS managed services (no or very low admin)
      - Be agile: fail fast, test more, optimize big data at a lower cost
  • 16. Starting small is powerful, when you can scale up fast
      Scaling up your analytics systems | With AWS | Traditional IT *
      Get a new BI server | 20 minutes | Weeks to months
      Upgrade your analytics server to the newest Intel processors and add 16GB memory | 15 minutes | Weeks
      Add 500TB of storage | Instant | Weeks to months
      Grow a DWH cluster from 8GB to 1PB | 1 hour | Several months
      Build a 1024-node Hadoop cluster | 30 minutes | Unlikely
      Roll out multi-region production environment | Hours | Months
      * Actual provisioning times in a well-organized IT division
  • 17. AWS Big Data Platform (Collect, Orchestrate, Store, Analyze): EMR, EC2, Glacier, S3, Import/Export, Kinesis, Direct Connect, Machine Learning, Redshift, DynamoDB, AWS Database Migration Service, AWS Lambda, AWS IoT, AWS Data Pipeline, Amazon Kinesis Analytics, Amazon SNS, AWS Snowball, Amazon SWF, Amazon Athena, Amazon QuickSight, Amazon Aurora, AWS Glue
  • 18. Optimal Combinations of Interoperable Services
      - Amazon Redshift: data warehouse
      - Amazon Elastic MapReduce: semi-structured
      - Amazon Glacier: archive
      - Amazon Simple Storage Service: data storage
      - Amazon DynamoDB: NoSQL
      - Amazon Machine Learning: predictive models
      - Amazon Kinesis: streaming
      - Other apps
  • 19. Sample Reference Architecture: Data Lake Kinesis Firehose Athena Query Service
  • 20. Processing & Analytics Real-time Batch AI & Predictive BI & Data Visualization Transactional & RDBMS AWS Lambda Apache Storm on EMR Apache Flink on EMR Spark Streaming on EMR Elasticsearch Service Kinesis Analytics, Kinesis Streams DynamoDB NoSQL DB Relational Database Aurora EMR Hadoop, Spark, Presto Redshift Data Warehouse Athena Query Service Amazon Lex Speech recognition Amazon Rekognition Amazon Polly Text to speech Machine Learning Predictive analytics Kinesis Streams & Firehose
  • 21. Data Lakes with new tools
      - Relational and non-relational data
      - TBs-EBs scale
      - Schema defined during analysis
      - Diverse analytical engines to gain insights
      - Designed for low-cost storage and analytics
      - Diagram labels: OLTP, ERP, CRM, LOB, Data Warehouse, Business Intelligence, Data Lake, Devices, Web, Sensors, Social, Catalog, Machine Learning, DW Queries, Big data processing, Interactive, Real-time
  • 22. Who consumes Analytics
      - Business Users: for making strategic decisions, e.g. reports like YoY growth of sales
      - Data Scientists: to identify models for forward-looking analytics of data, e.g. typically long-running ad hoc queries on data
      - Developers: to clean and process application data using Big Data jobs, e.g. processing incoming clickstream data
      - Consumers (new): real-time actions and intelligence on consumption, e.g. spend patterns in banks or telecom
  • 23. A few AWS customers on Big Data / Data Lakes: 1.2 TB/day of logs; 30 TB/day of data; 250 Hadoop jobs; 75 billion transactions/day; 5 petabytes of data; 25 PB data warehouse on Amazon S3; > 1 PB read each day
  • 24. Netflix Uses S3 to Back its Various Clusters S3
  • 25. Why Amazon S3 for Big Data?
      - Scalable
          - Virtually unlimited number of objects
          - Very high bandwidth, with no aggregate throughput limit
      - Cost-effective
          - No need to run compute clusters for storage (unlike HDFS)
          - Can run transient Hadoop clusters and Amazon EC2 Spot Instances
          - Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
      - Flexible access
          - Direct access by big data frameworks (Spark, Hive, Presto)
          - Shared access: multiple (Spark, Hive, Presto) clusters can use the same data
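The tiered-storage point on slide 25 can be illustrated with a lifecycle rule. The following is a minimal sketch and is not taken from the slides: the `logs/` prefix, the rule ID, the bucket name, and the 30/365-day thresholds are hypothetical. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the call itself appears only in a comment so the sketch runs without AWS credentials.

```python
import json


def build_lifecycle_config(prefix, ia_after_days, glacier_after_days):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    Objects move to Standard-IA after `ia_after_days` and to Glacier after
    `glacier_after_days`. All thresholds are illustrative assumptions.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical values: tier everything under logs/ after 30 and 365 days.
config = build_lifecycle_config("logs/", 30, 365)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review or version before it is applied to a bucket.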
  • 26. Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands. More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes. Nasdaq technology is used to power more than 100 marketplaces in 50 countries. Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds. We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes and geographies.
  • 27. Nasdaq implements an S3 data lake + Redshift data warehouse architecture
      - The most recent two years of data are kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery
      - Data between two and five years old is kept in S3
      - Presto on EMR is used for ad hoc queries on data in S3
      - Transitioned from an on-premises data warehouse to an Amazon Redshift & S3 data lake architecture
      - Over 1,000 tables migrated
      - Average daily ingest of over 7B rows
      - Migrated off the legacy DW to AWS (start to finish) in 7 man-months
      - AWS costs were 43% of the legacy budget for the same data set (~1,100 tables)
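The age-based tiering described on slide 27 can be sketched as a small routing function. This is an illustration under stated assumptions, not Nasdaq's implementation: the function name, the tier labels, and the use of 365.25-day years are hypothetical, and the slide does not say what happens to data older than five years, so that case is reported as unspecified.

```python
from datetime import date

# Thresholds taken from the slide; everything else here is an assumption.
REDSHIFT_MAX_AGE_YEARS = 2
S3_MAX_AGE_YEARS = 5


def storage_tier(record_date, today):
    """Return which store would hold a record of the given date."""
    age_days = (today - record_date).days
    if age_days < 0:
        raise ValueError("record_date is in the future")
    age_years = age_days / 365.25  # approximation, assumed for illustration
    if age_years <= REDSHIFT_MAX_AGE_YEARS:
        return "redshift"  # recent data, also snapshotted into S3
    if age_years <= S3_MAX_AGE_YEARS:
        return "s3"  # queried ad hoc with Presto on EMR
    return "unspecified"  # the slide does not cover data older than five years


today = date(2017, 1, 1)
for d in (date(2016, 6, 1), date(2014, 1, 1), date(2010, 1, 1)):
    print(d.isoformat(), "->", storage_tier(d, today))
```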
  • 28. Our vision is to be earth’s most customer-centric company; to build a place where people can come to find and discover anything they might want to buy online. Amazon
  • 32. The Amazon Enterprise Data Warehouse: The Good!
      - Helps to run the Amazon business
          - Most comprehensive set of cleansed and curated business data
          - Feeds many downstream systems and processes
          - Batch processing, reporting, and ad hoc
          - 500k+ data loads/transformations each day
          - 200k+ queries/extracts each day
          - 20k+ active tables
          - 10B++ rows loaded daily
      - Our data is big!
          - Core data set: 5+ PB of compressed data (primarily limited by legacy technology)
          - Total storage (multiple systems): 35+ PB compressed
          - Quote from an executive at the legacy DW vendor: ~1000x larger than any other DW customer (from that vendor)
      - Significant and increasing use of Redshift and EMR
          - 1000s of Redshift and EMR systems, ranging in size from individual-contributor, project-based systems to systems running a multi-billion-dollar business inside Amazon
  • 33. Example of an Amazon DW “Customer” Team • Who are we? • Analytics on the “Marketplace” • Analytics Spokes: Pricing, B2B, Seller Support, Lending … • Business Scale: • 235MM monthly CPU Minutes on Legacy ODW • 2K upstream tables • Users: • Supports 170 teams • 1000 users with 9527 profiles (Parameterized Queries) • 20K unique job runs per month • 2800 (800 TB) datasets • BI Tool Users: • 3000+ Users, 650 non-tech • 600+ “Dashboards” • 100ks of queries each month
  • 34. “Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/ Image used with permission under the Creative Commons Attribution 2.0 Generic License
  • 35. What is the Goal? • To provide an analytic ecosystem that scales with the Amazon business • To leverage AWS technologies and to help improve these technologies for all Amazon customers • To provide choice and options in new analytic technologies • Provide an SQL-based solution • Increasingly focus on enabling new analytic approaches, including machine learning and programmatic data analysis • Enable both “Bring Your Own Cluster” and “Bring Your Own Query” approaches
  • 36. “Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/ Image used with permission under the Creative Commons Attribution 2.0 Generic License (https://creativecommons.org/licenses/by/2.0/)
  • 37. Amazon EMR (running Hive, Pig, Spark, Presto, etc.) • Amazon DynamoDB • Amazon Machine Learning • Amazon QuickSight • Amazon RDS • Amazon Elasticsearch Service • Amazon Redshift • Amazon Athena • Amazon SQS • Amazon Kinesis Analytics • Amazon Kinesis Firehose • Amazon S3 • Amazon Kinesis • Open-source tools (e.g. for ML, data science) • Commercial tools
  • 38. Moving Forward - AWS • S3 / EDX - Separate storage from compute by leveraging a parallel file system as a global data exchange • Amazon Redshift - Preferred platform for SQL-based analysis and traditional data warehouse data • Focus is “Business Users” • EMR - Scalable “Do Everything” platform - Enable teams who have chosen EMR by providing curated data • Focus is “Programmatic Access”
  • 39. The Amazon “Data Lake” – Project Name “Andes” The Goal: “THE” Place for Data at Amazon • Source teams (Data Producers) put their Public Data there to give access to Analytic teams (Data Consumers) and to share private data within their team • EMR can directly access the data in parallel from Andes • Redshift can load the data in parallel from Andes, or it can directly access the data in parallel with Spectrum
  • 40. An Architecture that Scales with the Business • “Datamarts” • Number of teams using the DW: ~2300 • Number of tables used per team: Max: 598, Min: 1, Average: 49 • Ad hoc access (any data, any time) can be achieved via EMR, which can access the data in Andes directly • Redshift can load data into the Redshift file system, or it can use the Spectrum feature to directly access the data in Andes • Amazon Internal Team (132 Tables)
  • 41. Putting The Pieces Together: The Analytic Architecture of the Future • Source Systems • The Data Lake “Andes” • Big Data Systems • Data Warehouses • “Bring Your Own Cluster” and “Bring Your Own Query” • Services and Users • Diagram components: PostgreSQL instance, Amazon Redshift (x3), Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Athena, Amazon Machine Learning
  • 42. The Battle for the Future • The Data Lake becomes the common source for all data • The DW becomes the compute engine for traditional structured data (Redshift) • EMR becomes the compute engine for programmatic access, like machine learning and many emerging use cases • Both become a form of dependent data mart, with the data coming from the Data Lake • Vs. AND
  • 44. Purchase Contract: seller, buyer
  • 45. Table Subscriptions - The Vision
  • 46. Subscription: “Big Data Technologies” Team, producer, consumer
  • 48. Andes – Current State • We have the data! • 20k+ Tables maintained in Andes – All Active Tables have been Sourced from the Enterprise Data Warehouse • Many teams are adding new data sets! • Have Onboarded 900+ Redshift and EMR systems to Subscriptions • 20,000+ tables being synchronized • Usage off the Legacy DW • Three years (2014-2016) to grow from 0 to 100k Jobs each Day • In 2017, has grown from 100k to 300k Jobs each Day • Amazon.com Big Data Technologies
  • 50. Case Study: Re-architecting Compliance • “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” - Steve Randich, CIO • What FINRA needed: • Infrastructure for its market surveillance platform • Support of analysis and storage of approximately 75 billion market events every day • Why they chose AWS: • Fulfillment of FINRA’s security requirements • Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3 • Benefits realized: • Increased agility, speed, and cost savings • Estimated savings of $10-20 million annually by using AWS
  • 51. Fraud Detection: FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
  • 53. Market Regulation - Analytics Pipeline • Data sources: BDs, Exchanges, Reference Data Providers • Inputs: Trade Execution Records, Market Reference Data • Stages: Validation, Prepare for Analytics (ETL), Run Automated Detection Models, Interactive Analytics • Regulatory Analyst: Explore, Investigate, Regulatory Follow-up • Data Scientist: Develop Models
  • 54. Keeping track of 40M+ tables can be a challenge… What data do we have? Where is the data used? What is the source of this data? How many versions of this data exist? What is the retention policy?
  • 55. Data availability and analytics is complex • Roles: Business Analysts, Data Scientists, Data Analysts, Data Engineers • What data do we have? • What format is it in? • Where do I get it? • Get this data for them… • Not on disk, pull from tape • Prepare & Format • Oops, I need more data… Repeat! • I need data in a different format… Repeat! • etc.
  • 56. Infrastructure can be limiting & costly Does not scale well as volumes and workloads increase Duplication of effort in data management (data lifecycle, retention, versioning) Data sync issues—manual effort to keep data in sync Challenges to run analytics across fragmented data Costly system maintenance and upgrades
  • 57. Key principles of our big data architecture • Separate storage and compute • Register and track all data in our data catalog • Keep all versions of each data set • Protect the data: encrypt at rest and in transit • Partition data for extra performance • Back up to another region for business continuity • Optimize storage and processing costs
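Two of the principles above, keeping every version of a data set and partitioning data for performance, can be illustrated with a key-naming sketch. The slides do not show FINRA's actual storage layout, so the Hive-style `key=value` prefixes, the `orders` data set name, and both function names are assumptions made for illustration.

```python
from __future__ import annotations

from datetime import date


def partition_prefix(dataset: str, version: int, trade_date: date) -> str:
    """Build a hypothetical prefix: one folder per version and per trade date."""
    return f"{dataset}/version={version}/trade_date={trade_date.isoformat()}/"


def prune(prefixes: list[str], trade_date: date) -> list[str]:
    """Keep only the prefixes that a query for one trade date has to read."""
    wanted = f"trade_date={trade_date.isoformat()}/"
    return [p for p in prefixes if p.endswith(wanted)]


days = [date(2016, 8, 9), date(2016, 8, 10)]
# Two versions of the same data set are kept side by side, never overwritten.
prefixes = [partition_prefix("orders", v, d) for v in (1, 2) for d in days]
print(prune(prefixes, date(2016, 8, 10)))
```

Because the date is part of the prefix, a query filtered on one trade date can skip every other day's files, which is where the extra performance from partitioning comes from.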
  • 58. Catalog for centralized data management finraos.github.io/herd Unified catalog • Schemas • Versions • Encryption type • Storage policies Lineage and Usage • Track publishers and consumers • Easily identify jobs and derived data sets Shared Metastore • Common definition of tables and partitions • Use with Spark, Presto, Hive, and so on • Faster instantiation of clusters
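The catalog ideas on this slide (schemas, versions, publishers, consumers, derived data sets) can be sketched as a small in-memory registry. This is not the herd API; the class names, method names, and fields below are hypothetical and only show how versioning and lineage tracking fit together.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetVersion:
    version: int
    schema: dict[str, str]                               # column name -> type
    publisher: str                                       # job or team that wrote it
    derived_from: list[str] = field(default_factory=list)
    consumers: set[str] = field(default_factory=set)


class Catalog:
    """Toy registry: every registration adds a new version, nothing is overwritten."""

    def __init__(self) -> None:
        self._versions: dict[str, list[DatasetVersion]] = {}

    def register(self, name, schema, publisher, derived_from=()):
        versions = self._versions.setdefault(name, [])
        entry = DatasetVersion(len(versions) + 1, dict(schema), publisher, list(derived_from))
        versions.append(entry)
        return entry

    def latest(self, name: str) -> DatasetVersion:
        return self._versions[name][-1]

    def record_consumer(self, name: str, job: str) -> None:
        self.latest(name).consumers.add(job)

    def downstream(self, name: str) -> list[str]:
        """Names of data sets that list `name` as a source in any version."""
        return sorted(
            other for other, versions in self._versions.items()
            if any(name in v.derived_from for v in versions)
        )


catalog = Catalog()
catalog.register("trades", {"trade_date": "date"}, publisher="intake")
catalog.register("trades", {"trade_date": "date", "symbol": "string"}, publisher="intake")
catalog.register("alerts", {"alert_id": "string"}, publisher="surveillance", derived_from=["trades"])
catalog.record_consumer("trades", "pattern-detection-job")
print(catalog.latest("trades").version, catalog.downstream("trades"))
```

A shared metastore serves the same role for query engines: Spark, Presto, and Hive read one common table definition instead of each keeping its own.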
  • 59. FINRA’s AWS Architecture • Columns: INTAKE, MANAGEMENT, ANALYTICS • Exchange Data: 12 Equities Markets, 4 Options Markets • SIP Data: SIP trades, SIP NBBO, OPRA • Broker Dealer data: 4000 plus firms • Third Party Data: Bloomberg, Thomson Reuters, DTCC, OCC • Processing steps: Validation, Normalization, Linkage • Structured & Unstructured Data • Millions of documents • 25K data checks daily • Normalization • 33,000 Servers Daily • Centralized Data • Normalized Data • Integrated Data • Discoverable • Direct Data Query • ML/AI Platforms • Applications/Visualizations • Services: Amazon S3, Amazon Glacier, Amazon EMR, Amazon Redshift, Machine Learning, API, KMS, IAM, RDS
  • 60. Leverage the Data: Apps, Query, Machine Learning • Data Lake • Audit Trail: App: Powerful UI; billions of rows of market tx data • Market Surveillance: Pattern detection models and execution • Ad-Hoc: Investigation and data profiling through SQL • Lifecycle Viewer: Retrieve market events to render order lifecycle • Data Science: Best of breed tools, machine learning
  • 61. Enabling Data Science • Diagram components: Data Scientist, Notebook, IDE, Catalog, AuthN, DS-in-a-box, Spark Cluster, EMR Cluster, Ad-hoc Logical ‘Database’ • Still one copy of data!
  • 62. Universal Data Science Platform (UDSP) • Environment (EC2) for each Data Scientist • Simple provisioning interface • Right instance (memory or GPU) for job • Access to all the data in Data Lake • Shut off when not using for savings • Secure (LDAP AuthN/Z + Encryption) Data Scientist
  • 63. UDSP – Inventory – not just R • R 3.2.5, Python (2.7.12 and 3.4.3) • Packages: R: 300+, Python: 100+ • Tools for Building Packages: gcc, gfortran, make, java, maven, ant… • IDEs: Jupyter, RStudio Server • Deep Learning: CUDA, cuDNN (if GPU present), Theano, Caffe, Torch, TensorFlow
  • 64. FINRA Usage Statistics on AWS • 33k+ Amazon EC2 nodes per day • 93%+ of EC2 usage is EMR based (mostly Spot) • 20 PB+ storage (Amazon S3, Amazon Glacier) • Chart: Node Distribution for May 6-12 (~33k/day) • Hadoop/Spark: 40289, 41770, 40512, 36589, 33275, 16023, 8710 • Web, App & RDS: 2145, 2323, 2542, 2363, 2363, 1686, 1590 • Redshift: 231, 231, 2…, 231, 231, 231, 231
  • 65. Achieve Dynamic processing • Chart: Daily Order Volume (Billions), 11/1 to 11/29 • Chart: AWS EMR compute on EC2, Compute Nodes by Hour of Day • EMR: 20k-25k EC2 nodes per day • 93% of EC2 is on EMR • Avg EC2 node: 3 cores • Avg EC2 uptime: 3 hours • 96% of EC2 nodes live < 24 hrs • Over 50k nodes on peak day
  • 66. Achieving Interactive Query Speed
    Query | Table size (rows) | Output size (rows) | ORC | TXT/BZ2
    select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) | 2469171608 | 1 | 4s | 1m56s
    select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 12 | 3s | 1m51s
    select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 8364 | 5s | 2m5s
    select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 | 2469171608 | 760 | 10s | 2m3s
    Test config: Presto 0.167.0.6t (Teradata) on EMR; data on S3 (external tables); cluster size: 60 worker nodes x r4.4xlarge
    Key point: Use ORC (or Parquet) for performant queries
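To make the ORC versus TXT/BZ2 gap concrete, the timings on this slide can be converted to seconds and compared. The timing pairs are copied from the slide in row order; the helper name `to_seconds` and the list names are only illustrative.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '1m56s' or '4s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups(default="0")
    return int(minutes) * 60 + int(seconds)


# (ORC time, TXT/BZ2 time) for the four queries, in the order shown on the slide.
TIMINGS = [("4s", "1m56s"), ("3s", "1m51s"), ("5s", "2m5s"), ("10s", "2m3s")]

speedups = [to_seconds(txt) / to_seconds(orc) for orc, txt in TIMINGS]
for (orc, txt), ratio in zip(TIMINGS, speedups):
    print(f"ORC {orc} vs TXT/BZ2 {txt}: {ratio:.1f}x faster")
```

On these four queries the columnar format is roughly 12x to 37x faster, which is the basis for the slide's recommendation to use ORC or Parquet.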
  • 67. Benefits We’ve Seen • Analytics: • Analysts can now interactively analyze 1000x more market events (billions vs. millions of rows) • Querying order route detail went from 10s of minutes to seconds • Quicker turnaround to provide data for … • Machine Learning model development is easier • Agility: • Easily reprocess data: it used to take weeks to find capacity, and can now be done in a day or days • Cloud makes it very easy to share (even large) data sets with third parties in the cloud • Can perform model (pattern) reruns in days, not weeks • Resiliency: • Market volume changes are no longer disruptive events • Improved system uptime vs. in-house • At TCO 30% less expensive than with our data center
  • 68. Analytics • Analytics On 450k Subscribers Using Amazon Redshift • Ad Campaign Effectiveness Analysis Platform • Financial Simulations Platform • Trading History • Clickstream Data From 300 Websites • DNA Sequencing
  • 69. Américo de Paula Solutions Architecture Manager americop@amazon.com Worldwide | N. America | LATAM | UK/IR | EMEA | APAC | Japan | China