Amazon Web Services provides a broad range of services that help you build and deploy big data analytics applications quickly and easily. AWS offers fast access to flexible, low-cost IT resources, letting you rapidly scale virtually any big data application, including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and Internet of Things processing. With AWS you don't need large upfront investments of time or money to build and maintain infrastructure. Instead, you can provision exactly the right type and size of resources needed to power your big data analytics applications. You can access as many resources as you need, almost instantly, and pay only for what you use.
3. The Diminishing Value of Data
Recent data is highly valuable
If you act on it in time
Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
If you have the means to combine them
4. Traditional Data Warehousing
Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are
central repositories of integrated data from one or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could
range from annual and quarterly comparisons and trends to detailed daily sales analysis.
7. Big Data
When your data sets become so large and complex
you have to start innovating around how to
collect, store, process, analyze, and share them.
8. Analytics and Big Data
Technologies and techniques for
working productively with massive
amounts of data at any scale in
either batch or real-time.
9. The Industry Problem
[Chart comparing three trend lines: growth in data (mostly unstructured) & analytics, average growth in traditional DW data, and average IT budget]
11. Why Big Data?
Security threat detection
User Behavior Analysis
Smart Application (Machine Learning)
Business Intelligence
Fraud detection
Financial Modeling and Forecasting
Spending optimization
Real-time alerting
Get answers faster and be able to ask questions that are not possible today.
13. Big Data was Meant for the Cloud
Big Data | Cloud Computing
Variety, volume, and velocity requiring new tools | Variety of compute, storage, and networking options
Potentially massive datasets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of IT infrastructure deployment and usage
Frequently non-steady-state workloads with peaks and valleys | At its most efficient with highly variable workloads
Absolute performance not as critical as "time to results"; shared resources are a bottleneck | Parallel compute projects allow each workgroup to have more autonomy and get faster results
14. Elastic and highly scalable + no upfront capital expense + only pay for what you use + available on-demand = the Cloud removes constraints
15. The AWS Approach
• Flexible - Use the best tool for the job
• Data structure, latency, throughput, access patterns
• Low Cost - Big data ≠ big cost
• Scalable – Data should be immutable (append-only)
• Batch/speed/serving layer
• Minimize Admin Overhead - Leverage AWS managed services
• No or very low admin
• Be Agile – Fail fast, test more, optimize Big Data at a lower cost
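The "immutable (append-only)" principle above can be sketched in a few lines. This is an illustrative stand-in, not code from the deck: `AppendOnlyStore` mimics an object store that refuses overwrites, and `event_key` lays events out under Hive-style date partitions so batch and speed layers can prune by date.

```python
import json
from datetime import datetime, timezone

def event_key(dataset: str, event: dict, seq: int) -> str:
    """Build an immutable, date-partitioned object key for an event batch.

    Keys use Hive-style partitioning (dt=YYYY-MM-DD) and a sequence
    number, so writers only ever add new objects, never overwrite.
    """
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return f"{dataset}/dt={ts:%Y-%m-%d}/part-{seq:05d}.json"

class AppendOnlyStore:
    """In-memory stand-in for an object store with append-only semantics."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, body: str) -> None:
        # Refuse to overwrite: existing data stays immutable.
        if key in self._objects:
            raise ValueError(f"immutable store: {key} already exists")
        self._objects[key] = body

    def keys(self):
        return sorted(self._objects)

store = AppendOnlyStore()
store.put(event_key("clicks", {"ts": 0}, 0), json.dumps({"ts": 0}))
print(store.keys())  # ['clicks/dt=1970-01-01/part-00000.json']
```

On S3 the same effect comes from writing each batch to a new key rather than updating objects in place; corrections become new versions instead of mutations.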
16. Starting small is powerful,
when you can scale up fast
Scaling up your analytics systems | With AWS | Traditional IT*
Get a new BI server | 20 minutes | Weeks to months
Upgrade your analytics server to the newest Intel processors and add 16 GB memory | 15 minutes | Weeks
Add 500 TB of storage | Instant | Weeks to months
Grow a DWH cluster from 8 GB to 1 PB | 1 hour | Several months
Build a 1,024-node Hadoop cluster | 30 minutes | Unlikely
Roll out a multi-region production environment | Hours | Months
* Actual provisioning times in a well-organized IT division
17. AWS Big Data Platform
[Diagram: services across the pipeline stages Collect, Orchestrate, Store, and Analyze: Amazon EMR, Amazon EC2, Amazon Glacier, Amazon S3, AWS Import/Export, Amazon Kinesis, AWS Direct Connect, Amazon Machine Learning, Amazon Redshift, Amazon DynamoDB, AWS Database Migration Service, AWS Lambda, AWS IoT, AWS Data Pipeline, Amazon Kinesis Analytics, Amazon SNS, AWS Snowball, Amazon SWF, Amazon Athena, Amazon QuickSight, Amazon Aurora, AWS Glue]
18. Optimal Combinations of Interoperable Services
• Amazon Redshift: data warehouse
• Amazon Elastic MapReduce: semi-structured data
• Amazon S3: data storage
• Amazon Glacier: archive
• Amazon DynamoDB: NoSQL
• Amazon Machine Learning: predictive models
• Amazon Kinesis: streaming
• Other apps
20. Processing & Analytics
Real-time: Kinesis Streams & Firehose, Kinesis Analytics, AWS Lambda, Apache Storm on EMR, Apache Flink on EMR, Spark Streaming on EMR, Elasticsearch Service
Batch: EMR (Hadoop, Spark, Presto), Redshift (data warehouse), Athena (query service)
AI & Predictive: Amazon Lex (speech recognition), Amazon Rekognition (image analysis), Amazon Polly (text to speech), Machine Learning (predictive analytics)
Transactional & RDBMS: DynamoDB (NoSQL DB), Aurora (relational database)
BI & Data Visualization
21. Data Lakes with new tools
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
[Diagram: sources (OLTP, ERP, CRM, LOB applications; devices, web, sensors, social) feed both a traditional data warehouse for business intelligence and a data lake with a catalog, which serves machine learning, DW queries, big data processing, and interactive and real-time analytics]
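"Schema defined during analysis" (schema-on-read) is the key contrast with a warehouse's schema-on-write. A minimal sketch, with invented sample records: raw JSON lines land in the lake untyped, and each analysis applies its own schema, casts, and defaults at read time.

```python
import json

# Raw records land in the lake as-is; no schema is enforced at write time.
raw_lines = [
    '{"user": "a", "amount": "12.50", "country": "BR"}',
    '{"user": "b", "amount": "3.99"}',
    '{"user": "c", "amount": "7.00", "country": "US", "extra": true}',
]

# Each analysis applies its own schema when it reads: pick the fields it
# needs, cast types, and supply defaults for missing values.
schema = {"user": str, "amount": float, "country": str}
defaults = {"country": "unknown"}

def read_with_schema(lines, schema, defaults):
    for line in lines:
        rec = json.loads(line)
        yield {k: cast(rec.get(k, defaults.get(k))) for k, cast in schema.items()}

rows = list(read_with_schema(raw_lines, schema, defaults))
print(rows[1])  # {'user': 'b', 'amount': 3.99, 'country': 'unknown'}
```

A second analysis could read the same raw lines with a completely different schema (e.g. keeping `extra`), which is why diverse analytical engines can share one lake.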
22. Who consumes Analytics
Business Users: for making strategic decisions, e.g. reports like YoY growth of sales
Data Scientists: to identify models for forward-looking analysis of data, e.g. typically long-running ad-hoc queries on data
Developers: to clean and process application data using big data jobs, e.g. processing incoming clickstream data
Consumers (new): real-time actions and intelligence on consumption, e.g. spend patterns in banking or telecom
23. A few AWS customers on Big Data / Data Lakes
1.2 TB/day of logs
30 TB/day of data
250 Hadoop jobs
75 billion transactions/day
5 petabytes of data
25 PB data warehouse on Amazon S3, with more than 1 PB read each day
25. Why Amazon S3 for Big Data?
• Scalable
• Virtually Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Cost-Effective
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Flexible Access
• Direct access by big data frameworks (Spark, Hive, Presto)
• Shared access: multiple (Spark, Hive, Presto) clusters can use the same data
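The tiered-storage bullet above is driven by an S3 lifecycle configuration. A minimal sketch of such a policy; the bucket name and `raw/` prefix are hypothetical, and the transition days are illustrative:

```python
import json

# Transition objects under raw/ to Infrequent Access after 30 days and
# to Glacier after 90, matching the tiered-storage bullet above.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3, this dict would be applied to a bucket via:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
print(json.dumps(lifecycle, indent=2))
```

Because tiering is a bucket policy rather than a compute job, no cluster has to run to move cold data to cheaper storage.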
26. NASDAQ lists 3,600 global companies in diverse industries, worth $9.6 trillion in market cap and representing many of the world's most well-known and innovative brands. More than $1 trillion in U.S. notional value is tied to our library of more than 41,000 global indexes. NASDAQ technology is used to power more than 100 marketplaces in 50 countries. Our global platform can handle more than 1 million messages/second at sub-40-microsecond average speeds. We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes and geographies.
27. • Nasdaq implements an S3 data lake + Redshift data warehouse
architecture
• Most recent two years of data is kept in the Redshift data warehouse
and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad-hoc queries on data in S3
• Transitioned from an on-premises data warehouse to Amazon Redshift &
S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set (~1100
tables)
28. Our vision is to be earth’s most customer-centric company;
to build a place where people can come to find and discover
anything they might want to buy online.
Amazon
38. Moving Forward - AWS
• S3 / EDX: separate storage from compute by leveraging a parallel file system as a global data exchange
• Amazon Redshift: preferred platform for SQL-based analysis and traditional data warehouse data; focus is "Business Users"
• EMR: scalable "do everything" platform; enable teams who have chosen EMR by providing curated data; focus is "Programmatic Access"
50. "For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren't able to do before, and that is priceless."
- Steve Randich, CIO
Case Study: Re-architecting Compliance
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75
billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic
clusters (Hadoop, Hive, and HBase), Amazon EMR,
and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20m annually by using AWS
51. Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
54. Keeping track of 40M+ tables can be a challenge…
What data do we have?
Where is the data used?
What is the source of this data?
How many versions of this data exist?
What is the retention policy?
55. Data availability and analytics is complex
Personas: Business Analysts, Data Scientists, Data Analysts, Data Engineers
A typical exchange:
What data do we have? What format is it in? Where do I get it?
"Get this data for them..." Not on disk: pull from tape, then prepare and format.
"Oops, I need more data..." Repeat!
"I need the data in a different format..." Repeat!
etc., etc.
56. Infrastructure can be limiting & costly
Does not scale well as volumes and workloads increase
Duplication of effort in data management (data lifecycle, retention, versioning)
Data sync issues: manual effort to keep data in sync
Challenges to run analytics across fragmented data
Costly system maintenance and upgrades
57. Key principles of our big data architecture
• Separate storage and compute
• Register and track all data in our data catalog
• Keep all versions of each data set
• Protect the data—encrypt at rest and in transit
• Partition data for extra performance
• Backup to another region for business continuity
• Optimize storage and processing costs
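The "partition data for extra performance" principle pays off at query time through partition pruning: an engine only scans the prefixes that match the query's date range. A sketch with a hypothetical `trades` dataset laid out under `dt=` prefixes:

```python
from datetime import date, timedelta

def partitions_between(dataset: str, start: date, end: date):
    """List the dt= partition prefixes a query over [start, end] must scan.

    Because the data is laid out by date, a query engine reads only the
    matching prefixes instead of the whole dataset.
    """
    d = start
    while d <= end:
        yield f"{dataset}/dt={d.isoformat()}/"
        d += timedelta(days=1)

prefixes = list(partitions_between("trades", date(2017, 5, 1), date(2017, 5, 3)))
print(prefixes)
# ['trades/dt=2017-05-01/', 'trades/dt=2017-05-02/', 'trades/dt=2017-05-03/']
```

Spark, Hive, Presto, and Athena all do this automatically when the table is declared as partitioned on the `dt` column.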
58. Catalog for centralized data management
finraos.github.io/herd
Unified catalog
• Schemas
• Versions
• Encryption type
• Storage policies
Lineage and Usage
• Track publishers and consumers
• Easily identify jobs and derived data sets
Shared Metastore
• Common definition of tables and partitions
• Use with Spark, Presto, Hive, and so on
• Faster instantiation of clusters
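To make the catalog idea concrete, here is a minimal in-memory sketch of a catalog entry carrying the fields the slide lists (schema, version, encryption, storage policy, lineage). All names and values are hypothetical; FINRA's actual catalog is the herd project linked above.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Minimal catalog record: schema, version, storage, and lineage."""
    name: str
    version: int
    schema: dict          # column name -> type
    encryption: str       # e.g. "SSE-KMS"
    storage_policy: str   # e.g. "transition to IA after 30d"
    producers: list = field(default_factory=list)   # jobs that publish it
    consumers: list = field(default_factory=list)   # jobs/teams that read it

catalog = {}

def register(entry: CatalogEntry) -> None:
    # Every (name, version) pair is tracked, so all versions are kept.
    catalog[(entry.name, entry.version)] = entry

register(CatalogEntry(
    name="market_events", version=2,
    schema={"event_id": "string", "ts": "timestamp", "price": "decimal"},
    encryption="SSE-KMS", storage_policy="transition to IA after 30d",
    producers=["intake-job"], consumers=["surveillance", "adhoc-sql"],
))

entry = catalog[("market_events", 2)]
print(entry.consumers)  # ['surveillance', 'adhoc-sql']
```

Tracking producers and consumers per version is what makes the lineage questions on the previous slides ("where is the data used?", "what is the source?") answerable with a lookup instead of an investigation.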
59. FINRA's AWS Architecture
[Diagram: INTAKE (validation, normalization, linkage) feeds MANAGEMENT (centralized, normalized, integrated, discoverable data) via API, which feeds ANALYTICS (direct data query, ML/AI platforms, applications/visualizations) via API.
Inputs: exchange data (12 equities markets, 4 options markets); SIP data (SIP trades, SIP NBBO, OPRA); broker-dealer data (4,000+ firms); third-party data (Bloomberg, Thomson Reuters, DTCC, OCC); structured & unstructured data, millions of documents.
Scale: 25K data checks daily; 33,000 servers daily.
Services: Amazon S3, Amazon Glacier, Amazon EMR, Amazon Redshift, Machine Learning, KMS, IAM, RDS]
60. Leverage the Data: Apps, Query, Machine Learning (all on the Data Lake)
• Audit Trail: powerful UI over billions of rows of market transaction data
• Market Surveillance: pattern detection models and execution
• Ad-Hoc: investigation and data profiling through SQL
• Lifecycle Viewer: retrieve market events to render the order lifecycle
• Data Science: best-of-breed tools, machine learning
62. Universal Data Science Platform (UDSP)
• Environment (EC2) for each data scientist
• Simple provisioning interface
• Right instance (memory or GPU) for the job
• Access to all the data in the Data Lake
• Shut off when not in use for savings
• Secure (LDAP AuthN/Z + encryption)
63. UDSP Inventory – not just R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages: R: 300+, Python: 100+
• Tools for building packages: gcc, gfortran, make, java, maven, ant…
• IDEs: Jupyter, RStudio Server
• Deep learning: CUDA, cuDNN (if GPU present); Theano, Caffe, Torch; TensorFlow
64. FINRA Usage Statistics on AWS
• 33k+ Amazon EC2 nodes per day
• 93%+ of EC2 usage is EMR-based (mostly Spot)
• 20 PB+ storage (Amazon S3, Amazon Glacier)
[Chart: node distribution for May 6-12 (~33k/day), split across Hadoop/Spark, Web/App & RDS, and Redshift]
65. Achieve Dynamic Processing
[Charts: daily order volume in billions (11/1-11/29); EMR compute nodes on EC2 by hour of day]
• 20k-25k EC2 nodes per day; 93% of EC2 is on EMR
• Avg EC2 node: 3 cores; avg EC2 uptime: 3 hours
• 96% of EC2 nodes live < 24 hrs; over 50k nodes on peak day
67. Benefits We've Seen
Analytics:
• Analysts can now interactively analyze 1000x more market events (billions vs. millions of rows)
• Querying order route detail went from tens of minutes to seconds
• Quicker turnaround to provide data; machine learning model development is easier
Agility:
• Easily reprocess data: finding capacity used to take weeks, now it can be done in a day or days
• The cloud makes it very easy to share (even large) data sets with third parties in the cloud
• Can perform model (pattern) reruns in days, not weeks
Resiliency:
• Market volume changes are no longer disruptive events
• Improved system uptime vs. in-house
At TCO, 30% less expensive than our data center.
68. Analytics
• Analytics on 450k subscribers using Amazon Redshift
• Ad campaign effectiveness analysis platform
• Financial simulations platform
• Trading history
• Clickstream data from 300 websites
• DNA sequencing
69. Américo de Paula
Solutions Architecture Manager
americop@amazon.com
Worldwide | N. America | LATAM | UK/IR | EMEA | APAC | Japan | China