2. Agenda
Big Data Challenges
Architecture principles
What technologies should you use? How? Why?
Reference architecture
Design patterns
Customer Story: The Move to Real-Time Data Architectures, DNA Oy
6. Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views
Be cost-conscious
• Big data ≠ big cost
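The log-centric principle above can be sketched in miniature: an immutable, append-only event log is the source of truth, and materialized views are derived by replaying it. This is a minimal illustration of the pattern, not any specific AWS service:

```python
# Minimal sketch of a log-centric design: events are immutable and
# append-only; views are rebuilt by replaying the log.
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)        # frozen = events can never be mutated
class Event:
    user_id: str
    amount: int

log: List[Event] = []          # append-only; entries are never updated in place

def append(event: Event) -> None:
    log.append(event)

def materialize_balances(events: List[Event]) -> Dict[str, int]:
    """Rebuild a materialized view (per-user totals) by replaying the log."""
    view: Dict[str, int] = {}
    for e in events:
        view[e.user_id] = view.get(e.user_id, 0) + e.amount
    return view

append(Event("alice", 10))
append(Event("bob", 5))
append(Event("alice", -3))
balances = materialize_balances(log)   # {'alice': 7, 'bob': 5}
```

Because the log is immutable, any number of views (batch, interactive, real-time) can be derived from the same history and rebuilt at will.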
7. Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Evaluate each stage by time to answer (latency), throughput, and cost.
8. Types of Data (COLLECT)
Transport: mobile apps, web apps, data centers, AWS Direct Connect, AWS Import/Export Snowball
Applications → RECORDS
• In-memory data structures, database records
Logging → DOCUMENTS, FILES
• Search documents, log files (Amazon CloudWatch, AWS CloudTrail)
Messaging → MESSAGES
• Messages
Devices → STREAMS
• Sensors & IoT platforms, data streams (AWS IoT)
Broadly: transactions, files, and events.
10. Data Characteristics: Hot, Warm, Cold

              Hot        Warm     Cold
Volume        MB–GB      GB–TB    PB–EB
Item size     B–KB       KB–MB    KB–TB
Latency       ms         ms, sec  min, hrs
Durability    Low–high   High     Very high
Request rate  Very high  High     Low
Cost/GB       $$-$       $-¢¢     ¢
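The table above can be read as a rough decision rule for picking a tier. A small sketch of that heuristic; the numeric thresholds are invented for the example, not from the slide:

```python
def storage_tier(latency_ms: float, requests_per_sec: float) -> str:
    """Rough heuristic mapping access characteristics to a data tier.
    Thresholds are illustrative only."""
    if latency_ms <= 10 and requests_per_sec >= 1000:
        return "hot"      # e.g. in-memory cache or NoSQL store
    if latency_ms <= 1000:
        return "warm"     # e.g. object or relational storage
    return "cold"         # e.g. archival storage

print(storage_tier(1, 5000))    # hot
print(storage_tier(100, 50))    # warm
print(storage_tier(60000, 1))   # cold
```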
20. Best Practice: Use the Right Tool for the Job
Data tier options:
• Search: Amazon Elasticsearch Service
• In-memory: Amazon ElastiCache (Redis, Memcached)
• SQL: Amazon Aurora; Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server)
• NoSQL: Amazon DynamoDB; Cassandra, HBase, MongoDB
22. COLLECT → STORE
The collected data types (records from applications, documents and files from logging via Amazon CloudWatch and AWS CloudTrail, messages, and streams from devices via AWS IoT, ingested over AWS Direct Connect or AWS Import/Export Snowball) map to purpose-built stores:
• Stream storage (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
• Message: Amazon SQS
• File: Amazon S3
• Search: Amazon Elasticsearch Service
• NoSQL: Amazon DynamoDB
• Cache: Amazon ElastiCache
• SQL: Amazon RDS
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB
• Managed NoSQL database service
Amazon RDS
• Managed relational database service
Amazon Elasticsearch Service
• Managed Elasticsearch service
23. Which Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data characteristics → Hot, warm, cold
Cost → Choose storage whose cost matches the data's value and access frequency
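"Store data in the format you will access it" often means encoding the access pattern into the key itself. A minimal sketch with an invented time-partitioned object-key scheme (the prefix layout and names are assumptions, not from the deck):

```python
from datetime import datetime, timezone

def event_key(source: str, ts: datetime) -> str:
    """Build a time-partitioned object key so readers can prune by prefix.
    The source/year/month/day layout is illustrative."""
    return (f"{source}/year={ts.year:04d}/month={ts.month:02d}/"
            f"day={ts.day:02d}/events-{ts.strftime('%H%M%S')}.json")

ts = datetime(2017, 5, 4, 12, 30, 0, tzinfo=timezone.utc)
key = event_key("clickstream", ts)
# 'clickstream/year=2017/month=05/day=04/events-123000.json'
```

With keys like this, a query for one day touches only that day's prefix instead of scanning everything.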
25. Cost-Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
"I'm currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…"

Request rate (writes/sec):  300
Object size (bytes):        2,048
Total size (GB/month):      1,483
Objects per month:          777,600,000
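The totals above follow directly from the request rate and object size. A quick sketch of the arithmetic, assuming a 30-day month and GiB units:

```python
# Derive the slide's monthly totals from request rate and object size.
writes_per_sec = 300
object_bytes = 2048
seconds_per_month = 30 * 24 * 3600        # 2,592,000 (30-day month)

objects_per_month = writes_per_sec * seconds_per_month
total_gib = objects_per_month * object_bytes / 2**30

print(objects_per_month)   # 777600000
print(round(total_gib))    # 1483
```

Running the same two inputs through each service's pricing then answers the S3-vs-DynamoDB question for your workload.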
29. Analytics Types & Frameworks (PROCESS/ANALYZE)
Batch
• Takes minutes to hours
• Example: daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Message
• Takes milliseconds to seconds
• Example: message processing
• Amazon SQS applications on Amazon EC2
Stream
• Takes milliseconds to seconds
• Example: fraud alerts, 1-minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
Artificial Intelligence
• Takes milliseconds to minutes
• Example: fraud detection, demand forecasting, text to speech
• Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, Caffe)
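For the stream category above, throughput scales with shards: Kinesis routes each record to a shard by the MD5 hash of its partition key. A simplified sketch of that routing; the even split of the 128-bit hash space mirrors the default shard layout but is an assumption of this example:

```python
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard the way Kinesis does:
    MD5-hash the key into a 128-bit integer, then pick the shard
    whose hash range contains it (even ranges assumed here)."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    range_size = 2**128 // num_shards
    return min(h // range_size, num_shards - 1)

# Records with the same key always land on the same shard,
# which preserves per-key ordering for consumers.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
```

This is why choosing a high-cardinality partition key matters: skewed keys concentrate traffic on a few shards and cap throughput.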
30. What About ETL?
ETL sits between STORE and PROCESS/ANALYZE in the pipeline.
Data integration partners reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes: https://aws.amazon.com/big-data/partner-solutions/
AWS Glue (new) is a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores.
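The extract-transform-load steps themselves are conceptually simple. A toy pass in plain Python (not Glue-specific; the sample data and cleansing rules are invented):

```python
import csv
import io

# Extract: raw CSV with inconsistent casing and an empty row.
raw = "name,city\nAlice,helsinki\n,\nBob,ESPOO\n"

def etl(source: str) -> list:
    """Extract rows from CSV, cleanse and normalize them, return the load set."""
    rows = csv.DictReader(io.StringIO(source))
    out = []
    for row in rows:
        if not row["name"]:                   # cleanse: drop empty records
            continue
        row["city"] = row["city"].title()     # transform: normalize casing
        out.append(row)
    return out                                # load: hand off to the target store

cleaned = etl(raw)
# [{'name': 'Alice', 'city': 'Helsinki'}, {'name': 'Bob', 'city': 'Espoo'}]
```

Managed services like Glue take over the hard parts around this core: schema discovery, scheduling, retries, and moving data between stores at scale.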
33. CONSUME
Downstream of STORE and PROCESS/ANALYZE:
• Applications & API
• Analysis and visualization: Amazon QuickSight (business users)
• Notebooks and IDE (data scientists, developers)
41. The Move to a Real-Time Data Architecture
Jarno Kartela
Chief Data Scientist, DNA Oy
42. DNA Today
DNA is one of the leading Finnish telecommunications groups.
• EUR 859 million net sales in 2016
• EUR 91 million operating result in 2016
• 3.8 million mobile communications and fixed network customer subscriptions
• TV: Finland's largest cable operator and the leading pay-TV provider
• 1,668 employees at the end of 2016
• 64 DNA stores: Finland's most extensive retailer of mobile phones, other mobile devices, and mobile subscriptions
• Strong employee satisfaction: the personnel's satisfaction with DNA as an employer is at a record-breaking high
• The customer is at the center of DNA's strategy
Our values:
• FAST: DNA's customers receive quick and helpful service
• STRAIGHTFORWARD: DNA's approach is clear and responsible
• BOLD: We are direct, open-minded and ready for change
68. Benefits
Customer
• Unified experience across channels
• Personalized content
• Better offers
• Less time spent browsing DNA services
• Overall better CX
• Coming: my data & the ability to change the data about you as a customer
Business
• Time spent on marketing is 10x less than before
• Sales up 3x across AI-driven campaigns
• Sales up 2x–10x across campaigns triggered by real-time data
• Near real-time (5 min) insight into what's happening
• Insight across channels
IT / Analytics
• Better service quality with less time & resources
• DevOps, automation
• Endless scale for analytics & data
• Ease of try-fail-try-again
• Cost effectiveness
• One source of truth for data
• Ability to serve all channels & functions with one platform
• 6 months to full-scale production
69. Coming Up
• Rekognition for retail analytics
• Lex or similar for voice dialogue and chatbots
• GDPR → create services instead of fear
71. Summary
Build decoupled systems
• Use Amazon S3 as the data fabric of your data lake
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable log, batch, interactive & real-time views
Be cost-conscious
• Big data ≠ big cost
72. Building a Data Lake on AWS
• Amazon Kinesis Firehose
• Amazon Athena query service
73. Resources
• https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
• AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
• AWS re:Invent 2016: Deep Dive on Amazon S3 (STG303)
• https://aws.amazon.com/blogs/big-data/reinvent-2016-aws-big-data-machine-learning-sessions/
• https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/