Originally, Hadoop was used as a batch analytics tool; this is rapidly changing as applications move toward real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters run on the Amazon Elastic MapReduce infrastructure, launched by users of every size, from university students to Fortune 50 companies. We recently launched Amazon Kinesis, a managed service for real-time processing of high-volume streaming data. Amazon Kinesis enables a new class of big data applications that can continuously analyze data at any volume and throughput, in real time. Adi will discuss each service, dive into how customers are adopting them for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution that can ingest and process terabytes of data per hour from hundreds of thousands of concurrent sources. Forever change how you process website clickstreams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.
2. Amazon Kinesis & Big Data
o Motivations for Stream Processing
Origins: Internal metering capability
Expanding the big data processing landscape
o Customer view on streaming data
o Amazon Kinesis Overview
Amazon Kinesis Architecture
Kinesis concepts & Demo
o Amazon Elastic MapReduce and Kinesis
The EMR connector brings Kinesis stream data into the Hadoop framework
Applying Hadoop frameworks to streaming data
o Amazon Kinesis and Redshift:
Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”
Presented by Daniel Mintz, Director of Business Intelligence, Upworthy
4. Origins: Internal AWS Metering Capability
Workload
• 10s of millions records/sec
• Multiple TB per hour
• 100,000s of sources
Pain points
• Doesn’t scale elastically
• Customers want real-time alerts
• Expensive to operate
• Relies on eventually consistent storage
5. Expanding the Big Data Processing Landscape
Traditional Data Warehousing:
• Query engine approach
• Pre-computations such as indices and dimensional views improve performance
• Historical, structured data

Hadoop-Style Processing:
• Hive / SQL-on-Hadoop / MapReduce / Spark
• Batch programs, or other abstractions that break down into MapReduce-style computations
• Historical, semi-structured data

Stream Processing:
• Custom computations of relatively simple complexity
• Continuous processing – filters, sliding windows, aggregates – on infinite data streams
• Semi-structured/structured data, generated continuously in real-time
6. A Generalized Data Flow
Many different technologies, at different stages of evolution
Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting
7. Our Big Data Transition
Old Posture
• Capture huge amounts of data and process it in hourly or daily batches
New Requirements
• Make decisions faster, sometimes in real-time
• Scale the entire system elastically
• Make it easy to “keep everything”
• Multiple applications can process data in parallel
8. Foundation for Data Streams Ingestion, Continuous Processing
Right Toolset for the Right Job
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads
Continuous Processing FX
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Elastic
• Enable multiple apps to process in parallel
Enable data movement into Stores/ Processing Engines
Managed Service
Low end-to-end latency
Continuous, real-time workloads
10. Customer Scenarios across Industry Segments
Scenarios: (1) Accelerated Ingest-Transform-Load · (2) Continual Metrics/KPI Extraction · (3) Responsive Data Analysis
Data types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Segment | (1) Ingest-Transform-Load | (2) Metrics/KPI Extraction | (3) Responsive Data Analysis
Software/Technology | IT server and app log ingestion | IT operational metrics dashboards | Device/sensor operational intelligence
Digital Ad Tech/Marketing | Advertising data aggregation | Advertising metrics like coverage, yield, conversion | Analytics on user engagement with ads; optimized bid/buy engines
Financial Services | Market/financial transaction order data collection | Financial market data metrics | Fraud monitoring, Value-at-Risk assessment, auditing of market order data
Consumer Online/E-Commerce | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines
11. Big streaming data comes from the small
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@32473
class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
MQTT Record

<R,AMZN ,T,G,R1>
NASDAQ OMX Record
12. What Business Problem Needs to Be Solved?

Mobile/Social Gaming:
• Goal: Deliver continuous, real-time game insight data from hundreds of game servers
• Today: Custom-built solutions that are operationally complex to manage and not scalable
• Pain: Delays in delivering critical business data; developer burden of building a reliable, scalable platform for real-time data ingestion/processing; slow-down of real-time customer insights
• Outcome: Accelerate time to market of elastic, real-time applications while minimizing operational overhead

Digital Advertising Tech.:
• Goal: Generate real-time metrics and KPIs for online ad performance for advertisers/publishers
• Today: Store-and-forward fleet of log servers feeding a Hadoop-based processing pipeline
• Pain: Lost data in the store/forward layer; operational burden of managing a reliable, scalable platform for real-time data ingestion/processing; batch-driven rather than real-time customer insights
• Outcome: Generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients
14. Amazon Kinesis Architecture
Within AWS, the stream spans three Availability Zones:
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Millions of sources producing hundreds of terabytes per hour
• A front end handles authentication and authorization
• The ordered stream of events supports multiple readers:
  – Real-time dashboards and alarms
  – Machine learning algorithms or sliding-window analytics
  – Aggregate and archive to S3
  – Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million PUTs
15. Kinesis Stream:
Managed ability to capture and store data
• Streams are made of shards
• Each shard ingests data up to 1 MB/sec, and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by splitting or merging shards
• Replay data inside the 24-hour window
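The shard limits above imply a simple sizing rule: provision enough shards to cover whichever ingest dimension (bytes or records) is the bottleneck. A minimal sketch, using the stated per-shard limits:

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Shards required given the stated limits of 1 MB/sec and
    1,000 records/sec ingest per shard: take the larger of the two."""
    return max(math.ceil(mb_per_sec / 1.0),
               math.ceil(records_per_sec / 1000.0))

# 5 MB/sec of ~2 KB records (2,560 records/sec): bytes dominate.
assert shards_needed(5, 2560) == 5
# 0.5 MB/sec of tiny records at 4,000 records/sec: record rate dominates.
assert shards_needed(0.5, 4000) == 4
```

Splitting or merging shards, as the slide notes, is how a stream is resized when either dimension changes.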
16. Putting Data into Kinesis
Simple Put interface to store data in Kinesis
• Producers use a PUT call to store data in a stream: PutRecord {Data, PartitionKey, StreamName}
• A partition key, supplied by the producer, is used to distribute PUTs across shards
• Kinesis applies an MD5 hash to the supplied partition key to map each record into a shard’s hash key range
• A unique sequence number is returned to the producer upon a successful PUT call
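The MD5 mapping can be sketched locally without calling the service. The even split of the 128-bit hash key space below is an illustrative assumption (shard ranges are set when shards are created or split), and `shard_for_key` is our name, not a Kinesis API:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """MD5-hash the partition key into the 128-bit hash key space,
    then find the shard whose contiguous range contains it (assuming
    the space is divided evenly among shards)."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)

# The mapping is deterministic: records with the same partition key
# always land on the same shard, preserving per-key ordering.
assert shard_for_key("device-42", 4) == shard_for_key("device-42", 4)
assert 0 <= shard_for_key("device-42", 4) < 4
```

This is why the choice of partition key controls how evenly load spreads across shards: many distinct keys spread PUTs out, while one hot key concentrates them on a single shard.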
17. Building Kinesis Processing Apps: Kinesis Client Library
Open-source library for fault-tolerant, continuous processing apps
• Java client library; source available on GitHub
• Build your app with the KCL and deploy it on your EC2 instance(s)
• The KCL is an intermediary between your application and the stream
• Automatically starts a Kinesis worker for each shard, simplifying reads by abstracting individual shards
• Increase/decrease workers as the number of shards changes
• Checkpoints to keep track of a worker’s position in the stream; restarts workers if they fail
• Integrates with Auto Scaling groups to redistribute workers to new instances
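The worker/checkpoint pattern the KCL automates can be illustrated with a toy sketch (this is not the KCL’s actual API; all names here are ours, and the “shard” is just an in-memory list):

```python
class Checkpointer:
    """Remembers the last sequence number a worker has processed."""
    def __init__(self):
        self.last_sequence = None

    def checkpoint(self, seq):
        self.last_sequence = seq

class RecordProcessor:
    """One of these runs per shard; it processes records in order
    and checkpoints its position as it goes."""
    def __init__(self):
        self.processed = []

    def process_records(self, records, checkpointer):
        for seq, data in records:
            self.processed.append(data)
            checkpointer.checkpoint(seq)

# Simulate a shard as an ordered list of (sequence_number, data) pairs.
shard = [(1, "a"), (2, "b"), (3, "c")]
cp = Checkpointer()
RecordProcessor().process_records(shard, cp)

# A restarted worker resumes after the last checkpoint rather than
# reprocessing the whole stream -- here, nothing is left to redo.
resume_from = [(s, d) for s, d in shard if s > cp.last_sequence]
assert cp.last_sequence == 3 and resume_from == []
```

In the real library, checkpoints are stored durably (in DynamoDB) so that a replacement worker on another instance can pick up where a failed one left off.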
18. Amazon Kinesis Connector Library
Open-source code to connect Kinesis with S3, Redshift, and DynamoDB
• ITransformer – defines the transformation of records from the Amazon Kinesis stream to suit the user-defined data model
• IFilter – excludes irrelevant records from the processing
• IBuffer – buffers the set of records to be processed, by specifying a size limit (number of records) and a total byte count
• IEmitter – makes client calls to other AWS services and persists the records stored in the buffer
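The four interfaces compose into a filter → transform → buffer → emit pipeline. A minimal Python sketch of that flow (the real library defines these as Java interfaces; the class and parameter names below are illustrative, not the library’s API):

```python
class Pipeline:
    """Toy version of the connector pipeline: drop irrelevant records
    (IFilter), reshape the rest (ITransformer), accumulate them up to
    a size limit (IBuffer), then persist the batch (IEmitter)."""
    def __init__(self, transform, keep, buffer_limit, emit):
        self.transform, self.keep = transform, keep
        self.buffer_limit, self.emit = buffer_limit, emit
        self.buffer = []

    def handle(self, record):
        if not self.keep(record):                   # IFilter
            return
        self.buffer.append(self.transform(record))  # ITransformer
        if len(self.buffer) >= self.buffer_limit:   # IBuffer limit reached
            self.emit(self.buffer)                  # IEmitter
            self.buffer = []

emitted = []
p = Pipeline(transform=str.upper,
             keep=lambda r: r != "noise",
             buffer_limit=2,
             emit=emitted.append)
for rec in ["a", "noise", "b", "c"]:
    p.handle(rec)

# "noise" was filtered; "a" and "b" were emitted as one batch;
# "c" is still buffered, waiting for the next emit.
assert emitted == [["A", "B"]] and p.buffer == ["C"]
```

In the real connectors the emitter stage is where records are written to S3, Redshift, or DynamoDB.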
19. Sending & Reading Data from Kinesis Streams
Sending:
• HTTP POST
• AWS SDK
• AWS Mobile SDK
• Log4J
• Flume
• Fluentd
Consuming:
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
21. Amazon Elastic MapReduce (EMR)
Managed Service for Hadoop based data processing
• Managed service
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features that ensure customer success on AWS
22. Applying batch processing to streamed data
Client/Sensor → Recording Service → Aggregator/Sequencer → Continuous processor for dashboard → Storage → Analytics and Reporting
Streaming data ingestion and continuous processing: Amazon Kinesis; batch storage and analytics: Amazon EMR
23. What would this look like?
Input: users and developers hit my website, and a Log4J appender pushes events to Kinesis
Processing: EMR (Hive, Pig, Cascading, MapReduce) pulls from the Kinesis stream
24. Features and Functionality
• Offered starting with EMR AMI 3.0.4 – simply spin up the EMR cluster as normal
• Logical names – labels that define units of work (Job A vs. Job B)
• Iterations – provide idempotency (pessimistic locking of the logical name)
• Checkpoints – create input start and end points to allow batch processing
25. Iterations – the Run of a Job
The stream holds the last 24 hours of data for a logical name, from the trim-horizon sequence ID (now minus 24 hours) to the latest sequence ID (now). Successive iterations partition that window into fixed batches – e.g., Iteration 1: 1:00–7:00, Iteration 2: 7:00–13:00, Iteration 3: 13:00–19:00, Iteration 4: 19:00–1:00 – and the next iteration starts where the last one ended.
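The windowing above is easy to sketch: split the retention window into fixed, non-overlapping batches so each run has idempotent start and end points. A hedged illustration (not the connector’s actual API; the six-hour iteration length matches the slide’s example):

```python
from datetime import datetime, timedelta

def iteration_bounds(now: datetime, window_hours: int = 24,
                     iteration_hours: int = 6):
    """Divide the retention window (trim horizon .. now) into fixed
    iterations, each with explicit start/end points for batch reruns."""
    trim_horizon = now - timedelta(hours=window_hours)
    bounds, start = [], trim_horizon
    while start < now:
        end = min(start + timedelta(hours=iteration_hours), now)
        bounds.append((start, end))
        start = end
    return bounds

now = datetime(2014, 3, 1, 1, 0)
its = iteration_bounds(now)
assert len(its) == 4                              # four 6-hour iterations
assert its[0][0] == datetime(2014, 2, 28, 1, 0)   # trim horizon, -24 hours
assert its[-1][1] == now                          # latest point in the stream
```

Because each iteration’s boundaries are fixed, rerunning a failed job over the same iteration reads exactly the same input, which is what makes the reruns idempotent.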
30. Handling Errors
• The InputFormat handles service errors:
  – Throttling (HTTP 400)
  – Service unavailable (HTTP 503)
  – Internal server errors (HTTP 500)
  – HTTP client exceptions, such as socket connection timeouts
• Hadoop handles retry of failed map tasks
• Iterations allow retries:
  – Fixed input boundaries on a stream (idempotency for reruns)
  – Enable multiple queries on the same input boundaries
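The service-error handling amounts to a retry loop with backoff around each Kinesis call. A minimal sketch under stated assumptions (function and constant names are ours; the real InputFormat’s retry policy may differ in detail):

```python
import time

RETRYABLE = {400, 500, 503}  # throttling, internal error, service unavailable

def call_with_retries(call, max_retries=3, backoff_ms=100, sleep=time.sleep):
    """Retry retryable service errors a bounded number of times,
    waiting between attempts, then return whatever we last got."""
    for attempt in range(max_retries + 1):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_retries:
            sleep(backoff_ms / 1000.0)
    return status, body

# Simulate a service that throttles, then is unavailable, then succeeds.
responses = iter([(400, None), (503, None), (200, "records")])
status, body = call_with_retries(lambda: next(responses), sleep=lambda s: None)
assert (status, body) == (200, "records")
```

If retries are exhausted, Hadoop’s own task-retry machinery provides the next layer of recovery, and the fixed iteration boundaries make the rerun safe.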
31. Hadoop Ecosystem Implementation
Implementations:
• Hadoop InputFormat
• Hive storage handler
• Pig load function
• Cascading Scheme and Tap
Use Cases:
• Join multiple data sources for analysis
• Filter and preprocess streams
• Export and archive streaming data
32. Writing to Kinesis Using Log4J

Option | Default | Description
log4j.appender.KINESIS.streamName | AccessLogStream | Stream name to which data is published.
log4j.appender.KINESIS.encoding | UTF-8 | Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.
log4j.appender.KINESIS.maxRetries | 3 | Maximum number of retries when calling Kinesis APIs to publish a log message.
log4j.appender.KINESIS.backoffInterval | 100ms | Milliseconds to wait before a retry attempt.
log4j.appender.KINESIS.threadCount | 20 | Number of parallel threads for publishing logs to the configured Kinesis stream.
log4j.appender.KINESIS.bufferSize | 2000 | Maximum number of outstanding log messages to keep in memory.
log4j.appender.KINESIS.shutdownTimeout | 30 | Seconds allowed for sending buffered messages before the application JVM quits normally.
.error("Cannot find resource XYX… go do something about it!");
37. What’s Upworthy
• We’ve been called
– “Social media with a mission” by our About Page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me,
optimistically
38. What We Do
• We aim to drive massive amounts of attention to things that really matter.
• We do that by finding, packaging, and distributing great, meaningful content.
40. When We Started
• Had built a data warehouse from scratch
• Hadoop-based batch workflow
• Nightly ETL cycle
• 2.5 Engineers
• Wanted to do all three:
– Comprehensive
– Ad Hoc
– Real-Time
41. The Decision
• Speed up our current system, rather than building a parallel one
• Had looked at alternative stream processors – cost, maintenance
• Comfortable with the concept of an application log stream
42. How It Works
• Log Drain receives, formats, batches and zips
• PUTs 50k GZIP batches on Kinesis stream
• Three types of Kinesis consumers:
1. Archiver – Batch and write permanent record
2. Stats – Filter, sample and count; Report to StatHat
3. Transformer – Filter, batch, validate; writes temporary BSVs to S3
• Database Importer handles manifest files.
• S3 handles garbage collection.
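The log-drain step (format, batch, zip, PUT) can be sketched with the standard library. This is an illustrative reconstruction, not Upworthy’s code; the batch size here is tiny for demonstration, where theirs were ~50k GZIP batches per PUT:

```python
import gzip
import json

def make_put_batches(events, batch_size=3):
    """Group events, JSON-encode each group, and gzip it into a single
    payload suitable for one Kinesis PUT."""
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        yield gzip.compress(json.dumps(batch).encode("utf-8"))

events = [{"id": n} for n in range(5)]
payloads = list(make_put_batches(events))
assert len(payloads) == 2  # 3 events + 2 events

# Any of the three consumer types can unzip a payload and recover
# the original events losslessly.
recovered = json.loads(gzip.decompress(payloads[0]))
assert recovered == [{"id": 0}, {"id": 1}, {"id": 2}]
```

Batching and compressing before the PUT is what keeps per-record overhead (and PUT-transaction cost) low at thousands of events per second.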
43. Our system now
• Stats:
  – Average: ~1,085 events/second
  – Peak: ~2,500 events/second
• Data is available in Redshift in < 10 minutes
• Kinesis has been cheap and stable, and gives us redundancy and resiliency
• A computation model that’s easy to reason about
44. Resiliency
• When something goes wrong, you have 24 hours.
• Timestamp at the outset; track lag at each step.
• Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.
46. Some Lessons
• You can use one pipeline for everything.
• High-cardinality fact data belongs in Kinesis.
• EDN works well with Kinesis.
• We prefer explicit checkpointing. (Your mileage may vary.)
• Languages that run on the JVM can take advantage of AWS client libraries.
47. Kinesis Pricing
Simple, pay-as-you-go, no up-front costs

Pricing Dimension | Value
Hourly shard rate | $0.015
Per 1,000,000 PUT transactions | $0.028

• Customers specify throughput requirements in shards, which they control
• Each shard delivers 1 MB/s on ingest and 2 MB/s on egress
• Inbound data transfer is free
• EC2 instance charges apply for Kinesis processing applications
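A back-of-the-envelope cost sketch using only the two prices on this slide ($0.015 per shard-hour, $0.028 per million PUTs); the example workload is hypothetical, and EC2 charges for processing apps are excluded:

```python
SHARD_HOUR = 0.015          # $ per shard per hour
PER_MILLION_PUTS = 0.028    # $ per 1,000,000 PUT transactions

def monthly_cost(shards, puts_per_second, hours=24 * 30):
    """Stream cost for a 30-day month: shard-hours plus PUT volume."""
    shard_cost = shards * hours * SHARD_HOUR
    put_cost = puts_per_second * 3600 * hours / 1_000_000 * PER_MILLION_PUTS
    return round(shard_cost + put_cost, 2)

# 2 shards sustaining 1,000 PUTs/sec for a month:
# shards: 2 * 720 h * $0.015 = $21.60; PUTs: 2,592M * $0.028/M = $72.58
assert monthly_cost(2, 1000) == 94.18
```

The shard-hour term dominates for low-volume streams, while PUT volume dominates at high record rates, which is why batching small events before the PUT (as in the Upworthy pipeline) cuts cost.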
48. Canonical Data flows with Amazon Kinesis
• Continuous metric extraction
• Incremental stats computation
• Record archiving
• Live dashboards
49. Try out Amazon Kinesis
• Try out Amazon Kinesis
– http://aws.amazon.com/kinesis/
• Thumb through the Developer Guide
– http://aws.amazon.com/documentation/kinesis/
• Test drive the sample app
– https://github.com/awslabs/amazon-kinesis-data-visualization-sample
• Kinesis Connector Framework
– https://github.com/awslabs/amazon-kinesis-connectors
• Read EMR-Kinesis FAQs
– http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector
• Visit, and Post on Kinesis Forum
– https://forums.aws.amazon.com/forum.jspa?forumID=169#