Originally, Hadoop was used as a batch analytics tool; this is rapidly changing as applications move toward real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters run on the Amazon Elastic MapReduce infrastructure, launched by users of every size, from university students to Fortune 50 companies. We recently launched Amazon Kinesis, a managed service for real-time processing of high-volume streaming data. Amazon Kinesis enables a new class of big data applications that can continuously analyze data at any volume and throughput, in real time. Adi will discuss each service, dive into how customers are adopting them for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution that can ingest and process terabytes of data per hour from hundreds of thousands of concurrent sources. Forever change how you process website clickstreams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.
2. Amazon Kinesis & Big Data
o Motivations for Stream Processing
Origins: Internal metering capability
Expanding the big data processing landscape
o Customer view on streaming data
o Amazon Kinesis Overview
Amazon Kinesis Architecture
Kinesis concepts & Demo
o Amazon Elastic MapReduce and Kinesis
The EMR connector brings Kinesis stream data into the Hadoop framework
Applying Hadoop frameworks to streaming data
o Amazon Kinesis and Redshift:
Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”
Presented by Daniel Mintz, Director of Business Intelligence, Upworthy
4. Origins: Internal AWS Metering Capability
Workload
• 10s of millions records/sec
• Multiple TB per hour
• 100,000s of sources
Pain points
• Doesn’t scale elastically
• Customers want real-time alerts
• Expensive to operate
• Relies on eventually consistent storage
5. Expanding the Big Data Processing Landscape
Traditional Data Warehousing:
• Query engine approach
• Pre-computations such as indices and dimensional views improve performance
• Historical, structured data

Hadoop-Style Processing:
• Hive / SQL-on-Hadoop / MapReduce / Spark
• Batch programs, or other abstractions that break down into MapReduce-style computations
• Historical, semi-structured data

Stream Processing:
• Custom computations of relatively simple complexity
• Continuous processing – filters, sliding windows, aggregates – on infinite data streams
• Semi-structured/structured data, generated continuously in real-time
6. A Generalized Data Flow
Many different technologies, at different stages of evolution
Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting
7. Our Big Data Transition
Old Posture
• Capture huge amounts of data and process it in hourly or daily batches
New Requirements
• Make decisions faster, sometimes in real-time
• Scale the entire system elastically
• Make it easy to “keep everything”
• Multiple applications can process data in parallel
8. Foundation for Data Streams Ingestion, Continuous Processing
Right Toolset for the Right Job
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads
Continuous Processing FX
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Elastic
• Enable multiple apps to process in parallel
Enable data movement into Stores/ Processing Engines
Managed Service
Low end-to-end latency
Continuous, real-time workloads
10. Customer Scenarios across Industry Segments
Scenarios: (1) Accelerated Ingest-Transform-Load · (2) Continual Metrics/KPI Extraction · (3) Responsive Data Analysis
Data types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Segment | (1) Ingest-Transform-Load | (2) Metrics/KPI Extraction | (3) Responsive Data Analysis
Software/Technology | IT server and app log ingestion | IT operational metrics dashboards | Device/sensor operational intelligence
Digital Ad Tech/Marketing | Advertising data aggregation | Advertising metrics like coverage, yield, conversion | Analytics on user engagement with ads; optimized bid/buy engines
Financial Services | Market/financial transaction order data collection | Financial market data metrics | Fraud monitoring, Value-at-Risk assessment, auditing of market order data
Consumer Online/E-Commerce | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines
11. Big streaming data comes from the small
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@32473
class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
MQTT Record

<R,AMZN ,T,G,R1>
NASDAQ OMX Record
12. What Business Problem Needs to Be Solved?

Mobile/Social Gaming:
• Goal: Deliver continuous, real-time game insight data from hundreds of game servers
• Today: Custom-built solutions that are operationally complex to manage and not scalable
• Pain: Delays in delivering critical business data; developer burden of building a reliable, scalable platform for real-time data ingestion/processing; slow-down of real-time customer insights
• Outcome: Accelerate time to market of elastic, real-time applications while minimizing operational overhead

Digital Advertising Tech.:
• Goal: Generate real-time metrics and KPIs for online ad performance for advertisers/publishers
• Today: Store-and-forward fleet of log servers feeding a Hadoop-based processing pipeline
• Pain: Lost data in the store/forward layer; operational burden of managing a reliable, scalable platform for real-time data ingestion/processing; batch-driven rather than real-time customer insights
• Outcome: Generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients
14. Amazon Kinesis Architecture
Within AWS, the stream spans three Availability Zones:
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Millions of sources producing hundreds of terabytes per hour
• A front end handles authentication and authorization
• The ordered stream of events supports multiple readers:
  – Real-time dashboards and alarms
  – Machine learning algorithms or sliding-window analytics
  – Aggregate and archive to S3
  – Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million PUTs
15. Kinesis Stream:
Managed ability to capture and store data
• Streams are made of shards
• Each shard ingests data up to 1 MB/sec, and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by splitting or merging shards
• Replay data inside the 24-hour window
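The shard limits above imply a simple sizing rule: provision enough shards to cover whichever ingest dimension (bytes or records) is the bottleneck. A minimal sketch, using the stated per-shard limits:

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Shards required given the stated limits of 1 MB/sec and
    1,000 records/sec ingest per shard: take the larger of the two."""
    return max(math.ceil(mb_per_sec / 1.0),
               math.ceil(records_per_sec / 1000.0))

# 5 MB/sec of ~2 KB records (2,560 records/sec): bytes dominate.
assert shards_needed(5, 2560) == 5
# 0.5 MB/sec of tiny records at 4,000 records/sec: record rate dominates.
assert shards_needed(0.5, 4000) == 4
```

Splitting or merging shards, as the slide notes, is how a stream is resized when either dimension changes.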
16. Putting Data into Kinesis
Simple Put interface to store data in Kinesis
• Producers use a PUT call to store data in a stream: PutRecord {Data, PartitionKey, StreamName}
• A partition key, supplied by the producer, is used to distribute PUTs across shards
• Kinesis applies an MD5 hash to the supplied partition key to map each record into a shard’s hash key range
• A unique sequence number is returned to the producer upon a successful PUT call
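The MD5 mapping can be sketched locally without calling the service. The even split of the 128-bit hash key space below is an illustrative assumption (shard ranges are set when shards are created or split), and `shard_for_key` is our name, not a Kinesis API:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """MD5-hash the partition key into the 128-bit hash key space,
    then find the shard whose contiguous range contains it (assuming
    the space is divided evenly among shards)."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)

# The mapping is deterministic: records with the same partition key
# always land on the same shard, preserving per-key ordering.
assert shard_for_key("device-42", 4) == shard_for_key("device-42", 4)
assert 0 <= shard_for_key("device-42", 4) < 4
```

This is why the choice of partition key controls how evenly load spreads across shards: many distinct keys spread PUTs out, while one hot key concentrates them on a single shard.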
17. Building Kinesis Processing Apps: Kinesis Client Library
Open-source library for fault-tolerant, continuous processing apps
• Java client library; source available on GitHub
• Build your app with the KCL and deploy it on your EC2 instance(s)
• The KCL is an intermediary between your application and the stream
• Automatically starts a Kinesis worker for each shard, simplifying reads by abstracting individual shards
• Increase/decrease workers as the number of shards changes
• Checkpoints to keep track of a worker’s position in the stream; restarts workers if they fail
• Integrates with Auto Scaling groups to redistribute workers to new instances
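The worker/checkpoint pattern the KCL automates can be illustrated with a toy sketch (this is not the KCL’s actual API; all names here are ours, and the “shard” is just an in-memory list):

```python
class Checkpointer:
    """Remembers the last sequence number a worker has processed."""
    def __init__(self):
        self.last_sequence = None

    def checkpoint(self, seq):
        self.last_sequence = seq

class RecordProcessor:
    """One of these runs per shard; it processes records in order
    and checkpoints its position as it goes."""
    def __init__(self):
        self.processed = []

    def process_records(self, records, checkpointer):
        for seq, data in records:
            self.processed.append(data)
            checkpointer.checkpoint(seq)

# Simulate a shard as an ordered list of (sequence_number, data) pairs.
shard = [(1, "a"), (2, "b"), (3, "c")]
cp = Checkpointer()
RecordProcessor().process_records(shard, cp)

# A restarted worker resumes after the last checkpoint rather than
# reprocessing the whole stream -- here, nothing is left to redo.
resume_from = [(s, d) for s, d in shard if s > cp.last_sequence]
assert cp.last_sequence == 3 and resume_from == []
```

In the real library, checkpoints are stored durably (in DynamoDB) so that a replacement worker on another instance can pick up where a failed one left off.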
18. Amazon Kinesis Connector Library
Open-source code to connect Kinesis with S3, Redshift, and DynamoDB
• ITransformer – defines the transformation of records from the Amazon Kinesis stream to suit the user-defined data model
• IFilter – excludes irrelevant records from the processing
• IBuffer – buffers the set of records to be processed, by specifying a size limit (number of records) and a total byte count
• IEmitter – makes client calls to other AWS services and persists the records stored in the buffer
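The four interfaces compose into a filter → transform → buffer → emit pipeline. A minimal Python sketch of that flow (the real library defines these as Java interfaces; the class and parameter names below are illustrative, not the library’s API):

```python
class Pipeline:
    """Toy version of the connector pipeline: drop irrelevant records
    (IFilter), reshape the rest (ITransformer), accumulate them up to
    a size limit (IBuffer), then persist the batch (IEmitter)."""
    def __init__(self, transform, keep, buffer_limit, emit):
        self.transform, self.keep = transform, keep
        self.buffer_limit, self.emit = buffer_limit, emit
        self.buffer = []

    def handle(self, record):
        if not self.keep(record):                   # IFilter
            return
        self.buffer.append(self.transform(record))  # ITransformer
        if len(self.buffer) >= self.buffer_limit:   # IBuffer limit reached
            self.emit(self.buffer)                  # IEmitter
            self.buffer = []

emitted = []
p = Pipeline(transform=str.upper,
             keep=lambda r: r != "noise",
             buffer_limit=2,
             emit=emitted.append)
for rec in ["a", "noise", "b", "c"]:
    p.handle(rec)

# "noise" was filtered; "a" and "b" were emitted as one batch;
# "c" is still buffered, waiting for the next emit.
assert emitted == [["A", "B"]] and p.buffer == ["C"]
```

In the real connectors the emitter stage is where records are written to S3, Redshift, or DynamoDB.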
19. Sending & Reading Data from Kinesis Streams
Sending:
• HTTP POST
• AWS SDK
• AWS Mobile SDK
• Log4J
• Flume
• Fluentd
Consuming:
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
21. Amazon Elastic MapReduce (EMR)
Managed Service for Hadoop based data processing
• Managed service
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features that ensure customer success on AWS
22. Applying batch processing to streamed data
Client/Sensor → Recording Service → Aggregator/Sequencer → Continuous processor for dashboard → Storage → Analytics and Reporting
Streaming data ingestion and continuous processing: Amazon Kinesis; batch storage and analytics: Amazon EMR
23. What would this look like?
Input: users and developers hit my website, and a Log4J appender pushes events to Kinesis
Processing: EMR (Hive, Pig, Cascading, MapReduce) pulls from the Kinesis stream
24. Features and Functionality
• Offered starting with EMR AMI 3.0.4 – simply spin up the EMR cluster as normal
• Logical names – labels that define units of work (Job A vs. Job B)
• Iterations – provide idempotency (pessimistic locking of the logical name)
• Checkpoints – create input start and end points to allow batch processing
25. Iterations – the Run of a Job
The stream holds the last 24 hours of data for a logical name, from the trim-horizon sequence ID (now minus 24 hours) to the latest sequence ID (now). Successive iterations partition that window into fixed batches – e.g., Iteration 1: 1:00–7:00, Iteration 2: 7:00–13:00, Iteration 3: 13:00–19:00, Iteration 4: 19:00–1:00 – and the next iteration starts where the last one ended.
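The windowing above is easy to sketch: split the retention window into fixed, non-overlapping batches so each run has idempotent start and end points. A hedged illustration (not the connector’s actual API; the six-hour iteration length matches the slide’s example):

```python
from datetime import datetime, timedelta

def iteration_bounds(now: datetime, window_hours: int = 24,
                     iteration_hours: int = 6):
    """Divide the retention window (trim horizon .. now) into fixed
    iterations, each with explicit start/end points for batch reruns."""
    trim_horizon = now - timedelta(hours=window_hours)
    bounds, start = [], trim_horizon
    while start < now:
        end = min(start + timedelta(hours=iteration_hours), now)
        bounds.append((start, end))
        start = end
    return bounds

now = datetime(2014, 3, 1, 1, 0)
its = iteration_bounds(now)
assert len(its) == 4                              # four 6-hour iterations
assert its[0][0] == datetime(2014, 2, 28, 1, 0)   # trim horizon, -24 hours
assert its[-1][1] == now                          # latest point in the stream
```

Because each iteration’s boundaries are fixed, rerunning a failed job over the same iteration reads exactly the same input, which is what makes the reruns idempotent.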
30. Handling Errors
• The InputFormat handles service errors:
  – Throttling (HTTP 400)
  – Service unavailable (HTTP 503)
  – Internal server errors (HTTP 500)
  – HTTP client exceptions, such as socket connection timeouts
• Hadoop handles retry of failed map tasks
• Iterations allow retries:
  – Fixed input boundaries on a stream (idempotency for reruns)
  – Enable multiple queries on the same input boundaries
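The service-error handling amounts to a retry loop with backoff around each Kinesis call. A minimal sketch under stated assumptions (function and constant names are ours; the real InputFormat’s retry policy may differ in detail):

```python
import time

RETRYABLE = {400, 500, 503}  # throttling, internal error, service unavailable

def call_with_retries(call, max_retries=3, backoff_ms=100, sleep=time.sleep):
    """Retry retryable service errors a bounded number of times,
    waiting between attempts, then return whatever we last got."""
    for attempt in range(max_retries + 1):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_retries:
            sleep(backoff_ms / 1000.0)
    return status, body

# Simulate a service that throttles, then is unavailable, then succeeds.
responses = iter([(400, None), (503, None), (200, "records")])
status, body = call_with_retries(lambda: next(responses), sleep=lambda s: None)
assert (status, body) == (200, "records")
```

If retries are exhausted, Hadoop’s own task-retry machinery provides the next layer of recovery, and the fixed iteration boundaries make the rerun safe.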
31. Hadoop Ecosystem Implementation
Implementations:
• Hadoop InputFormat
• Hive storage handler
• Pig load function
• Cascading Scheme and Tap
Use Cases:
• Join multiple data sources for analysis
• Filter and preprocess streams
• Export and archive streaming data
32. Writing to Kinesis Using Log4J

Option | Default | Description
log4j.appender.KINESIS.streamName | AccessLogStream | Stream name to which data is published.
log4j.appender.KINESIS.encoding | UTF-8 | Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.
log4j.appender.KINESIS.maxRetries | 3 | Maximum number of retries when calling Kinesis APIs to publish a log message.
log4j.appender.KINESIS.backoffInterval | 100ms | Milliseconds to wait before a retry attempt.
log4j.appender.KINESIS.threadCount | 20 | Number of parallel threads for publishing logs to the configured Kinesis stream.
log4j.appender.KINESIS.bufferSize | 2000 | Maximum number of outstanding log messages to keep in memory.
log4j.appender.KINESIS.shutdownTimeout | 30 | Seconds allowed for sending buffered messages before the application JVM quits normally.
.error("Cannot find resource XYX… go do something about it!");
37. What’s Upworthy
• We’ve been called
– “Social media with a mission” by our About Page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me,
optimistically
38. What We Do
• We aim to drive massive amounts of attention to things that really matter.
• We do that by finding, packaging, and distributing great, meaningful content.
40. When We Started
• Had built a data warehouse from scratch
• Hadoop-based batch workflow
• Nightly ETL cycle
• 2.5 Engineers
• Wanted to do all three:
– Comprehensive
– Ad Hoc
– Real-Time
41. The Decision
• Speed up our current system, rather than building a parallel one
• Had looked at alternative stream processors – cost, maintenance
• Comfortable with the concept of an application log stream
42. How It Works
• Log Drain receives, formats, batches and zips
• PUTs 50k GZIP batches on Kinesis stream
• Three types of Kinesis consumers:
1. Archiver – Batch and write permanent record
2. Stats – Filter, sample and count; Report to StatHat
3. Transformer – Filter, batch, validate; writes temporary BSVs to S3
• Database Importer handles manifest files.
• S3 handles garbage collection.
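The log-drain step (format, batch, zip, PUT) can be sketched with the standard library. This is an illustrative reconstruction, not Upworthy’s code; the batch size here is tiny for demonstration, where theirs were ~50k GZIP batches per PUT:

```python
import gzip
import json

def make_put_batches(events, batch_size=3):
    """Group events, JSON-encode each group, and gzip it into a single
    payload suitable for one Kinesis PUT."""
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        yield gzip.compress(json.dumps(batch).encode("utf-8"))

events = [{"id": n} for n in range(5)]
payloads = list(make_put_batches(events))
assert len(payloads) == 2  # 3 events + 2 events

# Any of the three consumer types can unzip a payload and recover
# the original events losslessly.
recovered = json.loads(gzip.decompress(payloads[0]))
assert recovered == [{"id": 0}, {"id": 1}, {"id": 2}]
```

Batching and compressing before the PUT is what keeps per-record overhead (and PUT-transaction cost) low at thousands of events per second.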
43. Our system now
• Stats:
  – Average: ~1,085 events/second
  – Peak: ~2,500 events/second
• Data is available in Redshift in < 10 minutes
• Kinesis has been cheap and stable, and gives us redundancy and resiliency
• A computation model that’s easy to reason about
44. Resiliency
• When something goes wrong, you have 24 hours.
• Timestamp at the outset; track lag at each step.
• Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.
46. Some Lessons
• You can use one pipeline for everything.
• High-cardinality fact data belongs in Kinesis.
• EDN works well with Kinesis.
• We prefer explicit checkpointing. (Your mileage may vary.)
• Languages that run on the JVM can take advantage of AWS client libraries.
47. Kinesis Pricing
Simple, pay-as-you-go, no up-front costs

Pricing Dimension | Value
Hourly shard rate | $0.015
Per 1,000,000 PUT transactions | $0.028

• Customers specify throughput requirements in shards, which they control
• Each shard delivers 1 MB/s on ingest and 2 MB/s on egress
• Inbound data transfer is free
• EC2 instance charges apply for Kinesis processing applications
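A back-of-the-envelope cost sketch using only the two prices on this slide ($0.015 per shard-hour, $0.028 per million PUTs); the example workload is hypothetical, and EC2 charges for processing apps are excluded:

```python
SHARD_HOUR = 0.015          # $ per shard per hour
PER_MILLION_PUTS = 0.028    # $ per 1,000,000 PUT transactions

def monthly_cost(shards, puts_per_second, hours=24 * 30):
    """Stream cost for a 30-day month: shard-hours plus PUT volume."""
    shard_cost = shards * hours * SHARD_HOUR
    put_cost = puts_per_second * 3600 * hours / 1_000_000 * PER_MILLION_PUTS
    return round(shard_cost + put_cost, 2)

# 2 shards sustaining 1,000 PUTs/sec for a month:
# shards: 2 * 720 h * $0.015 = $21.60; PUTs: 2,592M * $0.028/M = $72.58
assert monthly_cost(2, 1000) == 94.18
```

The shard-hour term dominates for low-volume streams, while PUT volume dominates at high record rates, which is why batching small events before the PUT (as in the Upworthy pipeline) cuts cost.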
48. Canonical Data flows with Amazon Kinesis
• Continuous metric extraction
• Incremental stats computation
• Record archiving
• Live dashboards
49. Try out Amazon Kinesis
• Try out Amazon Kinesis
– http://aws.amazon.com/kinesis/
• Thumb through the Developer Guide
– http://aws.amazon.com/documentation/kinesis/
• Test drive the sample app
– https://github.com/awslabs/amazon-kinesis-data-visualization-sample
• Kinesis Connector Framework
– https://github.com/awslabs/amazon-kinesis-connectors
• Read EMR-Kinesis FAQs
– http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector
• Visit, and Post on Kinesis Forum
– https://forums.aws.amazon.com/forum.jspa?forumID=169#