Real-time streaming analytics has become popular across many verticals and use cases. In ad tech, gaming, financial services, and IoT, AWS customers use the Amazon Kinesis platform to ingest billions of events every day and process them in real time. In this session, we will discuss Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics. We will show best practices and design patterns for integrating the Amazon Kinesis platform with other services such as Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, and AWS Lambda, as well as third-party connectors such as Apache Storm, Apache Spark, and more.
2. What to expect from this session
• Streaming scenarios
• Amazon Kinesis Platform Overview
• Amazon Kinesis in Big Data Analytics/ML/IoT
• Demo
• Use Case – Processing 100M events per second
• Summary
• Q&A
3. Most data is produced continuously
Mobile Apps | Web Clickstream | Application Logs | Metering Records | IoT Sensors | Smart Buildings
Example log record:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
4. The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
• If you have the means to combine them
5. Processing real-time, streaming data
What are the key requirements?
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Ingest → Transform → Analyze → React → Persist
8. Publish/Subscribe: Key Concepts
• Data producers are decoupled from data consumers
• Publishers don’t know who the consumers are, and vice versa
• The queue is responsible for providing durable storage of messages for some length of time
• A message can be consumed from the queue [0..n] times
• There is no requirement that a message be delivered exactly once or at least once
• 1:n relationship between publishers and subscribers
• “Fan-out” effect
• A single message need not be delivered to all subscribers
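To make the [0..n] delivery semantics concrete, here is a minimal boto3 sketch (the stream name and payload are hypothetical): a producer publishes without knowing its consumers, and two consumers independently read the same record using their own shard iterators.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: publishes without knowing who (if anyone) will consume.
kinesis.put_record(
    StreamName="example-stream",          # hypothetical stream name
    Data=b'{"event": "page_view", "user": "u123"}',
    PartitionKey="u123",
)

# Two independent consumers: each gets its own iterator over the same
# shard, so the record can be read [0..n] times (fan-out).
shard_id = kinesis.describe_stream(StreamName="example-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]

for consumer in ("consumer-a", "consumer-b"):
    it = kinesis.get_shard_iterator(
        StreamName="example-stream",
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from oldest retained record
    )["ShardIterator"]
    records = kinesis.get_records(ShardIterator=it)["Records"]
    print(consumer, "read", len(records), "records")
```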
9. Streaming Data Scenarios Across Verticals
Scenarios per vertical (Accelerated Ingest-Transform-Load | Continuous Metrics Generation | Machine Learning and Actionable Insights):
• Digital Ad Tech/Marketing: publisher and bidder data aggregation | advertising metrics like coverage, yield, and conversion | user engagement with ads, optimized bid/buy engines
• IoT: sensor and device telemetry data ingestion | operational metrics and dashboards | device operational intelligence and alerts
• Gaming: online data aggregation, e.g., top 10 players | massively multiplayer online game (MMOG) live dashboard | leaderboard generation, player-skill match
• Consumer Online: clickstream analytics | metrics like impressions and page views | recommendation engines, proactive care
• Operations Security: DevOps tools, ingesting VPC Flow Logs | subscribe to CloudWatch Logs and analyze logs in real time | anomaly detection
10. Amazon Kinesis: Streaming Data Done the AWS Way
Makes it easy to capture, deliver, and process real-time data streams
• Pay as you go, no up-front costs
• Elastically scalable
• Right services for your specific use cases
• Real-time latencies
• Easy to provision, deploy, and manage
11. Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
• Amazon Kinesis Streams (for technical developers): build your own custom applications that process or analyze streaming data
• Amazon Kinesis Firehose (for all developers and data scientists): easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
• Amazon Kinesis Analytics (for all developers and data scientists): easily analyze data streams using standard SQL queries
12. Amazon Kinesis Customer Base Diversity
• 1 billion events/week from connected devices | IoT
• 17 PB of game data per season | Entertainment
• 80 billion ad impressions/day, 30 ms response time | Ad Tech
• 100 GB/day of clickstreams from 250+ sites | Enterprise
• 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
• 10 million events/day | Retail
• Amazon Kinesis as a databus: migrated from Kafka to Kinesis, funneling all production events through Amazon Kinesis | Enterprise
16. Amazon Kinesis Streams
Build your own data streaming applications
Easy administration: Simply create a new stream and set the desired level of capacity with shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data using the Kinesis Client Library (KCL), Kinesis Analytics, Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
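As a sketch of how little administration is involved, the following boto3 snippet (stream name and region are illustrative) creates a stream sized by shard count and waits for it to become active:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Size the stream for the expected throughput: each shard accepts up to
# 1 MB/sec or 1,000 records/sec, so 4 shards cover roughly 4 MB/sec ingest.
kinesis.create_stream(StreamName="clickstream", ShardCount=4)

# Stream creation is asynchronous; wait until the stream is ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream")
```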
17. Sending and reading data from Streams
Sending: AWS SDK, AWS Mobile SDK, Kinesis Producer Library, Kinesis Agent, LOG4J, Flume, Fluentd
Consuming: Get* APIs, Kinesis Client Library + Connector Library, AWS Lambda, Apache Storm, Apache Spark, Amazon Elastic MapReduce
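A hedged end-to-end sketch of the simplest send/consume path, using the plain AWS SDK (boto3) Put* and Get* APIs; names are hypothetical, and a production consumer would more likely use the KCL, which handles checkpointing and shard rebalancing for you:

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Sending: batch up to 500 records per PutRecords call.
events = [{"user": f"u{i}", "action": "click"} for i in range(100)]
kinesis.put_records(
    StreamName="clickstream",
    Records=[
        {"Data": json.dumps(e).encode(), "PartitionKey": e["user"]}
        for e in events
    ],
)

# Consuming: poll one shard with the Get* APIs.
shard_id = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",   # oldest retained record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for rec in resp["Records"]:
        print(rec["SequenceNumber"], json.loads(rec["Data"]))
    iterator = resp["NextShardIterator"]
    time.sleep(1)   # stay within the 5 GetRecords calls/sec/shard limit
```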
18. Amazon Kinesis Stream: managed ability to capture and store data
• Streams are made of shards
• Each shard ingests up to 1 MB/sec, and up to 1,000 PUT records/sec
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours by default, and up to 7 days with extended retention
• Scale Kinesis streams by splitting or merging shards
• Replay data within the 24-hour to 7-day retention window (time-based seek)
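These scaling and replay operations map directly onto API calls. A boto3 sketch (stream name and values are illustrative): split a shard at the midpoint of its hash-key range, extend retention to 7 days, and seek by timestamp:

```python
import datetime
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Scale up: split a hot shard in two at the midpoint of its hash-key range.
shard = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="clickstream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)

# Extend retention from the 24-hour default up to 7 days (168 hours).
kinesis.increase_stream_retention_period(
    StreamName="clickstream", RetentionPeriodInHours=168
)

# Time-based seek: replay everything from one hour ago.
it = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard["ShardId"],
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
)["ShardIterator"]
```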
19. Amazon Kinesis Streams pricing
Simple, pay-as-you-go, and no up-front costs
Dimension                                                Value
Shard Hour                                               $0.015
PUT Payload Units (per 1,000,000 units)                  $0.014
Extended Data Retention (up to 7 days), per Shard Hour   $0.020
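A back-of-the-envelope cost check based on the table above, assuming 10 fully utilized shards and records of 25 KB or less (one PUT payload unit each):

```python
SHARD_HOUR = 0.015   # $ per shard-hour
PUT_UNIT   = 0.014   # $ per 1,000,000 PUT payload units

shards, hours = 10, 24 * 30
records_per_sec = 1_000 * shards        # 1,000 records/sec per shard

shard_cost = shards * hours * SHARD_HOUR                    # $108.00
put_cost = records_per_sec * 3600 * hours / 1e6 * PUT_UNIT  # $362.88
print(f"~${shard_cost + put_cost:,.2f}/month")              # ~$470.88
```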
21. Streaming Data Scenarios Across Verticals (recap of slide 9)
22. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
Seamless elasticity: Seamlessly scales to match data throughput without intervention.
Serverless ETL using AWS Lambda: Firehose can invoke your Lambda function to transform incoming source data (see the sketch below).
Flow: capture and submit streaming data → Firehose loads it continuously into Amazon S3, Redshift, and Elasticsearch → analyze it with your favorite BI tools
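The serverless-ETL bullet refers to Firehose's Lambda transformation hook. A minimal Python handler following the documented record contract (each incoming record carries a recordId and base64 data, and must be returned with a result of Ok, Dropped, or ProcessingFailed); the transform itself is a placeholder:

```python
import base64
import json

def handler(event, context):
    """Firehose invokes this Lambda with a batch of records; each record
    must be returned with its original recordId and a result field."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True            # example in-line transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": output}
```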
23. Amazon Kinesis Firehose
Capture IT and app logs, device and sensor data, and more; enable near-real-time analytics using existing tools
Sources: AWS platform SDKs, mobile SDKs, Kinesis Agent, AWS IoT
Destinations: Amazon S3, Amazon Redshift
• Send data from IT infrastructure, mobile devices, and sensors
• Integrated with the AWS SDK, agents, and AWS IoT
• Fully managed service to capture streaming data
• Elastic, with no resource provisioning
• Pay-as-you-go: 3.5 cents/GB transferred
• Batches, compresses, and encrypts data before loads
• Loads data into Amazon Redshift tables by using the COPY command
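Submitting data is a single API call; a boto3 sketch with a hypothetical delivery stream name (Firehose then batches, compresses, and delivers on your behalf):

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Submit one record; Firehose buffers and delivers it to the configured
# destination (S3, Redshift via S3, or Elasticsearch).
firehose.put_record(
    DeliveryStreamName="web-logs",          # hypothetical delivery stream
    Record={"Data": b'{"path": "/home", "status": 200}\n'},
)
```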
24. Amazon Kinesis Firehose pricing
Simple, pay-as-you-go, and no up-front costs
Dimension                   Value
Per 1 GB of data ingested   $0.035
26. Amazon Kinesis Firehose vs. Amazon Kinesis Streams
Amazon Kinesis Streams is for use cases that require custom processing per incoming record, with sub-second processing latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero administration, the ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch, and a data latency of 60 seconds or higher.
27. Amazon Kinesis agent
A software agent that makes submitting data to Firehose easy
• Monitors files and sends new data records to your delivery stream
• Handles file rotation, checkpointing, and retry upon failure
• Preprocessing capabilities such as format conversion and log parsing
• Delivers all data in a reliable, timely, and simple manner
• Emits Amazon CloudWatch metrics to help you better monitor and troubleshoot the streaming process
• Supported on Amazon Linux AMI version 2015.09 or later, or Red Hat Enterprise Linux version 7 or later; install on Linux-based server environments such as web servers, front ends, and log servers
• Also works with Streams (see the sample configuration below)
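For reference, a minimal agent configuration in the agent's own JSON format (/etc/aws-kinesis/agent.json); file paths and stream names are hypothetical. The first flow sends to a Kinesis stream, the second to a Firehose delivery stream:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/httpd/access_log*",
      "kinesisStream": "clickstream"
    },
    {
      "filePattern": "/var/log/app/*.log",
      "deliveryStream": "web-logs"
    }
  ]
}
```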
29. Streaming Data Scenarios Across Verticals (recap of slide 9)
30. Amazon Kinesis Analytics
Apply SQL on streams: Easily connect to a Kinesis stream or a Firehose delivery stream and apply your SQL skills.
Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies.
Easy scalability: Elastically scales to match data throughput.
Flow: connect to Kinesis streams or Firehose delivery streams → run standard SQL queries against the data streams → Kinesis Analytics sends processed data to analytics tools so you can create alerts and respond in real time
31. Use SQL to build real-time applications
• Connect to a streaming source
• Easily write SQL code to process streaming data
• Continuously deliver SQL results
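A sketch of what such an application looks like through the API (boto3; names are hypothetical). The application code is standard SQL: a pump continuously inserts the results of a filter query into an in-application output stream. Input and output mappings are attached separately (add_application_input / add_application_output) and are omitted here:

```python
import boto3

kda = boto3.client("kinesisanalytics", region_name="us-east-1")

application_code = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "user" VARCHAR(16), "action" VARCHAR(32));

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "user", "action"
    FROM "SOURCE_SQL_STREAM_001"      -- default name of the mapped input
    WHERE "action" = 'purchase';
"""

kda.create_application(
    ApplicationName="clickstream-filter",
    ApplicationCode=application_code,
)
```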
32. Generate time series analytics
• Compute key performance indicators over time windows
• Combine with historical data in Amazon S3 or Amazon Redshift
Flow: Streams / Firehose → Amazon Kinesis Analytics → Streams / Firehose → Amazon Redshift, Amazon S3, or custom real-time destinations
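Time-windowed KPIs are expressed with a windowed GROUP BY. A hedged example of application code for a one-minute tumbling window (schema and column names are illustrative):

```python
# One-minute tumbling window: event counts per user, emitted continuously.
windowed_kpis = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "user" VARCHAR(16), event_count INTEGER);

CREATE OR REPLACE PUMP "KPI_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "user", COUNT(*)
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "user",
             FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);
"""
```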
33. Feed real-time dashboards
• Validate and transform raw data, and then process it to calculate meaningful statistics
• Send processed data downstream for visualization in BI and visualization services
Flow: Streams / Firehose → Amazon Kinesis Analytics → Amazon QuickSight, Amazon Elasticsearch Service, Amazon Redshift, or Amazon RDS
34. Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications
Flow: Streams / Firehose → Amazon Kinesis Analytics → Streams → AWS Lambda, Amazon SNS, or Amazon CloudWatch
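One way to wire the reaction step: attach a Lambda function to the Analytics output stream and publish notifications through SNS. A sketch with a hypothetical topic ARN, field name, and threshold; Lambda's Kinesis event source delivers records base64-encoded:

```python
import base64
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:stream-alerts"  # hypothetical

def handler(event, context):
    # Each record from the Kinesis event source arrives base64-encoded.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("error_rate", 0) > 0.05:   # illustrative threshold
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="Error-rate alarm",
                Message=json.dumps(payload),
            )
```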
38. Our Journey
AWS Metering service
• 100s of millions of billing records per second
• Terabytes++ per hour
• Hundreds of thousands of sources
• For each customer: gather all metering records and compute the monthly bill
• Auditors guarantee 100% accuracy at month's end
Seemed perfectly reasonable to run as a batch, but there was relentless pressure for real time…
With a data warehouse to load:
• 1,000s of extract-transform-load (ETL) jobs every day
• Hundreds of thousands of files per load cycle
• Thousands of daily users, hundreds of queries per hour, demanding real-time insight
39. Our Journey
Other Service Teams, Similar Requirements
• CloudWatch Logs and CloudWatch Metrics
• CloudFront API logging
• ‘Snitch’ internal datacenter hardware metrics
43. Conclusion
• Amazon Kinesis offers a managed service to build applications, ingest streaming data, and process it continuously
• Ingest aggregated data using the Amazon Kinesis Producer Library
• Process data using the Amazon Kinesis Connector Library and open-source connectors
• Determine your partition key strategy
• Try out Amazon Kinesis at http://aws.amazon.com/kinesis/
• Building a Streaming Analytics Pipeline
46. Reference
Technical documentation:
• Amazon Kinesis Agent
• Amazon Kinesis Streams and Spark Streaming
• Amazon Kinesis Producer Library Best Practices
• Amazon Kinesis Firehose and AWS Lambda
• Building a Near-Real-Time Discovery Platform with Amazon Kinesis
Public case studies:
• Glu Mobile – Real-Time Analytics
• Hearst Publishing – Clickstream Analytics
• How Sonos Leverages Amazon Kinesis
• Nordstrom Online Stylist
47. Getting Started with Kinesis – Useful Links
Try out Amazon Kinesis
• http://aws.amazon.com/kinesis/
Thumb through the Developer Guide
• http://aws.amazon.com/documentation/kinesis/
Test drive the sample app
• https://github.com/awslabs/amazon-kinesis-data-visualization-sample
Kinesis Connector Framework
• https://github.com/awslabs/amazon-kinesis-connectors
Read the EMR-Kinesis FAQs
• http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector
Visit, and post on, the Kinesis Forum
• https://forums.aws.amazon.com/forum.jspa?forumID=169#
Editor's notes
Amazon Kinesis Platform – The Complete Overview
Amazon Kinesis Firehose – Deep Dive
Amazon Kinesis Firehose was launched at the end of 2015 and provides an easy way for customers to ingest streaming data into Amazon Redshift, Amazon Elasticsearch, and Amazon S3. In 2016, we introduced new features that help customers implement in-line processing within Kinesis Firehose using AWS Lambda. Amazon Kinesis Firehose makes it easy to load real-time, streaming data into AWS without having to build custom stream processing applications. This session is designed for developers, data engineers, and analysts who are looking to load, analyze, and get powerful insights from real-time streaming data using existing analytics tools. We will explore key features and walk through how to use Kinesis Firehose to ingest streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, and the Amazon Elasticsearch service.
At re:Invent in October 2015, we introduced Amazon Kinesis Firehose, a fully managed service that delivers streaming data to S3, Redshift, and other destinations for downstream processing and analytics.
We will cover:
Key concepts
Experience for S3 and Redshift
Putting data using the Amazon Kinesis Agent and Key APIs
Pricing
Data delivery patterns
Understanding key metrics
Troubleshooting
Narrative: The reality is that most data is produced continuously and is coming at us at lightning speeds due to an explosive growth of real-time data sources.
TP: Machine data will make up 40% of our digital universe by 2020
Narrative: Whether it is log data coming from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, it all delivers information that can help companies learn about what their customers, organization, and business are doing right now.
TP: Customer Benefits
Improve operational efficiencies, improve customer experiences, new business models
Smart building: reduce energy costs, cut maintenance, increase safety and security
Smart textiles: monitor skin temperature, monitor stress
Narrative: So how much is this data worth? Well, it depends…
Recent data is highly valuable
If you act on it in time
Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
If you have the means to combine them
Narrative: Processing real-time data as it arrives can let you make decisions much faster and get the most value from your data. But building your own custom applications to process streaming data is complicated and resource intensive. You need to train or hire developers with the right skillsets, then wait for months for the applications to be built and fine-tuned, and then operate and scale the application as the business grows.
All of this takes lots of time and money, and, at the end of the day, lots of companies just never get there, settle for the status-quo, and live with information that is hours or days old.
Narrative: You need a different set of analytical tools to collect and analyze real-time streaming data than what you have traditionally used for data at rest. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach. Instead of running database queries on stored data, streaming analytics platforms have to process the data continuously and before the data lands in a database. And streaming data comes in at an incredible rate that can vary up and down all the time. Streaming analytics platforms have to be able to process this data when it arrives, often at speeds of millions and even tens of millions of events per hour.
Key requirements of stream processing
Durable: Durable ingest so that processing can be repeatable;
Continuous - Need to always be processing the latest data
Fast: Frequency (micro batches, size of batches, true streaming), and speed (sub-second, minute, hour)
Correct: at most once, at least once, and exactly once processing; event time, ingest time, processing time.
Reactive: Ability to process and respond in near real-time; feedback mechanisms to send processed data to live applications
Reliable: Highly available, fast failovers
The benefits of real-time processing apply to pretty much every scenario where new data is being generated on a continual basis.
We see it pervasively across all the big data areas we’ve looked at, and in pretty much every industry segment whose customers we’ve spoken to.
Customers tend to start with mundane, unglamorous applications, such as system log ingestion and processing.
They then progress over time to ever-more sophisticated forms of real-time processing.
Initially they may simply process their data streams to produce simple metrics and reports, and perform simple actions in response, such as emitting alarms over metrics that have breached their warning thresholds.
Eventually they start to do more sophisticated forms of data analysis in order to obtain deeper understanding of their data, such as applying machine learning algorithms.
In the long run they can be expected to apply a multiplicity of complex stream and event processing algorithms to their data.
We then looked at specific use cases
Alert when public sentiment changes
Customers want to respond quickly to social media feeds
Twitter Firehose: ~450 million events per day, or 10.1 MB/sec
Respond to changes in the stock market
Customers want to compute value-at-risk or rebalance portfolios based on stock prices
NASDAQ Ultra Feed: 230 GB/day, or 8.1 MB/sec
Aggregate data before loading in external data store
Customers have billions of click stream records arriving continuously from servers
Large, aggregated PUTs to S3 are cost effective
Internet marketing company: 20MB/sec
Recommendation and Personalization engines
Easy to use: Focus on quickly launching data streaming applications instead of managing infrastructure.
Real-Time: Collect real-time data streams and promptly respond to key business events and operational triggers.
Flexible: Choose the service, or combination of services, for your specific data streaming use cases.
At re:Invent in November 2013, we introduced Amazon Kinesis to the world: a platform of services to work with streaming data.
Sonos runs near real-time streaming analytics on device data logs from their connected hi-fi audio equipment.
Hearst: Analyzing 30TB+ clickstream data enabling real-time insights for Publishers.
Nordstrom's recommendation team built an online stylist using Amazon Kinesis Streams and AWS Lambda.
Lots of ways to process streaming data, just like there are lots of ways to process batch data with Hadoop
Open source from AWS
A moving time-window buffer; the oldest records expire after 24 hours.
Kinesis Streams
VPC Endpoints
Auto-scaled streams based on incoming workloads – no manual Shard changes
Support for more than 5 reads per second – up to 15-20 reads per second – to enable more simultaneous consuming applications
Connect to streaming source - Select Kinesis Streams or Firehose delivery streams as input data for your SQL code
Easily write SQL code to process streaming data - Author applications with a single SQL statement, or build sophisticated applications with multi-step pipelines with advanced analytical functions
Continuously deliver SQL results - Configure one to many outputs for your processed data for real-time alerts, visualizations, and batch analytics
There are three major components of a streaming application: connecting to an input stream, SQL code, and output destinations.
Inputs include Kinesis Streams and Firehose.
Standard SQL code consists of one or more SQL queries which make up an application. Most applications will have one to three SQL statements that perform ETL after initial schematization, and then several SQL queries to generate real-time analytics on the transformed data. This illustrates the point of the “streaming application” concept; you are chaining SQL statements together in serial and parallel through filters, joins, and merges to build a streaming graph within your application. We believe most customers will have between 5-10 SQL statements, but some customers will build sophisticated applications with 50-100 statements.
Outputs include Kinesis Streams and Firehose (S3, Redshift, and Elasticsearch, coming at the Chicago Summit). The next outputs will be CloudWatch Metrics and QuickSight, but we have yet to confirm whether these will be in place for GA.
This type of application is useful in many scenarios, such as:
Continuously computing business metrics like the number of purchases, sending results to a data warehouse like Amazon Redshift for visualization with a BI tool like Amazon QuickSight.
Or calculating the number of application errors and sending results to Amazon Elasticsearch for visualization with Kibana.
Amazon Kinesis Analytics provides the mechanisms to take raw, real-time data and output processed, meaningful information to a variety of destinations in real-time.