Analytics on write is a four-step process where we:
1. Read our events from our event stream
2. Analyze our events using a stream processing framework
3. Write the summarized output of our analysis to some form of storage target
4. Serve the summarized output into real-time dashboards, reports and similar
In this talk I introduced the concept of analytics on write, compared it to analytics on read and then talked through an example implementation using Amazon Kinesis and AWS Lambda.
2. Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
4. It’s easier to start by explaining analytics on read, which is much more widely practised and understood
1. Write all of our events to some kind of event store
2. Read the events from our event store to perform some analysis
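To make the shape concrete, here is a minimal Scala sketch of analytics on read, with an in-memory buffer standing in for the event store and a toy aggregation standing in for the batch analysis (all names here are illustrative):

// Analytics on read: store everything first, analyze later.
case class Event(truckId: String, eventType: String, timestamp: Long)

object AnalyticsOnRead {
  // 1. Write all of our events to some kind of event store
  //    (in reality: S3, HDFS, a database - here, a buffer)
  private val eventStore = scala.collection.mutable.ArrayBuffer.empty[Event]
  def write(event: Event): Unit = eventStore += event

  // 2. Read the events back to perform some analysis, over the
  //    full history, whenever we choose to ask the question
  def eventsPerTruck: Map[String, Int] =
    eventStore.groupBy(_.truckId).map { case (id, es) => id -> es.size }
}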
5. In analytics on write, the analysis is performed on the events in-stream (i.e. before reaching storage)
• Read our events from our event stream
• Analyze our events using a stream processing framework
• Write the summarized output of our analysis to some storage target
• Serve the summarized output into real-time dashboards, reports etc
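The same toy problem restructured as analytics on write: the summary is maintained incrementally as each event arrives, so serving a dashboard is just a key lookup. Again a sketch with illustrative names, with a mutable map standing in for the storage target:

// Analytics on write: analyze in-stream, store only the summary.
case class Event(truckId: String, eventType: String, timestamp: Long)

object AnalyticsOnWrite {
  // Storage target: truck ID -> running event count
  private val summary =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  // Steps 1-3: read an event off the stream, analyze it in-flight,
  // write the updated summary to the storage target
  def onEvent(event: Event): Unit =
    summary(event.truckId) += 1

  // Step 4: serve the summary to dashboards and reports
  def serve(truckId: String): Int = summary(truckId)
}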
6. Analytics on write and analytics on read are good at different things, and leverage different technologies
7. With a unified log powered by Kafka or Kinesis, you can apply both analytical approaches to your event stream
• Apache Kafka and Amazon Kinesis make it easy to have multiple consuming apps on the same event stream
• Each consuming app can maintain its own “cursor position” on the stream (see the sketch below)
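For instance, in Kafka each consumer group commits its own offsets, so an analytics-on-write app and an archiving app can read the same topic independently. A sketch using the standard Java client from Scala (broker address and topic name are assumptions):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

def consumerFor(groupId: String): KafkaConsumer[String, String] = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed local broker
  props.put("group.id", groupId)                   // the per-app "cursor"
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("oops-events")) // hypothetical topic
  consumer
}

// Two apps, two independent cursor positions on the same stream:
val onWriteApp = consumerFor("analytics-on-write")
val archiveApp = consumerFor("archive-to-hdfs")
val batch = onWriteApp.poll(Duration.ofMillis(500)) // advances only its own offsets

Kinesis offers the same property: each Kinesis Client Library application checkpoints its own position (in a DynamoDB table), so consuming apps never contend over a shared cursor.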
9. What are some good use cases for getting started with analytics on write?
1. Low-latency operational reporting, which must be fed from the incoming event streams in as close to real-time as possible
2. Dashboards to support thousands of simultaneous users; for example, a freight company might share a parcel tracker on its website for customers
Others? Please share your thoughts!
10. Analytics on write is a very immature space – there’s only a handful of tools and frameworks available so far…
PipelineDB
• Analytics on write (“continuous views”) using SQL
• Implemented as a Postgres fork
• Supports Kafka but no sharding yet (I believe)
amazon-kinesis-aggregators
• Reads from Kinesis streams and outputs to DynamoDB & CloudWatch
• JSON-based query recipes
• Written by Ian Meyers here in London
Druid
• Hybrid analytics on write, analytics on read
• Very rich JSON-based query language
• Supports Kafka
11. … or we can implement a bespoke analytics on write solution – for example with AWS Lambda
• The central idea of AWS Lambda is that developers should be writing functions, not servers
• With Lambda, we write self-contained functions to process events, and then we publish those functions to Lambda to run
• We don’t worry about developing, deploying or managing servers – instead, Lambda takes care of auto-scaling our functions to meet the incoming event volumes
12. An AWS Lambda function is stateless and exists only for the side effects that it performs
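A minimal sketch of what such a function looks like in Scala, using the aws-lambda-java-core and aws-lambda-java-events libraries (a simplified stand-in, not the book's actual AowLambda code):

import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.KinesisEvent
import scala.collection.JavaConverters._

class AowLambdaSketch extends RequestHandler[KinesisEvent, Void] {
  // Stateless: no instance state is relied on between invocations.
  // The function exists only for its side effects (e.g. DynamoDB writes).
  override def handleRequest(event: KinesisEvent, context: Context): Void = {
    for (record <- event.getRecords.asScala) {
      val buf   = record.getKinesis.getData    // raw event bytes as a ByteBuffer
      val bytes = new Array[Byte](buf.remaining)
      buf.get(bytes)
      val json  = new String(bytes, "UTF-8")
      // ... parse the event and update the summary row here ...
      context.getLogger.log(s"Processed: $json")
    }
    null // Void return
  }
}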
14. Let’s imagine that we have a global delivery company called OOPS, which has five event types
15. OOPS management want a near-real-time dashboard to tell them two things:
1. Where is each of our delivery trucks now?
2. How many miles has each of our delivery trucks driven since its last oil change?
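In other words, the storage target only needs one summary row per truck; something like the following illustrative Scala model (field names are assumptions, not the book's exact schema):

import java.util.Date

// One row per truck in the storage target (e.g. a DynamoDB item)
case class TruckSummary(
  vin: String,                   // partition key: the truck's VIN
  latitude: Double,              // 1. where the truck is now...
  longitude: Double,
  locationTimestamp: Date,       //    ...as of its latest location event
  milesSinceOilChange: Int       // 2. running total, reset on oil change
)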
22. To simplify the demo, I performed some configuration steps already (1/2)
1. Downloaded the Scala code from https://github.com/alexanderdean/Unified-Log-Processing/tree/master/ch11/11.2/aow-lambda
2. Built a “fatjar” for my Lambda function ($ sbt assembly)
3. Uploaded my fatjar to Amazon S3 ($ aws s3 cp …)
4. Ran a CloudFormation template to set up permissions for my Lambda, available here: https://ulp-assets.s3.amazonaws.com/ch11/cf/aow-lambda.template
5. Registered my Lambda function with AWS Lambda ($ aws lambda create-function …)
23. To simplify the demo, I performed some configuration steps already (2/2)
6. Created a Kinesis stream ($ aws kinesis create-stream …)
7. Created a DynamoDB table ($ aws dynamodb create-table …)
8. Configured the registered Lambda function to use the Kinesis stream as its input ($ aws lambda create-event-source-mapping --event-source-arn ${stream_arn} --function-name AowLambda --enabled --batch-size 100 --starting-position TRIM_HORIZON)
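Once that mapping is live, Lambda invokes the function with batches of up to 100 Kinesis records, and the function folds them into the DynamoDB table. A sketch of the mileage half of that update, using the AWS SDK for Java v1 from Scala (the table and attribute names are assumptions):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, UpdateItemRequest}
import scala.collection.JavaConverters._

val dynamodb = AmazonDynamoDBClientBuilder.defaultClient()

// Atomically add newly driven miles to the truck's running total.
// ADD creates the attribute if it doesn't exist yet, so no
// read-modify-write cycle is needed.
def addMiles(vin: String, miles: Int): Unit = {
  val request = new UpdateItemRequest()
    .withTableName("oops-trucks") // assumed name for the table from step 7
    .withKey(Map("vin" -> new AttributeValue(vin)).asJava)
    .withUpdateExpression("ADD milesSinceOilChange :m")
    .withExpressionAttributeValues(
      Map(":m" -> new AttributeValue().withN(miles.toString)).asJava)
  dynamodb.updateItem(request)
}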
24. Finally, let’s feed in some OOPS events…
host$ vagrant ssh
guest$ cd /vagrant/ch11/11.1
guest$ ./generate.py
Wrote DriverDeliversPackage with timestamp 2015-01-11 00:49:00
Wrote DriverMissesCustomer with timestamp 2015-01-11 04:07:00
Wrote TruckArrivesEvent with timestamp 2015-01-11 04:56:00
Wrote DriverDeliversPackage with timestamp 2015-01-11 06:16:00
Wrote TruckArrivesEvent with timestamp 2015-01-11 07:35:00
25. … and check our Kinesis stream, Lambda function and DynamoDB table
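Checking the DynamoDB table can be done from the console, or programmatically; a quick sketch of reading one truck's summary row back (same assumed table name, with a made-up VIN):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, GetItemRequest}
import scala.collection.JavaConverters._

val dynamodb = AmazonDynamoDBClientBuilder.defaultClient()

val result = dynamodb.getItem(new GetItemRequest()
  .withTableName("oops-trucks") // assumed table name
  .withKey(Map("vin" -> new AttributeValue("1HGCM82633A004352")).asJava))

// Latest location + miles since oil change for this truck
println(result.getItem)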
27. Further reading
Chapter 11, Analytics on write (Manning Deal of the Day today! Discount code: dotd110415au, 50% off just today)
• https://www.pipelinedb.com/
• https://github.com/awslabs/amazon-kinesis-aggregators/
• http://druid.io/
• https://github.com/snowplow/aws-lambda-nodejs-example-project
• https://github.com/snowplow/aws-lambda-scala-example-project
• https://github.com/snowplow/spark-streaming-example-project
• We have a single version of the truth – together, the unified log plus Hadoop archive represent our single version of the truth. They contain exactly the same data – our event stream – they just have different time windows of data
• The single version of the truth is upstream from the data warehouse – in the classic era, the data warehouse provided the single version of the truth, making all reports generated from it consistent. In the unified era, the log provides the single version of the truth: as a result, operational systems (e.g. recommendation and ad targeting systems) compute on the same truth as analysts producing management reports
• Point-to-point connections have largely been unravelled – in their place, applications can append to the unified log and other applications can read their writes
• Local loops have been unbundled – in place of local silos, applications can collaborate on near-real-time decision-making via the unified log