© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roy Ben-Alta, Business Development Manager, AWS
Rick McFarland, VP of Data Services, Hearst
October 2015
BDT306
The Life of a Click
How Hearst Publishing Manages
Clickstream Analytics with AWS
What to Expect from the Session
• Common patterns for clickstream analytics
• Tips on using Amazon Kinesis and
Amazon EMR for clickstream processing
• Hearst’s big data journey in building the Hearst analytics
stack for clickstream
• Lessons learned
• Q&A
Clickstream Analytics = Business Value
Ad Tech / Marketing Analytics
• Accelerated ingest-transform-load to final destination: advertising data aggregation
• Continual metrics/KPI extraction: advertising metrics like coverage, yield, conversion, scoring of webpages
• Actionable insights: user activity engagement analytics, optimized bid/buy engines
Consumer Online / Gaming
• Accelerated ingest-transform-load to final destination: online customer engagement data aggregation
• Continual metrics/KPI extraction: consumer/app engagement metrics like page views, CTR
• Actionable insights: customer clickstream analytics, recommendation engines
Financial Services
• Accelerated ingest-transform-load to final destination: digital assets; improved customer experience on bank websites
• Continual metrics/KPI extraction: financial market data metrics
• Actionable insights: fraud monitoring, value-at-risk assessment, auditing of market order data
IoT / Sensor Data
• Accelerated ingest-transform-load to final destination: fitness device, vehicle sensor, and telemetry data ingestion
• Continual metrics/KPI extraction: wearable sensor operational metrics and dashboards
• Actionable insights: device/sensor operational intelligence
DataXu Records
Apache access log:
68.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-"
192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-"
192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET

JSON clickstream record:
{"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST-INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load":"5.096","ref":"http://www.seattlepi.com/comics-and-games","bu":"HNP","brand":"SEATTLE POST-INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"desktop:chrome"}

Clickstream record characteristics:
• Number of fields is not fixed
• Tag names change
• Multiple pages/sites
• Format can be defined as we store the data: Avro, CSV, TSV, JSON
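Because the number of fields is not fixed and tag names change, any parser has to treat every field as optional. A minimal sketch in Python (field names taken from the sample record above):

```python
import json

def parse_click(record_json, wanted_fields):
    """Parse one clickstream record while tolerating a variable field set."""
    record = json.loads(record_json)
    # Missing tags map to "" instead of raising KeyError, so downstream
    # code need not know which fields each page or site emits.
    return {f: record.get(f, "") for f in wanted_fields}

raw = '{"cId":"10049","url":"http://www.seattlepi.com/comics","ua":"desktop:chrome"}'
row = parse_click(raw, ["cId", "url", "ua", "icctm_ht_athr"])
```

The same record can then be re-serialized to whichever storage format (CSV, TSV, Avro) is chosen at write time.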
Clickstream Analytics Is the New “Hello World”
Hello World → Word count → Clickstream
Clickstream Analytics – Common Patterns
• Flume → HDFS → Hive: batch, high latency on retrieval
• Flume → HDFS → Hive & Pig → SQL: batch, low latency on retrieval
• Flume/Sqoop → HDFS → Impala, Spark SQL, Presto, or other engines: more options, batch with lower latency on retrieval
On AWS: web servers → Amazon Kinesis → Kinesis-enabled app → Amazon S3 → Amazon EMR → Amazon S3 → Amazon Redshift → users
It’s All About the Pace, About the Pace…
Big data
• Hourly server logs: were your systems misbehaving 1 hour ago?
• Weekly/monthly bill: what you spent this billing cycle
• Daily customer-preferences report from your website’s clickstream: what deal or ad to try next time
• Daily fraud reports: was there fraud yesterday?
Real-time big data
• Amazon CloudWatch metrics: what went wrong now
• Real-time spending alerts/caps: prevent overspending now
• Real-time analysis: what to offer the current customer now
• Real-time detection: block fraudulent use now
Clickstream Storage and Processing with Amazon Kinesis
An Amazon Kinesis stream (shards 1…N, replicated across Availability Zones, reached through the AWS endpoint) feeds multiple consumers:
• App 1 – aggregate and ingest data to Amazon S3 (data lake)
• App 2 – aggregate and ingest data to Amazon Redshift
• App 3 – ETL/ELT and machine learning (EMR, DynamoDB)
• App N – live dashboard
Amazon EMR
• Managed, elastic Hadoop (1.x & 2.x) clusters
• Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift
• Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically
• Support for Spot Instances
• Integrated HBase NoSQL database
Amazon EMR with Apache Spark
Apache Spark components: Spark SQL, Spark Streaming, MLlib, GraphX
Spot Integration with Amazon EMR

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
  InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
  InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
  InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
Spot Integration with Amazon EMR
10-node cluster running for 14 hours
Cost = $1.00 × 10 nodes × 14 hours = $140

Resize Nodes with Spot Instances
Add 10 more nodes on Spot: a 20-node cluster running for 7 hours
On-Demand cost = $1.00 × 10 × 7 = $70
Spot cost = $0.50 × 10 × 7 = $35
Total = $105

Result: 50% less runtime (14 → 7 hours) and 25% less cost ($140 → $105)
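The arithmetic above generalizes to any mix of On-Demand and Spot nodes; a small sketch reproducing the slide’s numbers (the $1.00/$0.50 hourly rates are the illustrative figures used above, not real pricing):

```python
def cluster_cost(on_demand_nodes, spot_nodes, hours,
                 on_demand_rate=1.00, spot_rate=0.50):
    """Total cluster cost in dollars: each node group is billed at its
    hourly rate for the whole run."""
    return (on_demand_rate * on_demand_nodes + spot_rate * spot_nodes) * hours

baseline = cluster_cost(10, 0, 14)   # 10 On-Demand nodes, 14 hours -> 140.0
resized = cluster_cost(10, 10, 7)    # add 10 Spot nodes, halve the runtime -> 105.0
savings = 1 - resized / baseline     # -> 0.25
```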
Amazon EMR and Amazon Kinesis for Batch and Interactive Processing
• Streaming log analysis
• Interactive ETL
Amazon Kinesis Amazon EMR
Amazon Redshift
Amazon S3
Data scientist
Amazon EMR for data scientists
using Spot instances
BI
Amazon Kinesis Applications – Tips
• Amazon Software License (ASL) linking – add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10)
• Shards – include headroom for catching up with data in the stream
• Tracking Amazon Kinesis application state (DynamoDB)
  • One DynamoDB table per Kinesis application (1:1), created automatically
  • Make sure the application name doesn’t conflict with existing DynamoDB tables
  • Adjust DynamoDB provisioned throughput if necessary (default: 10 reads/sec and 10 writes/sec)
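If the default 10 reads and 10 writes per second are not enough, the lease table’s provisioned throughput can be raised with a DynamoDB UpdateTable call. A sketch that only builds the request parameters (the application name is a made-up example):

```python
def lease_table_update(application_name, read_units, write_units):
    """Build the update_table request for the DynamoDB lease table that a
    KCL application creates (the table name equals the application name).
    The dict would be passed to boto3, e.g.:
        boto3.client("dynamodb").update_table(**params)
    "clickstream-processor" is a hypothetical application name."""
    return {
        "TableName": application_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }

params = lease_table_update("clickstream-processor", 50, 50)
```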
Spark on Amazon EMR – Tips
• Use Amazon EMR AMI version 3.8.0 or later (no need to run bootstrap actions)
• Use Spot Instances for time and cost savings, especially when using Spark
• Run in YARN cluster mode (--master yarn-cluster) for production jobs – the Spark driver runs in the application master (high availability)
• Data serialization – use Kryo where possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer)
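The last two tips can be combined into one launch command. A sketch that assembles the invocation; the --master value and spark.serializer setting come from the tips above, while the class and jar names are placeholders:

```python
def spark_submit_command(main_class, app_jar):
    """Assemble a spark-submit invocation for a production job on EMR:
    YARN cluster mode plus Kryo serialization."""
    conf = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}
    parts = ["spark-submit", "--master", "yarn-cluster", "--class", main_class]
    for key, value in sorted(conf.items()):
        parts += ["--conf", "%s=%s" % (key, value)]
    parts.append(app_jar)  # the application jar comes last
    return " ".join(parts)

cmd = spark_submit_command("com.example.ClickstreamETL", "etl.jar")
```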
The Life of a Click at Hearst
• Hearst’s journey with their big data analytics platform on
AWS
• Demo
• Clickstream analysis patterns
• Lessons learned
Have you heard of Hearst?
BUSINESS MEDIA – operates more than 20 business-to-business properties with significant holdings in the automotive, electronic, medical, and finance industries
MAGAZINES – publishes 20 U.S. titles and close to 300 international editions
BROADCASTING – comprises 31 television and two radio stations
NEWSPAPERS – owns 15 daily and 34 weekly newspapers
Hearst includes over 200 businesses in over 100 countries around the world
Data Services at Hearst – Our Mission
• Ensure that Hearst leverages its combined data
assets
• Unify Hearst’s data streams
• Development of Big Data Analytics Platform using AWS
services
• Promote enterprise-wide product development
• Example: product initiative led by all of Hearst’s editors
– Buzzing@Hearst
Business Value of Buzzing
• Instant feedback on articles from our audiences
• Incremental re-syndication of popular articles across properties (e.g., trending newspaper articles can be adopted by magazines)
• Helps editors write articles that are more relevant to our audiences, and shows which channels our audiences use to read them
• Ultimately, drives incremental value: 25% more page views and 15% more visitors, which lead to incremental revenue
Engineering Requirements of Buzzing…
• Throughput goal: transport data from all 250+ Hearst properties worldwide
• Latency goal: click-to-tool in under 5 minutes
• Agile: easily add new data fields to the clickstream
• Unique metrics requirements defined by the Data Science team (e.g., standard deviations, regressions)
• Data reporting windows ranging from 1 hour to 1 week
• Front end developed from scratch, so data exposed through the API must support the development team’s unique requirements
Most importantly, operation of existing sites cannot be affected!
What we had to work with…
A “static” clickstream collection process on many Hearst sites: users to Hearst properties → clickstream → corporate data center → Netezza data warehouse (loaded once per day)
• Used for ad hoc SQL-based reporting and analytics
• ~30 GB per day containing basic web log data (e.g., referrer, URL, user agent, cookie)
…now how do we get there?
…and we own Hearst’s tag management system
This not only gave us access to the clickstream (users to Hearst properties → clickstream → JavaScript on web pages) but also to the JavaScript code that lives on our websites.
Phase 1 – Ingest Clickstream Data Using AWS
Users to Hearst properties → clickstream → Node.js app proxy → Amazon Kinesis → Kinesis S3 app (KCL libraries) → “raw JSON” raw data
• Use the tag manager to easily deploy JavaScript to all sites
• Implement JavaScript on sites that calls an exposed endpoint and passes in query parameters
• Elastic Beanstalk with Node.js exposes an HTTP endpoint that asynchronously takes the data and feeds it to Amazon Kinesis
• Kinesis Client Library and Kinesis Connectors persist data to Amazon S3 for durability
Node.js – Push clickstream to Amazon Kinesis

function pushToKinesis(data) {
  var params = {
    Data: data, /* required */
    PartitionKey: guid(),
    StreamName: streamName /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
    }
  });
}

app.get('/hearstkin.gif', function(req, res) {
  async.series([function(callback) {
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function() {
  console.log('Express server listening on port ' + app.get('port'));
});
• Asynchronous calls – ensure no interruption to the user experience
• Server timestamp – creates a unified timestamp; Amazon Kinesis now offers this out of the box!
• JSON format – helps us downstream
• Kinesis partition key – guid() is a good partition key to ensure even distribution across the shards
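Why does guid() distribute evenly? Kinesis maps each record to a shard by taking the MD5 hash of its partition key over a 128-bit key space that is divided among the shards, so random keys land uniformly. A sketch that mimics that mapping for a stream whose shards cover equal-width hash ranges (an assumption; real shards can have uneven ranges after resharding):

```python
import hashlib
import uuid

def shard_for(partition_key, num_shards):
    """Mimic the Kinesis routing: MD5 the partition key onto the 128-bit
    key space, then find which equal-width shard range it falls in."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return (h * num_shards) >> 128

# Random GUID partition keys spread records across all shards.
counts = [0] * 4
for _ in range(10000):
    counts[shard_for(str(uuid.uuid4()), 4)] += 1
```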
Ingest Monitoring – AWS
• Amazon Kinesis monitoring
• AWS Elastic Beanstalk monitoring: Auto Scaling is triggered when network in exceeds 20 MB, scaling up to 40 instances
Phase 1 – Summary
• Use JSON formatting for payloads so more fields can be added easily without impacting downstream processing
• The HTTP call requires minimal code introduced to the actual site implementations
• Flexible to meet rollout and growing demand
  • Elastic Beanstalk can be scaled
  • The Amazon Kinesis stream can be re-sharded
  • Amazon S3 provides highly durable storage for raw data
• With a reliable, scalable onboarding platform in place, we can now focus on ETL!
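Re-sharding a stream raises capacity one shard at a time: SplitShard takes a NewStartingHashKey inside the parent shard’s hash-key range, and the midpoint gives two children of equal width. A sketch of just that computation (the boto3-style call in the comment is illustrative):

```python
def new_starting_hash_key(starting_hash_key, ending_hash_key):
    """NewStartingHashKey for an even split: the point just past the
    midpoint of the parent shard's hash-key range. It would be passed as:
        kinesis.split_shard(StreamName=..., ShardToSplit=...,
                            NewStartingHashKey=key)
    Shard hash keys are decimal strings in the Kinesis API."""
    lo, hi = int(starting_hash_key), int(ending_hash_key)
    return str((lo + hi) // 2 + 1)  # second child starts just above the midpoint

# One shard covering the whole 128-bit key space splits at 2**127.
key = new_starting_hash_key("0", str(2**128 - 1))
```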
Phase 2a – Data Processing, First Version (EMR)
“Raw JSON” raw data → ETL on Amazon EMR → clean aggregate data
• Amazon EMR was chosen initially for processing because of the ease of cluster creation… and Pig because we knew how to code in Pig Latin
• 50+ UDFs were written in Python… also because we knew Python
Unfortunately, Pig was not performing well – 15-minute latency
Processing Clickstream Data with Pig

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
REGISTER '/home/hadoop/PROD/parser.py' USING jython AS pyudf;
REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs;
AA0 = LOAD 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' USING TextLoader AS (line:chararray);
A0 = FILTER AA0 BY pyudf.get_obj(line,'url') MATCHES '.*(a\\d+|g\\d+).*';
A1 = FOREACH A0 GENERATE
  pyudf.urlclean(pyudf.get_obj(line,'url')) AS url:chararray,
  pyudf.get_obj(line,'hash') AS hash:chararray,
  pyudf.get_obj(line,'icxid') AS icxid:chararray,
  pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) AS pubdt:chararray,
  pyudf.get_obj(line,'icctm_ht_cnocl') AS cnocl:chararray,
  pyudf.get_obj(line,'icctm_ht_athr') AS author:chararray,
  pyudf.get_obj(line,'icctm_ht_attl') AS title:chararray,
  pyudf.get_obj(line,'icctm_ht_aid') AS cms_id:chararray,
  pyudf.num1(pyudf.get_obj(line,'mxdpth')) AS mxdpth:double,
  pyudf.num2(pyudf.get_obj(line,'load')) AS topsecs:double,
  refs.classy(pyudf.get_obj(line,'url'),1) AS bu:chararray,
  pyudf.get_obj(line,'ip_address') AS ip:chararray,
  pyudf.get_obj(line,'img') AS img:chararray;

• Gzip your output
• Regex in Pig!
• Python imports are limited to what Jython allows
Phase 2b – Data Processing (Spark Streaming)
Users to Hearst properties → clickstream → Node.js app proxy → Amazon Kinesis → ETL on EMR → clean aggregate data
• Welcome, Apache Spark – one framework for batch and real time
• Benefit – the same code serves both batch and real-time ETL
• Use Spot Instances – cost savings
• Drawback – Scala!
Using SQL with Scala (Spark SQL)
Since we knew SQL, we wrote Scala with an embedded SQL query:

endpointUrl = kinesis.us-west-2.amazonaws.com
streamName = hearststream
outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson
window.length = 300
sliding.interval = 300
outputLimit = 5000000
query1Table = hearst1
query1 = SELECT \
  simplestartq(proxyts, 5) as startq, \
  urlclean(url) as url, \
  hash, \
  icxid, \
  pubclean(icctm_ht_dtpub) as pubdt, \
  classy(url,1) as bu, \
  ip_address as ip, \
  artcheck(classy(url,1),url) as artcheck, \
  ref_type(ref,url) as ref_type, \
  img, \
  wc, \
  contentSource \
  FROM hearst1
val jsonRDD = sqlContext.jsonRDD(rdd1)
jsonRDD.registerTempTable(query1Table.trim)
val query1Result = sqlContext.sql(query1)//.limit(outputLimit.toInt)
query1Result.registerTempTable(query2Table.trim)
val query2Result = sqlContext.sql(query2)
query2Result.registerTempTable(query3Table.trim)
val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt)
val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt)
query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON,
outPartitionFolder), classOf[org.apache.hadoop.io.compress.GzipCodec])
logger.info("New JSON file written to "+outputLoc+"/"+outPartitionFolder)
Python UDF versus Scala

Python:

def artcheck(bu, url):
    try:
        if url and bu:
            cleanurl = url[0:url.find("?")].strip('/')
            tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/')
            revurl = cleanurl[::-1]
            root = revurl[0:revurl.find('/')][::-1]
            if (bu=='HMI' or bu=='HMG') and re.compile(r'a\d+|g\d+').search(tailurl) != None: return 'T'
            elif bu=='HTV' and root.isdigit()==True and re.compile('/search/').search(cleanurl)==None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix').search(url)!=None and re.compile(r'\S*[0-9]{4,4}/[0-9]{2,2}/[0-9]{2,2}\S*').search(tailurl)!=None: return 'T'
            elif bu=='HNP' and re.compile('businessinsider').search(url)!=None and re.compile(r'\S*[0-9]{4,4}-[0-9]{2,2}').search(root)!=None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix|businessinsider').search(url)==None and re.compile('.php').search(url)!=None: return 'T'
            else: return 'F'
        else: return 'F'
    except:
        return 'F'

Scala:

def artcheck(bu: String, url: String) = {
  try {
    val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/")
    val pathClean = UDFUtils.pathURI(cleanurl)
    val lastContext = pathClean.split("/").last
    var resp = "F"
    if (("HMI"==bu || "HMG"==bu) && Pattern.compile("/a\\d+|/g\\d+").matcher(pathClean).find()) resp = "T"
    else if ("HTV"==bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp = "T"
    else if ("HNP"==bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("\\d{4}/\\d{2}/\\d{2}").matcher(pathClean).find()) resp = "T"
    else if ("HNP"==bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("\\d{4}-\\d{2}").matcher(lastContext).find()) resp = "T"
    else if ("HNP"==bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find() && Pattern.compile(".php").matcher(url).find()) resp = "T"
    resp
  }
  catch {
    case e: Exception => "F"
  }
}

Don’t be intimidated by Scala… if you know Python, the syntax can be similar:
• re.compile(r'a\d+|g\d+') vs. Pattern.compile("a\\d+|g\\d+")
• try:/except: vs. try{}/catch{}
Phase 3a – Data Science!
Amazon Kinesis → ETL on EMR → clean aggregate data → data science on EC2 → API-ready data
• We initially performed our data science using SAS on Amazon EC2 because of its ability to do both data manipulation and complex data science techniques (e.g., regressions) easily
• Great for exploration and initial development
• Data science runs with this method took 3-5 minutes to complete
SAS Code Example

data _null_;
  call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz");
run;

FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767;

data temp1;
  FORMAT startq DATETIME19.;
  infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1;
  input
    startq :YMDDTTM.
    url :$1000.
    pageviews :best32.
    visits :best32.
    author :$100.
    cms_id :$100.
    img :$1000.
    title :$1000.;
run;

Use a pipe to read in the S3 data and keep it compressed.
proc sql;
CREATE TABLE metrics AS
SELECT
url FORMAT=$1000.,
SUM(pageviews) as pageviews,
SUM(visits) as visits,
SUM(fvisits) as fvisits,
SUM(evisits) as evisits,
MIN(ttct) as rec,
COUNT(distinct startq) as frq,
AVG(visits) as avg_visits_pp,
SUM(visits1) as visits_soc,
SUM(visits2) as visits_dir,
SUM(visits3) as visits_int,
SUM(visits4) as visits_sea,
SUM(visits5) as visits_web,
SUM(visits6) as visits_nws,
SUM(visits7) as visits_pd,
SUM(visits8) as visits_soc_fb,
SUM(visits9) as visits_soc_tw,
SUM(visits10) as visits_soc_pi,
SUM(visits11) as visits_soc_re,
SUM(visits12) as visits_soc_yt,
SUM(visits13) as visits_soc_su,
SUM(visits14) as visits_soc_gp,
SUM(visits15) as visits_soc_li,
SUM(visits16) as visits_soc_tb,
SUM(visits17) as visits_soc_ot,
CASE WHEN (SUM(v1) - SUM(v3) ) > 20 THEN ( SUM(v1) - SUM(v3) ) / 2 ELSE 0 END as trending
FROM temp1
GROUP BY 1;
Use PROC SQL when
possible for easier
translation to Amazon
Redshift for production
later on.
Phase 3b – Split Data Science into Development and Production
Amazon Kinesis → ETL on EMR → clean aggregate data → data science “production” on Amazon Redshift → API-ready data, with data science “development” remaining on EC2
• Once the data science models were established, we split modeling and production
• Production moved to Amazon Redshift, which could read Amazon S3 data and process it much faster
• Data science processing time went down to 100 seconds!
• Use S3 to store data science models and apply them using Amazon Redshift
• Statistical models run once per day
Amazon Redshift Code Example

select
  clean_url as url,
  trim(substring(max(proxyts||domain) from 20 for 1000)) as domain,
  trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl,
  trim(substring(max(proxyts||img) from 20 for 1000)) as img,
  trim(substring(max(proxyts||title) from 20 for 1000)) as title,
  trim(substring(max(proxyts||section) from 20 for 1000)) as section,
  approximate count(distinct ic_fpc) as visits,
  count(1) as hits
from kinesis_hits
where bu='HMG' and (article_id is not null or author is not null or title is not null)
group by 1;

A cool trick to find the most recent value of a character field in one pass through the data.
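The trick works because the timestamp prefix makes string comparison order agree with time order. A Python rendition of the same idea (prefix, max, strip), assuming fixed-width epoch-millisecond timestamps like the proxyts field:

```python
def latest_value(rows, ts_width=19):
    """One-pass latest-value: left-pad the timestamp to a fixed width,
    concatenate the value, take max() so the newest timestamp wins, then
    strip the prefix -- the SQL's substring(... from 20) skips a
    19-character timestamp the same way."""
    tagged = ["%0*d%s" % (ts_width, ts, value) for ts, value in rows]
    return max(tagged)[ts_width:]

rows = [(1422839422426, "old title"), (1422839500000, "new title")]
latest = latest_value(rows)
```

The fixed width matters: without padding, string comparison of timestamps of different lengths would not match numeric order.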
Phase 4a – Elasticsearch Integration
ETL on EMR → S3 storage (agg data) → data science on Amazon Redshift → S3 storage (models, API-ready data) → Amazon EMR → PUSH → Buzzing API
Since we already had the Amazon EMR cluster running, we used a handy Pig jar that makes it easy to push data to Elasticsearch.

Pig Code – Push to Elasticsearch Example

REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;
DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage
  ('es.nodes = es-dev.hearst.io',
   'es.port = 9200',
   'es.http.timeout = 5m',
   'es.index.auto.create = true');
SECTIONS = LOAD 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') AS
  (sectionid:chararray, cnt:long, visits:long, sectionname:chararray);
STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStorageDEV;

• Use the handy Pig jar to push data to Elasticsearch
• The “Amazon EMR overhead” required to read small files added 2 minutes of latency
Phase 4b – Elasticsearch Integration, Sped Up
ETL on EMR → data science on Amazon Redshift → S3 storage (models, agg data, API-ready data) → Buzzing API
Since the Amazon Redshift code already ran in a Python wrapper, the solution was to push data directly into Elasticsearch.

Script to Push to Elasticsearch Directly

# Convert the file into a bulk-insert compatible format
$bin/convert_json.php big.json create rowbyrow.json
# Get the mapping file
${aws} s3 cp s3://hearst/es_mapping es_mapping
# Create the new ES index
$(curl -XPUT http://es.hearst.io/content-web180-sync --data-binary es_mapping -s)
# Perform the bulk API call
$(curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary rowbyrow.json -s) "http://es.hearst.io/content-web180-sync"

• Converting one big input JSON file to row-by-row JSON is the key step for making the data bulk compatible
• Use a mapping file to manage the formatting in your index… very important for dates and numeric values that look like strings
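The bulk API expects each document to be preceded by an action line, with the whole body newline-terminated. A Python sketch of the row-by-row conversion the script performs (the index and type names reuse those from the example; the document is made up):

```python
import json

def to_bulk_body(docs, index, doc_type):
    """Build an Elasticsearch bulk-API request body: one action line plus
    one document line per doc, ending with a newline as the API requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, sort_keys=True))
    return "\n".join(lines) + "\n"

body = to_bulk_body([{"url": "/a", "visits": 3}], "content-web180-sync", "article")
```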
Final Data Pipeline
Users to Hearst properties → clickstream → Node.js app proxy → Amazon Kinesis → S3 storage → ETL on EMR → data science application (Amazon Redshift, models and agg data) → API-ready data → Buzzing API
Latency/throughput along the pipeline: milliseconds at 100 GB/day → 5 seconds at 1 GB/day → 30 seconds at 5 GB/day → 100 seconds at 1 GB/day
A more “visual” representation of our pipeline: clickstream data → Amazon Kinesis → results → API
Version | Transport → Storage → ETL → Storage → Analysis → Storage → Exposure | Latency
• V1: Amazon Kinesis → S3 → EMR (Pig) → S3 → EC2 (SAS) → S3 → EMR to Elasticsearch | 1 hour
• Today: Amazon Kinesis → Spark (Scala) → S3 → Amazon Redshift → Elasticsearch | <5 min
• Tomorrow: Amazon Kinesis → PySpark + SparkR → Elasticsearch | <2 min
Lessons learned
“No duh’s”: removing “stoppage” points, speeding up processing, and combining processes all improve latency.
Data Science Toolbox
The Data Science Toolbox sits on top of the same pipeline shown above, working against the Amazon Redshift data and models.
• IPython Notebook
  • On Spark and Amazon Redshift
  • Code sharing (and insights)
  • User-friendly development environment for data scientists
  • Auto-convert .ipynb → .py
Data Science at Hearst – Notebook
Next Steps
• Amazon EMR 4.1.0 with Spark 1.5 is released, so we can do more with PySpark; look at Apache Zeppelin on Amazon EMR
• Amazon Kinesis just released a feature to retain data for up to 7 days – we could do more ETL “in the stream”
• Amazon Kinesis Firehose and AWS Lambda – zero touch (no Amazon EC2 maintenance)
• More complex data science that requires…
  • Amazon Redshift UDFs
  • A Python shell that calls Amazon Redshift but also allows complex statistical methods (e.g., using R or machine learning)
Conclusion
• Clickstreams are the new “data currency” of business
• AWS provides great technology to process data
  • High speed
  • Lower costs – using Spot…
  • Very agile
• Do more with less: this can all be done with a team of 2 FTEs!
  • 1 developer (well versed in AWS) + 1 data scientist
Ingest → Store → Process → Analyze: from click to insight over time
Call To Action
Use Amazon Kinesis, Amazon EMR, and Amazon Redshift for clickstream (alongside Amazon S3, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, and KCL apps)
• Open-source connectors: http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
• AWS Big Data blog: http://blogs.aws.amazon.com/bigdata/
• AWS re:Invent Big Data booth
• AWS Big Data Marketplace and partner ecosystem
• Hearst booth – Hall C1156: learn more about the interesting things we are doing with data!
Call To Action
Remember to complete
your evaluations!
Thank you!
DPACC Acceleration Progress and DemonstrationOPNFV
 
Analytics & Reporting for Amazon Cloud Logs
Analytics & Reporting for Amazon Cloud LogsAnalytics & Reporting for Amazon Cloud Logs
Analytics & Reporting for Amazon Cloud LogsCloudlytics
 
World's best AWS Cloud Log Analytics & Management Tool
World's best AWS Cloud Log Analytics & Management ToolWorld's best AWS Cloud Log Analytics & Management Tool
World's best AWS Cloud Log Analytics & Management ToolCloudlytics
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...Amazon Web Services
 
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014Amazon Web Services
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWSDanilo Poccia
 
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDKGDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDKNate Wiger
 
Big Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace WebinarBig Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace WebinarAmazon Web Services
 
AWS_Architecture_e-commerce
AWS_Architecture_e-commerceAWS_Architecture_e-commerce
AWS_Architecture_e-commerceSEONGTAEK OH
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream Michel Bruley
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...Amazon Web Services
 
(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...
(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...
(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...Amazon Web Services
 
AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine Amazon Web Services
 
AWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud
AWS Data Transfer Services: Data Ingest Strategies Into the AWS CloudAWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud
AWS Data Transfer Services: Data Ingest Strategies Into the AWS CloudAmazon Web Services
 
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk
(DVO312) Sony: Building At-Scale Services with AWS Elastic BeanstalkAmazon Web Services
 
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWSMatthew (정재화)
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014Amazon Web Services
 

En vedette (20)

Clickstream Data Warehouse - Turning clicks into customers
Clickstream Data Warehouse - Turning clicks into customersClickstream Data Warehouse - Turning clicks into customers
Clickstream Data Warehouse - Turning clicks into customers
 
Deep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming ApplicationsDeep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming Applications
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 
DPACC Acceleration Progress and Demonstration
DPACC Acceleration Progress and DemonstrationDPACC Acceleration Progress and Demonstration
DPACC Acceleration Progress and Demonstration
 
Analytics & Reporting for Amazon Cloud Logs
Analytics & Reporting for Amazon Cloud LogsAnalytics & Reporting for Amazon Cloud Logs
Analytics & Reporting for Amazon Cloud Logs
 
World's best AWS Cloud Log Analytics & Management Tool
World's best AWS Cloud Log Analytics & Management ToolWorld's best AWS Cloud Log Analytics & Management Tool
World's best AWS Cloud Log Analytics & Management Tool
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
 
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWS
 
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDKGDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
 
Big Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace WebinarBig Data Analytics with AWS and AWS Marketplace Webinar
Big Data Analytics with AWS and AWS Marketplace Webinar
 
AWS_Architecture_e-commerce
AWS_Architecture_e-commerceAWS_Architecture_e-commerce
AWS_Architecture_e-commerce
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
 
(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...
(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...
(GAM302) EA's Real-World Hurdles with Millions of Players in the Simpsons: Ta...
 
AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine
 
AWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud
AWS Data Transfer Services: Data Ingest Strategies Into the AWS CloudAWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud
AWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud
 
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk
 
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
스타트업 사례로 본 로그 데이터 분석 : Tajo on AWS
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
 

Similaire à (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Building a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWSBuilding a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWSInjae Kwak
 
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018Amazon Web Services
 
Getting started with Amazon Kinesis
Getting started with Amazon KinesisGetting started with Amazon Kinesis
Getting started with Amazon KinesisAmazon Web Services
 
Getting started with amazon kinesis
Getting started with amazon kinesisGetting started with amazon kinesis
Getting started with amazon kinesisJampp
 
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...Amazon Web Services
 
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...Amazon Web Services
 
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)Amazon Web Services
 
Getting Started with Amazon Kinesis
Getting Started with Amazon KinesisGetting Started with Amazon Kinesis
Getting Started with Amazon KinesisAmazon Web Services
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...Amazon Web Services
 
Path to the future #4 - Ingestão, processamento e análise de dados em tempo real
Path to the future #4 - Ingestão, processamento e análise de dados em tempo realPath to the future #4 - Ingestão, processamento e análise de dados em tempo real
Path to the future #4 - Ingestão, processamento e análise de dados em tempo realAmazon Web Services LATAM
 
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...Amazon Web Services
 
Analyzing Real-time Streaming Data with Amazon Kinesis
Analyzing Real-time Streaming Data with Amazon KinesisAnalyzing Real-time Streaming Data with Amazon Kinesis
Analyzing Real-time Streaming Data with Amazon KinesisAmazon Web Services
 
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Web Services
 
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...Amazon Web Services
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Amazon Web Services
 
The Cloud - What's different
The Cloud - What's differentThe Cloud - What's different
The Cloud - What's differentChen-Tien Tsai
 

Similaire à (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS (20)

Real-Time Streaming Data on AWS
Real-Time Streaming Data on AWSReal-Time Streaming Data on AWS
Real-Time Streaming Data on AWS
 
Building a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWSBuilding a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWS
 
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018
 
Getting started with Amazon Kinesis
Getting started with Amazon KinesisGetting started with Amazon Kinesis
Getting started with Amazon Kinesis
 
Getting started with amazon kinesis
Getting started with amazon kinesisGetting started with amazon kinesis
Getting started with amazon kinesis
 
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:...
 
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics...
 
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
 
Getting Started with Amazon Kinesis
Getting Started with Amazon KinesisGetting Started with Amazon Kinesis
Getting Started with Amazon Kinesis
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
 
Path to the future #4 - Ingestão, processamento e análise de dados em tempo real
Path to the future #4 - Ingestão, processamento e análise de dados em tempo realPath to the future #4 - Ingestão, processamento e análise de dados em tempo real
Path to the future #4 - Ingestão, processamento e análise de dados em tempo real
 
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Analyzing Real-time Streaming Data with Amazon Kinesis
Analyzing Real-time Streaming Data with Amazon KinesisAnalyzing Real-time Streaming Data with Amazon Kinesis
Analyzing Real-time Streaming Data with Amazon Kinesis
 
Building your Datalake on AWS
Building your Datalake on AWSBuilding your Datalake on AWS
Building your Datalake on AWS
 
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
 
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec...
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
The Cloud - What's different
The Cloud - What's differentThe Cloud - What's different
The Cloud - What's different
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Roy Ben-Alta, Business Development Manager, AWS Rick McFarland, VP of Data Services, Hearst October 2015 BDT306 The Life of a Click How Hearst Publishing Manages Clickstream Analytics with AWS
  • 2. What to Expect from the Session • Common patterns for clickstream analytics • Tips on using Amazon Kinesis and Amazon EMR for clickstream processing • Hearst’s big data journey in building the Hearst analytics stack for clickstream • Lessons learned • Q&A
  • 3. Clickstream Analytics = Business Value Verticals/Use Cases Accelerated Ingest- Transform-Load to final destination Continual Metrics/ KPI Extraction Actionable Insights Ad Tech/ Marketing Analytics Advertising data aggregation Advertising metrics like coverage, yield, conversion, scoring webpages User activity engagement analytics, optimized bid/ buy engines Consumer Online/ Gaming Online customer engagement data aggregation Consumer/ app engagement metrics like page views, CTR Customer clickstream analytics, recommendation engines Financial Services Digital assets Improve customer experience on bank website Financial market data metrics Fraud monitoring, and value-at- risk assessment, auditing of market order data IoT / Sensor Data Fitness device , vehicle sensor, telemetry data ingestion Wearable sensor operational metrics, and dashboards Devices / sensor operational intelligence
  • 4. DataXu Records 68.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-" 192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-" 192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET APACHE ACCESS LOG {"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10 ","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST- INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and- games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.co m/comics-and- games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load" :"5.096","ref":"http://www.seattlepi.com/comics-and- games","bu":"HNP","brand":"SEATTLE POST- INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"deskto p:chrome"} JSON Clickstream Record Number of fields is not fixed Tags names change Multiple pages/sites Format can be defined as we store the data AVRO, CSV, TSV, JSON
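Slide 4 shows the two raw record formats the pipeline has to deal with. As a minimal sketch (field names are illustrative, not the talk's schema), an Apache combined-format access log line can be pulled apart with a regex:

```python
import re

# Parse one Apache access-log line (as on slide 4) into a dict of fields.
# The pattern and group names are illustrative assumptions.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) (?P<host>\S+)'
)

def parse_access_log(line):
    """Return a dict of fields, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] '
        '"GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com')
rec = parse_access_log(line)
```

The JSON clickstream records on the same slide need no such parsing, which is one reason the slide notes that the flexible JSON format "helps us downstream".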
  • 5. Clickstream Analytics Is the New “Hello World” Hello World Word count Clickstream
  • 6. Clickstream Analytics – Common Patterns Flume HDFS Hive Batch high latency on retrieve SQLFlume HDFS Hive & Pig Batch low latency on retrieve Flume Sqoop HDFS Impala SparkSql Presto Other More options: Batch with lower latency when retrieve
  • 7. users Amazon Kinesis Kinesis- enabled app Amazon S3 Amazon EMR Web Servers Amazon S3 Amazon Redshift
  • 8. It’s All About the Pace, About the Pace… Big data Hourly server logs: were your systems misbehaving 1hr ago Weekly / monthly bill: what you spent this billing cycle Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time Daily fraud reports: was there fraud yesterday Real-time big data •Amazon CloudWatch metrics: what went wrong now •Real-time spending alerts/caps: prevent overspending now •Real-time analysis: what to offer the current customer now •Real-time detection: block fraudulent use now
  • 9. Clickstream Storage and Processing with Amazon Kinesis Amazon Kinesis App N Live dashboard AWSendpoint App 1 Aggregate and ingest data to S3 App 2 Aggregate and ingest data to Amazon Redshift Data lake Amazon Redshift App 3 ETL/ELT Machine learning Availability Zone Shard 1 Shard 2 Shard N Availability Zone Availability Zone EMR DynamoDB
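Why the shards on slide 9 share load evenly with random partition keys (as slide 31's guid() note later recommends): Kinesis maps the MD5 hash of the partition key into a shard's hash-key range. A small simulation of that mapping — the even split of ranges here is an assumption; real streams can have uneven ranges after resharding:

```python
import hashlib
import uuid

def shard_for(partition_key, num_shards):
    # 128-bit MD5 of the key, mapped proportionally onto evenly split
    # shard hash-key ranges (an idealization of Kinesis' behavior).
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * num_shards // 2**128

# Random GUID keys spread 4,000 records across 4 shards roughly evenly.
counts = [0] * 4
for _ in range(4000):
    counts[shard_for(str(uuid.uuid4()), 4)] += 1
```

A fixed partition key (e.g. a site name) would instead send every record to one shard, creating a hot shard regardless of how many shards the stream has.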
  • 10. Amazon EMR Managed, elastic Hadoop (1.x & 2.x) cluster. Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift. Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically. Support for Spot instances. Integrated HBase NoSQL database. Amazon EMR with Apache Spark: Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX
  • 11. Spot Integration with Amazon EMR aws emr create-cluster --name "Spot cluster" --ami-version 3.3 InstanceGroupType=MASTER, InstanceType=m3.xlarge,InstanceCount=1, InstanceGroupType=CORE, BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 InstanceGroupType=TASK, BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
  • 12. Spot Integration with Amazon EMR 10 node cluster running for 14 hours Cost = 1.0 * 10 * 14 = $140
  • 13. Resize Nodes with Spot Instances Add 10 more nodes on Spot
  • 14. Resize Nodes with Spot Instances 20 node cluster running for 7 hours Cost = 1.0 * 10 * 7 = $70 = 0.5 * 10 * 7 = $35 Total $105
  • 15. Resize Nodes with Spot Instances 50 % less run-time ( 14  7) 25% less cost (140  105)
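The arithmetic across slides 12–15 is easy to verify. A small sketch reproducing the deck's illustrative prices ($1.00/node-hour on-demand, $0.50/node-hour Spot):

```python
# Slides 12-15: 10 on-demand nodes for 14 hours, versus adding 10 Spot
# nodes to finish in 7 hours. Prices are the deck's illustrative figures.
on_demand_price = 1.00   # $/node-hour
spot_price = 0.50        # $/node-hour (the deck's example bid outcome)

baseline = on_demand_price * 10 * 14                            # $140
with_spot = on_demand_price * 10 * 7 + spot_price * 10 * 7      # $70 + $35
savings = 1 - with_spot / baseline                              # 25% cheaper
```

Half the run time at three-quarters of the cost, which is exactly the "50% less run-time, 25% less cost" summary on slide 15.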
  • 16. Amazon EMR and Amazon Kinesis for Batch and Interactive Processing • Streaming log analysis • Interactive ETL Amazon Kinesis Amazon EMR Amazon Redshift Amazon S3 Data scientist Amazon EMR for data scientists using Spot instances BI
  • 17. AWS Elastic Beanstalk – App to push data into Amazon Kinesis
  • 18. Amazon Kinesis Applications – Tips • Amazon Software License linking – add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10) • Shards – include head-room for catching up with data in the stream • Tracking Amazon Kinesis application state (DynamoDB) • Kinesis application : DynamoDB table (1:1) • Created automatically • Make sure the application name doesn’t conflict with existing DynamoDB tables • Adjust DynamoDB provisioned throughput if necessary (default: 10 reads per second & 10 writes per second)
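The last tip on slide 18 — raising the checkpoint table's provisioned throughput — can be scripted. A hedged sketch: the table name `my-kinesis-app` is a placeholder (the KCL names the table after the application), and the commented-out `update_table` call is the DynamoDB API that would apply it:

```python
# Build an update_table request to raise a KCL checkpoint table's capacity
# above the 10 read / 10 write default. "my-kinesis-app" is hypothetical;
# the KCL creates one table per application, named after it.
def throughput_update(table_name, reads, writes):
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": reads,
            "WriteCapacityUnits": writes,
        },
    }

params = throughput_update("my-kinesis-app", 50, 50)
# To apply: boto3.client("dynamodb").update_table(**params)
```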
  • 19. Spark on Amazon EMR - Tips • Amazon EMR applications after version 3.8.0 (no need to run bootstrap actions) • Use Spot instances for time & cost saving especially when using Spark • Run in Yarn cluster mode (--master yarn-cluster) for production jobs – Spark driver runs in application master (high availability) • Data serialization – use Kryo if possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer)
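The two production settings on slide 19 come together on the spark-submit command line. A sketch that assembles those flags (the jar path and class name are placeholders):

```python
# Assemble the spark-submit flags slide 19 recommends: yarn-cluster mode so
# the driver runs in the application master, plus Kryo serialization.
# "s3://my-bucket/etl.jar" and the class name are hypothetical.
def spark_submit_args(app_jar, main_class):
    return [
        "spark-submit",
        "--master", "yarn-cluster",
        "--class", main_class,
        "--conf",
        "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        app_jar,
    ]

args = spark_submit_args("s3://my-bucket/etl.jar", "com.example.ClickstreamETL")
```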
  • 20. The Life of a Click at Hearst • Hearst’s journey with their big data analytics platform on AWS • Demo • Clickstream analysis patterns • Lessons learned
  • 21.
  • 22. Have you heard of Hearst?
  • 23. BUSINESS MEDIA operates more than 20 business-to-businesses with significant holdings in the automotive, electronic, medical and finance industries MAGAZINES publishes 20 U.S. titles and close to 300 international editions BROADCASTING comprises 31 television and two radio stations NEWSPAPERS owns 15 daily and 34 weekly newspapers Hearst includes over 200 businesses in over 100 countries around the world
  • 24. Data Services at Hearst – Our Mission • Ensure that Hearst leverages its combined data assets • Unify Hearst’s data streams • Development of Big Data Analytics Platform using AWS services • Promote enterprise-wide product development • Example: product initiative led by all of Hearst’s editors – Buzzing@Hearst
  • 25. 1
  • 26. Business Value of Buzzing • Instant feedback on articles from our audiences • Incremental re-syndication of popular articles across properties (e.g. trending newspaper articles can be adopted by magazines) • Inform the editors to write articles that are more relevant to our audiences and what channels are our audiences leveraging to read our articles • Ultimately, drive incremental value • 25% more page views, 15% more visitors which lead to incremental revenue
  • 27. • Throughput goal: transport data from all 250+ Hearst properties worldwide • Latency goal: click-to-tool in under 5 minutes • Agile: easily add new data fields into clickstream • Unique metrics requirements defined by Data Science team (e.g., standard deviations, regressions, etc.) • Data reporting windows ranging from 1 hour to 1 week • Front-end developed “from scratch” so data exposed through API must support development team’s unique requirements Most importantly, operation of existing sites cannot be affected! Engineering Requirements of Buzzing…
  • 28. What we had to work with… a ”static” clickstream collection process on many Hearst sites Users to Hearst Properties Clickstream corporate data center Netezza Data Warehouse Once per day …now how do we get there? Used for ad hoc SQL-based reporting and analytics ~30 GB per day containing basic web log data (e.g., referrer, url, user agent, cookie, etc.)
  • 29. …and we own Hearst’s tag management system Users to Hearst Properties Clickstream This not only gave us access to the clickstream but also the JavaScript code that lives on our websites JavaScript on web pages
  • 30. Phase 1 – Ingest Clickstream Data Using AWS Amazon Kinesis Node.JS App- Proxy Kinesis S3 App – KCL Libraries Users to Hearst Properties Clickstream “Raw JSON” Raw data Use tag manager to easily deploy JavaScript to all sites Kinesis Client Libraries and Kinesis Connectors persist data to Amazon S3 for durability ElasticBeanstalk with Node.JS exposes an HTTP endpoint which asynchronously takes the data and feeds to Amazon Kinesis Implement JavaScript on sites that call an exposed endpoint and pass in query parameters
  • 31. Node.JS – Push clickstream to Amazon Kinesis

function pushToKinesis(data) {
  var params = {
    Data: data, /* required */
    PartitionKey: guid(),
    StreamName: streamName /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) { console.log(err, err.stack); } // an error occurred
  });
}

app.get('/hearstkin.gif', function(req, res){
  async.series([function(callback){
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function(){
  console.log('Express server listening on port ' + app.get('port'));
});

Asynchronous calls – ensures no user experience interruption
Server timestamp – to create a unified timestamp. Amazon Kinesis now offers this out-of-the-box!
JSON format – this helps us downstream
Kinesis Partition Key – guid() is a good partition key to ensure even distribution across the shards
  • 32. Ingest Monitoring – Amazon Kinesis Monitoring and AWS Elastic Beanstalk Monitoring. Auto Scaling triggered by network in > 20 MB; then scale up to 40 instances.
  • 33. Phase 1- Summary • Use JSON formatting for payloads so more fields can be easily added without impacting downstream processing • HTTP call requires minimal code introduced to the actual site implementations • Flexible to meet rollout and growing demand • Elastic Beanstalk can be scaled • Amazon Kinesis stream can be re-sharded • Amazon S3 provides high durability storage for raw data • Once a reliable, scalable onboarding platform is in place, we can now focus on ETL!
  • 34. Phase 2a- Data Processing First Version (EMR) ETL on Amazon EMR “Raw JSON” Raw Data Clean Aggregate Data • Amazon EMR was chosen initially for processing due to ease of Amazon EMR creation … and Pig because we knew how to code in PigLatin • 50+ UDFs were written using Python…also because we knew Python
  • 35. Unfortunately, Pig was not performing well – 15 min latency Processing Clickstream Data with Pig set output.compression.enabled true; set output.compression.codec org.apache.hadoop.io.compress.GzipCodec; REGISTER '/home/hadoop/PROD/parser.py' USING jython as pyudf; REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs; AA0 = load 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' using TextLoader as (line:chararray); A0 = FILTER AA0 BY ( pyudf.get_obj(line,'url') MATCHES '.*(ad+|gd+).*'; A1 = FOREACH A0 GENERATE ( pyudf.urlclean(pyudf.get_obj(line,'url')) as url:chararray, pyudf.get_obj(line,'hash') as hash:chararray, pyudf.get_obj(line,'icxid') as icxid:chararray, pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) as pubdt:chararray, pyudf.get_obj(line,'icctm_ht_cnocl') as cnocl:chararray, pyudf.get_obj(line,'icctm_ht_athr') as author:chararray, pyudf.get_obj(line,'icctm_ht_attl') as title:chararray, pyudf.get_obj(line,'icctm_ht_aid') as cms_id:chararray, pyudf.num1(pyudf.get_obj(line,'mxdpth')) as mxdpth:double, pyudf.num2(pyudf.get_obj(line,'load')) as topsecs:double, refs.classy(pyudf.get_obj(line,'url'),1) as bu:chararray, pyudf.get_obj(line,'ip_address') as ip:chararray, pyudf.get_obj(line,'img') as img:chararray ; Gzip your output Regex in Pig! Python imports limited to what is allowed by Jython
  • 36. Phase 2b- Data Processing (Spark Streaming) Clean Aggregate Data Node.JS App-Proxy Users to Hearst Properties Clickstream • Welcome Apache Spark – one framework for batch and real time • Benefits – using same code for batch and real time ETL • Use Spot instances – cost savings • Drawbacks – Scala! Amazon Kinesis ETL on EMR
  • 37. Using SQL with Scala SparkSQL Since we knew SQL, we wrote Scala with embedded SQL Query endpointUrl = kinesis.us-west-2.amazonaws.com streamName= hearststream outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson window.length = 300 sliding.interval = 300 outputLimit = 5000000 query1Table=hearst1 query1= SELECT simplestartq(proxyts, 5) as startq, urlclean(url) as url, hash, icxid, pubclean(icctm_ht_dtpub) as pubdt, classy(url,1) as bu, ip_address as ip, artcheck(classy(url,1),url) as artcheck, ref_type(ref,url) as ref_type, img, wc, contentSource FROM hearst1 val jsonRDD = sqlContext.jsonRDD(rdd1) jsonRDD.registerTempTable(query1Table.trim) val query1Result = sqlContext.sql(query1)//.limit(outputLimit.toInt) query1Result.registerTempTable(query2Table.trim) val query2Result = sqlContext.sql(query2) query2Result.registerTempTable(query3Table.trim) val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt) val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt) query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON, outPartitionFolder), classOf[org.apache.hadoop.io.compress.GzipCodec]) logger.info("New JSON file written to "+outputLoc+"/"+outPartitionFolder)
  • 38. Python UDF versus Scala Python def artcheck(bu,url): try: if url and bu: cleanurl = url[0:url.find("?")].strip('/') tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/') revurl=cleanurl[::-1] root=revurl[0:revurl.find('/')][::-1] if (bu=='HMI' or bu=='HMG') and re.compile('ad+|gd+').search(tailurl)!=None : return 'T' elif bu=='HTV' and root.isdigit()==True and re.compile('/search/').search(cleanurl)==None: return 'T' elif bu=='HNP' and re.compile('blog|fuelfix').search(url)!=None and re.compile(r'S*[0-9]{4,4}/[0-9]{2,2}/[0-9]{2,2}S*').search(tailurl)!=None : return 'T' elif bu=='HNP' and re.compile('businessinsider').search(url)!=None and re.compile(r'S*[0-9]{4,4}-[0-9]{2,2}').search(root)!=None : return 'T' elif bu=='HNP' and re.compile('blog|fuelfix|businessinsider').search(url)==None and re.compile('.php').search(url)!=None : return 'T' else : return 'F' else : return 'F' except: return 'F' def artcheck(bu:String, url: String )={ try{ val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/") val pathClean = UDFUtils.pathURI(cleanurl) val lastContext = pathClean.split("/").last var resp = "F" if(("HMI"==bu||"HMG"==bu)&&Pattern.compile("/ad+|/gd+").matcher(pathClean).find()) resp="T" else if("HTV"==bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp="T" else if("HNP"==bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("d{4}/d{2}/d{2}").matcher(pathClean).find()) resp="T" else if("HNP"==bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("d{4}-d{2}").matcher(lastContext).find()) resp="T" else if("HNP"==bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find()&& Pattern.compile(".php").matcher(url).find()) resp="T" resp} } catch{ case e:Exception => "F" } Scala Don’t be intimidated by Scala…if you know Python, the syntax can be similar re.compile('ad+|gd+'). Pattern.compile("ad+|gd+"). Try: Except: Try{} Catch{}
  • 39. Phase 3a- Data Science! Data Science on EC2 Clean Aggregate Data API-ready Data Amazon Kinesis ETL on EMR • We decided to perform our Data Science using SAS on Amazon EC2 initially because of the ability to perform both data manipulation and easily run complex data science techniques (e.g., regressions) • Great for exploration and initial development • Performing data science using this method took 3-5 minutes to complete
  • 40. SAS Code Example data _null_; call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz"); run; FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767; data temp1; FORMAT startq DATETIME19.; infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1; input startq :YMDDTTM. url :$1000. pageviews :best32. visits :best32. author :$100. cms_id :$100. img :$1000. title :$1000.; run; Use pipe to read in S3 data and keep it compressed proc sql; CREATE TABLE metrics AS SELECT url FORMAT=$1000., SUM(pageviews) as pageviews, SUM(visits) as visits, SUM(fvisits) as fvisits, SUM(evisits) as evisits, MIN(ttct) as rec, COUNT(distinct startq) as frq, AVG(visits) as avg_visits_pp, SUM(visits1) as visits_soc, SUM(visits2) as visits_dir, SUM(visits3) as visits_int, SUM(visits4) as visits_sea, SUM(visits5) as visits_web, SUM(visits6) as visits_nws, SUM(visits7) as visits_pd, SUM(visits8) as visits_soc_fb, SUM(visits9) as visits_soc_tw, SUM(visits10) as visits_soc_pi, SUM(visits11) as visits_soc_re, SUM(visits12) as visits_soc_yt, SUM(visits13) as visits_soc_su, SUM(visits14) as visits_soc_gp, SUM(visits15) as visits_soc_li, SUM(visits16) as visits_soc_tb, SUM(visits17) as visits_soc_ot, CASE WHEN (SUM(v1) - SUM(v3) ) > 20 THEN ( SUM(v1) - SUM(v3) ) / 2 ELSE 0 END as trending FROM temp1 GROUP BY 1; Use PROC SQL when possible for easier translation to Amazon Redshift for production later on.
  • 41. Phase 3b- Split Data Science into Development and Production Amazon Kinesis Clean Aggregate Data API-ready Data Data Science “Production” Amazon Redshift ETL on EMR • Once Data Science models were established, we split the modeling and production • Production was moved to Amazon Redshift which provided much faster ability to read Amazon S3 data and process the data • Data Science processing time went down to 100 seconds! Use S3 to store data science models and apply them using Amazon Redshift Data Science “Development” on EC2 Statistical Models run once per day Models Agg Data
  • 42. Amazon Redshift Code Example

select clean_url as url,
  trim(substring(max(proxyts||domain) from 20 for 1000)) as domain,
  trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl,
  trim(substring(max(proxyts||img) from 20 for 1000)) as img,
  trim(substring(max(proxyts||title) from 20 for 1000)) as title,
  trim(substring(max(proxyts||section) from 20 for 1000)) as section,
  approximate count(distinct ic_fpc) as visits,
  count(1) as hits
from kinesis_hits
where bu='HMG'
  and (article_id is not null or author is not null or title is not null)
group by 1;

Cool trick to find the most recent value of a character field in one pass through the data
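The trick is worth spelling out: concatenate a fixed-width timestamp onto each value, take the lexicographic max, then strip the prefix — which is what `substring(max(proxyts||domain) from 20)` does. A pure-Python illustration with toy data (the 19-character prefix width mirrors the SQL's `from 20`):

```python
# Toy rows: (19-char zero-padded epoch-ms timestamp, field value).
rows = [
    ("0000001422839422426", "old-title"),
    ("0000001422839499999", "new-title"),
]

def latest(rows):
    # Prefix each value with its sortable timestamp, take the lexicographic
    # max, then drop the 19-char prefix -- most recent value in one pass.
    return max(ts + val for ts, val in rows)[19:]
```

Because the timestamp prefix is fixed-width and sorts chronologically as a string, the max of the concatenations is guaranteed to carry the newest value.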
  • 43. Phase 4a- Elasticsearch Integration Amazon EMR PUSH Buzzing API S3 Storage Data Science Amazon Redshift ETL on EMR Since we had the Amazon EMR cluster running already, we used a handy Pig jar that made it easy to push data to Elasticsearch. S3 Storage Models Agg Data API Ready Data
  • 44. Pig Code – Push to ES Example

REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;
DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage ('es.nodes = es-dev.hearst.io', 'es.port = 9200', 'es.http.timeout = 5m', 'es.index.auto.create = true');
SECTIONS = load 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') as (sectionid:chararray,cnt:long,visits:long,sectionname:chararray);
STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStoragePROD;

Use handy Pig jar to push data to Elasticsearch
The “Amazon EMR overhead” required to read small files added 2 min to latency
  • 45. Phase 4b- Elasticsearch Integration Sped Up Buzzing API S3 Storage API Ready Data Data Science Amazon Redshift ETL on EMR Since the Amazon Redshift code was run in a Python wrapper, solution was to push data directly into Elasticsearch Models Agg Data
  • 46. Script to Push to Elasticsearch Directly

# Converting file into bulk-insert compatible format
$bin/convert_json.php big.json create rowbyrow.json
# Get mapping file
${aws} s3 cp S3://hearst/es_mapping es_mapping
# Creating new ES index
$(curl -XPUT http://es.hearst.io/content-web180-sync --data-binary es_mapping -s)
# Performing bulk API call
$(curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary rowbyrow.json -s) "http://es.hearst.io/content-web180-sync"

Converting one big input JSON file to a row-by-row JSON is a key step for making the data batch compatible
Use a mapping file to manage the formatting in your index… very important for dates and numeric values that look like strings
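The convert_json step boils down to emitting the action/document line pairs the Elasticsearch Bulk API expects, one JSON object per line. A minimal Python sketch of that framing (the index name follows the script; the input document is toy data, and the exact convert_json.php behavior is an assumption):

```python
import json

def to_bulk(docs, index, doc_type):
    """Turn a list of documents into Bulk-API newline-delimited JSON:
    an {"index": ...} action line followed by the document line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # Bulk API requires a trailing newline

body = to_bulk([{"url": "a", "visits": 3}], "content-web180-sync", "doc")
```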
  • 47. Final Data Pipeline Buzzing API API Ready Data Amazon Kinesis S3 Storage Node.JS App- Proxy Users to Hearst Properties Clickstream Data Science Application Amazon Redshift ETL on EMR 100 seconds 1G/day 30 seconds 5GB/day 5 seconds 1G/day Milliseconds 100GB/day LATENCY THROUGHPUT Models Agg Data
  • 48. Data Science Amazon Redshift ETL A more “visual” representation of our pipeline! Clickstream dataAmazon Kinesis Results API
  • 49. Lessons Learned

Version   Transport       Storage  ETL               Storage  Analysis         Storage  Exposure              Latency
V1        Amazon Kinesis  S3       EMR-Pig           S3       EC2-SAS          S3       EMR to ElasticSearch  1 hour
Today     Amazon Kinesis           Spark-Scala       S3       Amazon Redshift           ElasticSearch         <5 min
Tomorrow  Amazon Kinesis           PySpark + SparkR                                     ElasticSearch         <2 min

“No Duh’s”: removing “stoppage” points, speeding up processing, and combining processes improve latency.
  • 50. Data Science Tool Box Buzzing API API Ready Data Amazon Kinesis S3 Storage Node.JS App- Proxy Users to Hearst Properties Clickstream Data Science Application Amazon Redshift ETL on EMR Models Agg Data • IPython Notebook • On Spark and Amazon Redshift • Code sharing (and insights) • User-friendly development environment for data scientists • Auto-convert .pynb  .py Data Science Toolbox Data Models Amazon Redshift
  • 51. Data Science at Hearst – Notebook
  • 52. Next Steps • Amazon EMR 4.1.0 with Spark 1.5 released – we can do more with PySpark; look at Apache Zeppelin on Amazon EMR • Amazon Kinesis just released a new feature to retain data up to 7 days – we could do more ETL “in the stream” • Amazon Kinesis Firehose and Lambda – zero touch (no Amazon EC2 maintenance) • More complex data science that requires… • Amazon Redshift UDFs • Python shell that calls Amazon Redshift but also allows for complex statistical methods (e.g., using R or machine learning)
  • 53. Conclusion • Clickstreams are the new “data currency” of business • AWS provides great technology to process data • High speed • Lower costs – Using Spot… • Very agile • Do more with less: this can all be done with a team of 2 FTEs! • 1 developer (well versed in AWS) + 1 data scientist
  • 54. Ingest Store Process Analyze Click Insight Time Call To Action Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon RDS (Aurora) AWS Lambda KCL Apps Amazon EMR Amazon Redshift
  • 55. Use Amazon Kinesis, EMR and Amazon Redshift for Clickstream Open source connectors: • http://docs.aws.amazon.com/kinesis/latest/dev/developing- consumers-with-kcl.html AWS Big Data blog - http://blogs.aws.amazon.com/bigdata/ AWS re:Invent Big Data booth AWS Big Data Marketplace and Partner ecosystem Hearst Booth – Hall C1156: Learn more about the interesting things we are doing with data! Call To Action