Divolte Collector 
Because life’s too short for log file parsing 
GoDataDriven 
PROUDLY PART OF THE XEBIA GROUP 
@asnare / @fzk 
signal@godatadriven.com 
Andrew Snare / Friso van Vollenhoven
99% of all data in Hadoop 
156.68.7.63 - - [28/Jul/1995:11:53:28 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669 
137.244.160.140 - - [28/Jul/1995:11:53:29 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0 
163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 4324 
163.205.160.5 - - [28/Jul/1995:11:53:40 -0400] "GET /shuttle/countdown/count70.gif HTTP/1.0" 200 46573 
140.229.50.189 - - [28/Jul/1995:11:53:54 -0400] "GET /shuttle/missions/sts-67/images/images.html HTTP/1.0"
163.206.89.4 - - [28/Jul/1995:11:54:02 -0400] "GET /shuttle/technology/sts-newsref/sts-mps.html HTTP/1.0" 200
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
163.206.89.4 - - [28/Jul/1995:11:54:05 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891
131.110.53.48 - - [28/Jul/1995:11:54:07 -0400] "GET /shuttle/technology/sts-newsref/stsref-toc.html HTTP/1.0"
163.205.160.5 - - [28/Jul/1995:11:54:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
130.160.196.81 - - [28/Jul/1995:11:54:15 -0400] "GET /shuttle/resources/orbiters/challenger.html HTTP/1.0"
131.110.53.48 - - [28/Jul/1995:11:54:16 -0400] "GET /images/shuttle-patch-small.gif HTTP/1.0" 200 4179
137.244.160.140 - - [28/Jul/1995:11:54:16 -0400] "GET /shuttle/missions/sts-69/mission-sts-69.html HTTP/1.0"
131.110.53.48 - - [28/Jul/1995:11:54:18 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
131.110.53.48 - - [28/Jul/1995:11:54:19 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
130.160.196.81 - - [28/Jul/1995:11:54:19 -0400] "GET /shuttle/resources/orbiters/challenger-logo.gif HTTP/1.0"
163.205.160.5 - - [28/Jul/1995:11:54:25 -0400] "GET /shuttle/missions/sts-70/images/images.html HTTP/1.0" 200
130.181.4.158 - - [28/Jul/1995:11:54:26 -0400] "GET /history/rocket-history.txt HTTP/1.0" 200 26990
137.244.160.140 - - [28/Jul/1995:11:54:30 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:31 -0400] "GET /images/launch-logo.gif HTTP/1.0" 304 0
137.244.160.140 - - [28/Jul/1995:11:54:38 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 304
168.178.17.149 - - [28/Jul/1995:11:54:48 -0400] "GET /shuttle/missions/sts-65/mission-sts-65.html HTTP/1.0"
140.229.50.189 - - [28/Jul/1995:11:54:53 -0400] "GET /shuttle/missions/sts-67/images/KSC-95EC-0390.jpg HTTP/
131.110.53.48 - - [28/Jul/1995:11:54:58 -0400] "GET /shuttle/missions/missions.html HTTP/1.0" 200 8677
131.110.53.48 - - [28/Jul/1995:11:55:02 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853 
131.110.53.48 - - [28/Jul/1995:11:55:05 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786 
128.159.111.141 - - [28/Jul/1995:11:55:09 -0400] "GET /procurement/procurement.html HTTP/1.0" 200 3499 
128.159.111.141 - - [28/Jul/1995:11:55:10 -0400] "GET /images/op-logo-small.gif HTTP/1.0" 200 14915 
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786 
128.159.111.141 - - [28/Jul/1995:11:55:11 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204 
192.213.154.220 - - [28/Jul/1995:11:55:15 -0400] "GET /shuttle/countdown/tour.html HTTP/1.0" 200 4347
Typical web optimization architecture
[diagram: a user's HTTP request (e.g. /org/apache/hadoop/io/IOUtils.html) hits the service, which emits a timestamped log event (e.g. 2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html); logs are transported to a compute cluster for offline analytics / model training with batch updates to the model state, and consumed as a streaming log for streaming updates; the model state serves results (e.g. recommendations) back to the user]
Parse HTTP server logs 
access.log
How did it get there? 
Option 1: parse HTTP server logs 
• Ship log files on a schedule 
• Parse using MapReduce jobs 
• Batch analytics jobs feed online systems
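For a sense of what that parsing step involves, here is a minimal sketch (not Divolte's code; the regex and field names are illustrative) that pulls fields out of one Common Log Format line. A real job wraps this in MapReduce and has to survive malformed, truncated and evolving log lines.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal Common Log Format parser sketch (illustrative only).
// Real parsers also need to cope with truncated lines, escaped quotes,
// extra fields (referrer, user agent) and format changes over time.
public final class CommonLogParser {
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        String line = "163.205.160.5 - - [28/Jul/1995:11:53:31 -0400] "
                    + "\"GET /shuttle/countdown/ HTTP/1.0\" 200 4324";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("ip="        + m.group(1));
            System.out.println("timestamp=" + m.group(2));
            System.out.println("method="    + m.group(3));
            System.out.println("path="      + m.group(4));
            System.out.println("status="    + m.group(6));
            System.out.println("bytes="     + m.group(7));
        } else {
            System.out.println("unparseable line: " + line);
        }
    }
}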
HTTP server log parsing 
• Inherently batch oriented 
• Schema-less (URL format is the schema) 
• Initial job to parse logs into structured format 
• Usually multiple versions of parsers required 
• Requires sessionizing 
• Logs usually have more than you ask for (bots, image requests, spiders, health checks, etc.)
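To make the sessionizing point concrete, here is a toy sketch (illustrative only, not part of any real pipeline) that cuts each visitor's hit timestamps into sessions on a 30-minute inactivity gap; a production job would key on something like IP plus user agent and run as MapReduce or Spark rather than in memory.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sessionizer sketch: group hits per visitor key and start a new
// session after 30 minutes of inactivity.
public final class Sessionizer {
    private static final long SESSION_GAP_MS = 30 * 60 * 1000L;

    public static Map<String, List<List<Long>>> sessionize(Map<String, List<Long>> hitsPerVisitor) {
        Map<String, List<List<Long>>> sessions = new HashMap<>();
        for (Map.Entry<String, List<Long>> entry : hitsPerVisitor.entrySet()) {
            List<Long> timestamps = new ArrayList<>(entry.getValue());
            Collections.sort(timestamps);                 // hits must be time-ordered
            List<List<Long>> visitorSessions = new ArrayList<>();
            List<Long> current = new ArrayList<>();
            long previous = Long.MIN_VALUE;
            for (long ts : timestamps) {
                if (!current.isEmpty() && ts - previous > SESSION_GAP_MS) {
                    visitorSessions.add(current);         // inactivity gap: close the session
                    current = new ArrayList<>();
                }
                current.add(ts);
                previous = ts;
            }
            if (!current.isEmpty()) {
                visitorSessions.add(current);
            }
            sessions.put(entry.getKey(), visitorSessions);
        }
        return sessions;
    }
}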
Stream HTTP server logs
[diagram: the web server's access.log is followed with tail -F and the resulting events are pushed onto a message queue or event transport (Kafka, Flume, etc.), from which Hadoop and other consumers read them]
How did it get there? 
Option 2: stream HTTP server logs 
• tail -F logfiles 
• Use a queue for transport (e.g. Flume or Kafka) 
• Parse logs on the fly 
• Or write semi-schema’d logs, like JSON 
• Parse again for batch workload
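A rough sketch of the "tail a log into a queue" idea, written against the current Kafka producer API (the deck's own config uses the older metadata.broker.list style); broker list, topic name and the polling loop are made up for illustration, and in practice agents like Flume or Logstash do this job.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Rough "tail -F into Kafka" sketch. Production setups handle log rotation,
// backpressure and delivery guarantees; this loop does not.
public final class AccessLogTailer {
    public static void main(String[] args) throws IOException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new FileReader("access.log"))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    Thread.sleep(500);   // wait for the web server to append more
                } else {
                    producer.send(new ProducerRecord<>("access-log-lines", line));
                }
            }
        }
    }
}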
Stream HTTP server logs 
• Allows for near real-time event handling when 
consuming from queues 
• Sessionizing? Duplicates? Bots? 
• Still requires parser logic 
• No schema
Tagging
[diagram: regular web page traffic goes to the web server, which serves index.html and script.js and writes its own access.log; the tag in the page sends asynchronous tracking traffic to a dedicated tracking server, which pushes structured events onto a message queue or event transport (Kafka, Flume, etc.) for Hadoop and other consumers]
How did it get there? 
Option 3: tagging 
• Instrument pages with a special ‘tag’, i.e. a JavaScript snippet or image request used only for logging
• Create special endpoint that handles the tag 
request in a structured way 
• Tag endpoint handles logging the events
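A toy version of such a tag endpoint, using only the JDK's built-in HTTP server; this is purely to illustrate the idea and is not how Divolte Collector is implemented.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.time.Instant;

import com.sun.net.httpserver.HttpServer;

public final class TagEndpoint {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8290), 0);
        // The page requests /event?location=...&pageType=...; every hit becomes one structured event.
        server.createContext("/event", exchange -> {
            String query = exchange.getRequestURI().getRawQuery();
            // A real endpoint hands the event to Kafka/HDFS; here it just goes to stdout.
            System.out.println(Instant.now() + " " + (query == null ? "" : query));
            // Tags need no response body; real endpoints often return a 1x1 GIF or a JS file.
            exchange.sendResponseHeaders(204, -1);
            exchange.close();
        });
        server.start();
    }
}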
Tagging 
• Not a new idea (Google Analytics, Omniture, 
etc.) 
• Less garbage traffic, because a browser is 
required to evaluate the tag 
• Event logging is asynchronous 
• Easier to do in-flight processing (apply a schema, add enrichments, etc.)
• Allows for custom events (other than page view)
Also… 
• Manage session through cookies on the client 
side 
• Incoming data is already sessionized 
• Extract additional information from clients 
• Screen resolution 
• Viewport size 
• Timezone
Look familiar?
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-40578233-2', 'godatadriven.com');
ga('send', 'pageview');

</script>
Divolte Collector 
Tag-based clickstream data collection for Hadoop and Kafka.
Divolte Collector
[diagram: the same tagging architecture as before, with Divolte Collector acting as the tracking server that turns asynchronous tracking traffic into structured events on the message queue or event transport (Kafka, Flume, etc.) for Hadoop and other consumers]
The TAG 
<script src="//tr.example.com/divolte.js" 
defer 
async> 
</script>
Schema! 
{ 
"namespace": "com.example.record", 
"type": "record", 
"name": "ClickEventRecord", 
"fields": [ 
{ "name": "productNumber", "type": ["null", "string"], "default": null }, 
{ "name": "shop", "type": ["null", "string"], "default": null }, 
{ "name": "category", "type": ["null", "string"], "default": null }, 
{ "name": "advisor", "type": ["null", "string"], "default": null }, 
{ "name": "searchPhrase", "type": ["null", "string"], "default": null }, 
{ "name": "basketProductNumber", "type": ["null", "string"], "default": null }, 
{ "name": "basketSizeCode", "type": ["null", "string"], "default": null }, 
{ "name": "basketProductCount", "type": ["null", "string"], "default": null } 
] 
}
Mapping 
// Page type detector:
// http://.../basket
basket = "^https?://[^/]+/basket(?:[?#].*)?$"

// Page type detector:
// http://.../search?q=fiets
search = "^https?://[^/]+/search(?:[?#].*)?$"

// Page type detector:
// http://.../checkout
checkout = "^https?://[^/]+/checkout(?:[?#].*)?$"

// Page type detector:
// http://.../thankyou
payment_ok = "^https://[^/]+/thankyou(?:[?#].*)?$"
Mapping 
pageType {
  type = regex_name
  regexes = [
    home, category, shop, basket, search, customercare
  ]
  field = location
}
productNumber {
  type = regex_group
  regex = pdp
  field = location
  group = product
}
viewportPixelWidth = viewportPixelWidth
viewportPixelHeight = viewportPixelHeight
screenPixelWidth = screenPixelWidth
screenPixelHeight = screenPixelHeight
Configure 
divolte {
  server {
    host = 0.0.0.0
    use_x_forwarded_for = true
    landing_page = false
  }

  tracking {
    cookie_domain = .example.com
    include "click-schema-mapping.conf"
    schema_file = /etc/divolte/ClickEventRecord.avsc
  }

  …
Configure 
  kafka_flusher {
    enabled = true
    producer = {
      metadata.broker.list = [
        "broker1:9092",
        "broker2:9092",
        "broker3:9092"
      ]
    }
  }

  …
Configure 
  hdfs_flusher {
    hdfs {
      replication = 3
    }

    simple_rolling_file_strategy {
      roll_every = 60 minutes
      sync_file_after_records = 1000
      sync_file_after_duration = 10 seconds

      working_dir = /divolte/inflight
      publish_dir = /divolte/published
    }
  }
}
Run 
./bin/divolte-collector
Demo: Javadoc analytics! 
javadoc -d outputdir \
  -bottom '<script src="//localhost:8290/divolte.js"
  defer async></script>' \
  -subpackages .
Kafka event consumer
private static class JavadocEventHandler implements EventHandler<JavadocEventRecord> {
  private static final String TCP_SERVER_HOST = "127.0.0.1";
  private static final int TCP_SERVER_PORT = 1234;

  private Socket socket = null;
  private OutputStream stream;

  @Override
  public void setup() throws Exception {
    socket = new Socket(TCP_SERVER_HOST, TCP_SERVER_PORT);
    stream = socket.getOutputStream();
  }

  @Override
  public void handle(JavadocEventRecord event) throws Exception {
    if (!event.getDetectedDuplicate()) {
      // Avro's toString already produces JSON.
      stream.write(event.toString().getBytes(StandardCharsets.UTF_8));
      stream.write("\n".getBytes(StandardCharsets.UTF_8));
    }
  }

  @Override
  public void shutdown() throws Exception {
    if (null != stream) stream.close();
    if (null != socket) socket.close();
  }
}
public static void main(String[] args) {
  final DivolteKafkaConsumer<JavadocEventRecord> consumer =
      DivolteKafkaConsumer.createConsumer(
          KAFKA_TOPIC,
          ZOOKEEPER_QUORUM,
          KAFKA_CONSUMER_GROUP_ID,
          NUM_CONSUMER_THREADS,
          () -> new JavadocEventHandler(),
          JavadocEventRecord.getClassSchema());

  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    System.out.println("Shutting down consumer.");
    consumer.shutdownConsumer();
  }));

  System.out.println("Starting consumer.");
  consumer.startConsumer();
}
SQL FTW!
CREATE EXTERNAL TABLE javadoc_analytics ( 
firstInSession boolean 
-- other fields are created automatically from schema 
) 
ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
LOCATION 
'/divolte/published' 
TBLPROPERTIES ( 
'avro.schema.url'='hdfs:///JavadocEventRecord.avsc' 
);
Python & Spark
export IPYTHON=1
export IPYTHON_OPTS="notebook --ip=0.0.0.0"
pyspark \
  --jars divolte-spark-assembly-0.1.jar \
  --driver-class-path divolte-spark-assembly-0.1.jar \
  --num-executors 40
Spark & Spark Streaming
import io.divolte.spark.avro._
import org.apache.avro.generic.IndexedRecord
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext()
val events = sc.newAvroFile[IndexedRecord](path)

// And then…
val records = events.toRecords
// or
val eventFields = events.fields("sessionId", "location", "timestamp")
import org.apache.avro.generic.GenericRecord
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
// divolteStream is an extension provided by the divolte-spark library;
// import its Kafka streaming support as documented in that project.

// Kafka configuration.
val consumerConfig = Map(
  "group.id" -> "some-id-for-the-consumer-group",
  "zookeeper.connect" -> "zookeeper-connect-string",
  "auto.commit.interval.ms" -> "5000",
  "auto.offset.reset" -> "largest"
)
val topicSettings = Map("divolte" -> Runtime.getRuntime.availableProcessors())

val sc = new SparkContext()
val ssc = new StreamingContext(sc, Seconds(15))

// Establish the source event stream.
val stream = ssc.divolteStream[GenericRecord](consumerConfig, topicSettings, StorageLevel.MEMORY_ONLY)

// And then…
val eventStream = stream.toRecords
// or
val locationStream = stream.fields("location")
Also in the box
Zero config deploy 
• Easy to use for local development 
• Works out of the box with zero custom config
• Comes with a built-in schema and mapping
• Works on local machine without Hadoop
• Flushes to /tmp on local file system
Collector has no global state 
• Load balancer friendly 
• Horizontally scalable 
• Shared nothing 
• (other than HDFS and Kafka)
In-stream de-duplication
• The internet is a mean place; data will have noise
• In-stream, hash-based de-duplication
• Low false negative rate
• Virtually zero false positive rate
• Requires URI-based routing from the load balancer
• Easy to set up on nginx
• Supported on many hardware load balancers
Corrupt request detection 
• The internet is still a mean place… some URLs get truncated
• Incomplete events are detected and discarded
Defeat Chrome’s pre-rendering 
• Chrome sometimes speculatively pre-renders 
pages in the background 
• This triggers JS even if the page is not shown 
• Unless you use the Page Visibility API to detect this
• Which we do
• We take care of many other JS caveats as well
Custom events 
• Divolte presents itself as a JS library 
• Map custom event parameters directly onto Avro 
fields 
<!-- client side --> 
<script> 
divolte.signal("addToBasket", { 
count: 2, 
productId: "a3bc38de" 
}) 
</script> 
// server side mapping
eventType = eventType

basketProductId {
  type = event_parameter
  name = productId
}
Bring your own IDs 
• Generate page view ID on server side 
• Possible to relate server side logging to page 
views and other client side events 
<script 
src="//…/divolte.js#a28de3bf42a5dc98c03" 
defer 
async> 
</script>
User agent parsing 
• On the fly parsing of user agent string 
• Uses: http://uadetector.sourceforge.net/ 
• Updates user agent database at runtime without 
restart
IP to geo coordinates 
• On the fly enrichment with geo coordinates 
based on IP address 
• MaxMind geoIP database 
• https://www.maxmind.com/en/geoip2-databases 
• Updates database at runtime without restart 
• Sets: 
• Latitude & longitude 
• Country, City, Subdivision
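Outside of Divolte, the same lookup can be done directly with MaxMind's GeoIP2 Java API; a small sketch assuming a local GeoLite2-City.mmdb file and an example IP (Divolte itself performs this enrichment in-stream before writing the Avro record).

import java.io.File;
import java.io.IOException;
import java.net.InetAddress;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;
import com.maxmind.geoip2.model.CityResponse;

// Stand-alone GeoIP2 city lookup sketch; database path and IP are examples.
public final class GeoLookup {
    public static void main(String[] args) throws IOException, GeoIp2Exception {
        DatabaseReader reader = new DatabaseReader.Builder(new File("GeoLite2-City.mmdb")).build();
        CityResponse response = reader.city(InetAddress.getByName("128.101.101.101"));

        System.out.println("country:     " + response.getCountry().getName());
        System.out.println("subdivision: " + response.getMostSpecificSubdivision().getName());
        System.out.println("city:        " + response.getCity().getName());
        System.out.println("latitude:    " + response.getLocation().getLatitude());
        System.out.println("longitude:   " + response.getLocation().getLongitude());
    }
}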
https://github.com/divolte/divolte-collector 
https://github.com/divolte/divolte-examples 
https://github.com/divolte/divolte-kafka-consumer 
https://github.com/divolte/divolte-spark
GoDataDriven 
We’re hiring / Questions? / Thank you! 
@asnare / @fzk 
signal@godatadriven.com 
Andrew Snare / Friso van Vollenhoven
