Divolte Collector - meetup presentation
1. Divolte Collector
Because life’s too short for log file parsing
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@asnare / @fzk
signal@godatadriven.com
Andrew Snare / Friso van Vollenhoven
4. Typical web optimization architecture
[Diagram: a user's HTTP request (e.g. /org/apache/hadoop/io/IOUtils.html) hits the service, which writes a log event (2012-07-01T06:00:02.500Z /org/apache/hadoop/io/IOUtils.html). Logs are transported in batch to a compute cluster for offline analytics / model training, which batch-updates the model state; a streaming log path updates the model state continuously; the model state serves results (e.g. recommendations) back to the user.]
6. How did it get there?
Option 1: parse HTTP server logs
• Ship log files on a schedule
• Parse using MapReduce jobs
• Batch analytics jobs feed online systems
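The parsing step such batch jobs perform can be sketched as follows. This is a minimal illustration with assumed field names, handling only one Common Log Format variant; production parsers deal with many more formats and edge cases:

```scala
// Minimal sketch of log parsing: extract IP, timestamp, path and status
// from a Common Log Format line with a regex (illustrative, not robust).
case class LogEvent(ip: String, timestamp: String, path: String, status: Int)

val LogLine = """^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) HTTP/[\d.]+" (\d{3}) .*""".r

def parse(line: String): Option[LogEvent] = line match {
  case LogLine(ip, ts, path, status) => Some(LogEvent(ip, ts, path, status.toInt))
  case _                             => None // corrupt or unexpected line
}
```

Anything the regex does not recognize yields `None`, which is exactly the kind of parser logic (and versioning) the next slide complains about.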
7. HTTP server log parsing
• Inherently batch oriented
• Schema-less (URL format is the schema)
• Initial job to parse logs into structured format
• Usually multiple versions of parsers required
• Requires sessionizing
• Logs usually contain more than you need (bots, image requests, spiders, health checks, etc.)
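The filtering that last bullet implies might look like this; the asset extensions and bot patterns are illustrative, not exhaustive:

```scala
// Illustrative filter: drop static-asset requests and known crawler
// user agents before analysis (patterns assumed, not exhaustive).
val AssetExtensions = Set(".png", ".jpg", ".gif", ".css", ".js", ".ico")
val BotPattern = """(?i).*(bot|spider|crawler).*""".r

def isNoise(path: String, userAgent: String): Boolean =
  AssetExtensions.exists(ext => path.endsWith(ext)) ||
    BotPattern.pattern.matcher(userAgent).matches
```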
8. Stream HTTP server logs
[Diagram: tail -F on access.log feeds events into a message queue or event transport (Kafka, Flume, etc.), from which other consumers read the event stream.]
9. How did it get there?
Option 2: stream HTTP server logs
• tail -F logfiles
• Use a queue for transport (e.g. Flume or Kafka)
• Parse logs on the fly
• Or write semi-schema’d logs, like JSON
• Parse again for batch work load
10. Stream HTTP server logs
• Allows near real-time event handling when consuming from queues
• Sessionizing? Duplicates? Bots?
• Still requires parser logic
• No schema
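The sessionizing work the bullets allude to can be sketched as follows, for one visitor's events sorted by time. The 30-minute inactivity gap is a common convention assumed here, not a Divolte setting:

```scala
// Minimal sessionizing sketch: start a new session after 30 minutes of
// inactivity. Timestamps are epoch millis for a single visitor, sorted.
val SessionGapMs = 30 * 60 * 1000L

def sessionize(timestamps: Seq[Long]): Seq[Seq[Long]] =
  timestamps.foldLeft(Vector.empty[Vector[Long]]) {
    case (sessions :+ current, t) if t - current.last < SessionGapMs =>
      sessions :+ (current :+ t)   // within the gap: continue current session
    case (sessions, t) =>
      sessions :+ Vector(t)        // gap exceeded (or first event): new session
  }
```

Doing this in a batch job means re-grouping all events by visitor first, which is part of why client-side sessionizing (slide 14) is attractive.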
11. Tagging
[Diagram: the web server serves page traffic (index.html, script.js); the script sends asynchronous tracking traffic to a dedicated tracking server, which publishes structured events to a message queue or event transport (Kafka, Flume, etc.) for other consumers. The web server's access.log plays no role in the structured event flow.]
12. How did it get there?
Option 3: tagging
• Instrument pages with a special ‘tag’: JavaScript or an image whose only purpose is logging the request
• Create a dedicated endpoint that handles the tag request in a structured way
• The tag endpoint handles logging the events
13. Tagging
• Not a new idea (Google Analytics, Omniture, etc.)
• Less garbage traffic, because a browser is required to evaluate the tag
• Event logging is asynchronous
• Easier to do in-flight processing (apply a schema, add enrichments, etc.)
• Allows for custom events (other than page views)
14. Also…
• Manage the session through cookies on the client side
• Incoming data is already sessionized
• Extract additional information from clients
• Screen resolution
• Viewport size
• Timezone
17. Divolte Collector
[Diagram: the same tagging architecture, with Divolte Collector as the tracking server: the web server serves index.html and script.js, the script sends asynchronous tracking traffic to the collector, and the collector publishes structured events to a message queue or event transport (Kafka, Flume, etc.) for other consumers.]
18. The TAG
<script src="//tr.example.com/divolte.js"
defer
async>
</script>
36. import io.divolte.spark.avro._
import org.apache.avro.generic.IndexedRecord
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext()
val events = sc.newAvroFile[IndexedRecord](path)

// And then…
val records = events.toRecords
// or
val eventFields = events.fields("sessionId", "location", "timestamp")
37. // Kafka configuration.
val consumerConfig = Map(
  "group.id" -> "some-id-for-the-consumer-group",
  "zookeeper.connect" -> "zookeeper-connect-string",
  "auto.commit.interval.ms" -> "5000",
  "auto.offset.reset" -> "largest"
)
val topicSettings = Map("divolte" -> Runtime.getRuntime.availableProcessors())

val sc = new SparkContext()
val ssc = new StreamingContext(sc, Seconds(15))

// Establish the source event stream.
val stream = ssc.divolteStream[GenericRecord](consumerConfig, topicSettings, StorageLevel.MEMORY_ONLY)

// And then…
val eventStream = stream.toRecords
// or
val locationStream = stream.fields("location")
39. Zero-config deploy
• Easy to use for local development
• Works out of the box with zero custom config
• Comes with a built-in schema and mapping
• Works on a local machine without Hadoop
• Flushes to /tmp on the local file system
40. Collector has no global state
• Load balancer friendly
• Horizontally scalable
• Shared nothing
• (other than HDFS and Kafka)
41. In-stream de-duplication
• The internet is a mean place; data will contain noise
• In-stream, hash-based de-duplication
• Low false-negative rate
• Virtually zero false-positive rate
• Requires URI-based routing from the load balancer
• Easy to set up on nginx
• Supported by many hardware load balancers
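Hash-based in-stream de-duplication can be sketched as follows. This is an assumed minimal scheme (bounded window of recently seen event hashes), not Divolte's actual algorithm:

```scala
import scala.collection.mutable
import java.security.MessageDigest

// Sketch of hash-based in-stream de-duplication: keep a bounded window of
// recently seen event hashes and drop events whose hash was already seen.
// (Assumed scheme for illustration; not Divolte's implementation.)
class Deduplicator(capacity: Int) {
  private val seen = mutable.LinkedHashSet.empty[String]

  private def sha256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  /** True if this event is new within the window; false for a duplicate. */
  def firstSeen(event: String): Boolean = {
    val h = sha256(event)
    if (seen.contains(h)) false
    else {
      if (seen.size >= capacity) seen -= seen.head // evict oldest hash
      seen += h
      true
    }
  }
}
```

A false negative here is a duplicate slipping through (window too small); a false positive would require a hash collision, hence "virtually zero". URI-based routing matters because each collector node only sees its own window.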
42. Corrupt request detection
• The internet is still a mean place… some URLs arrive truncated
• Incomplete events are detected and discarded
43. Defeat Chrome’s pre-rendering
• Chrome sometimes speculatively pre-renders pages in the background
• This triggers JS even if the page is never shown
• Unless you use the Page Visibility API to detect this
• Which we do
• We take care of many other JS caveats as well
44. Custom events
• Divolte presents itself as a JS library
• Map custom event parameters directly onto Avro fields
<!-- client side -->
<script>
divolte.signal("addToBasket", {
count: 2,
productId: "a3bc38de"
})
</script>
// server side mapping
eventType = eventType
!
basketProductId {
type = event_parameter
name = productId
}
45. Bring your own IDs
• Generate the page view ID on the server side
• Makes it possible to relate server-side logging to page views and other client-side events
<script
src="//…/divolte.js#a28de3bf42a5dc98c03"
defer
async>
</script>
46. User agent parsing
• On-the-fly parsing of the user agent string
• Uses: http://uadetector.sourceforge.net/
• Updates the user agent database at runtime without a restart
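For illustration only (this is not the uadetector API), the kind of structured result such a parser extracts from a raw user-agent string looks like this; real parsers rely on a curated, updatable database rather than two regexes:

```scala
// Toy illustration of user-agent parsing: raw string in, structured
// fields out. Patterns are assumed and far from complete.
case class UserAgent(family: String, version: String)

val ChromeUA  = """.*Chrome/([\d.]+).*""".r
val FirefoxUA = """.*Firefox/([\d.]+).*""".r

def parseUserAgent(ua: String): UserAgent = ua match {
  case ChromeUA(v)  => UserAgent("Chrome", v)
  case FirefoxUA(v) => UserAgent("Firefox", v)
  case _            => UserAgent("Unknown", "")
}
```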
47. IP to geo coordinates
• On-the-fly enrichment with geo coordinates based on the IP address
• Uses the MaxMind GeoIP database
• https://www.maxmind.com/en/geoip2-databases
• Updates the database at runtime without a restart
• Sets:
• Latitude & longitude
• Country, city, subdivision
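A hypothetical sketch of the enrichment's shape: a real deployment looks the address up in the MaxMind GeoIP2 database; the table below is a stand-in with made-up coordinates.

```scala
// Shape of the geo enrichment: IP string in, optional location out.
// The lookup table is a stand-in for the MaxMind database; the entry
// uses an RFC 5737 example address with made-up coordinates.
case class GeoLocation(latitude: Double, longitude: Double,
                       country: String, city: String)

val geoTable: Map[String, GeoLocation] = Map(
  "192.0.2.1" -> GeoLocation(52.37, 4.89, "NL", "Amsterdam")
)

def enrich(ip: String): Option[GeoLocation] = geoTable.get(ip)
```

Returning `Option` mirrors the reality that not every address resolves to a location; downstream consumers must handle the missing case.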