3. Who are we?
Our vision is to revolutionize the KPIs and metrics that the online
advertising industry currently uses. With our products,
Antifraud, Brandsafety and Viewability, we provide actionable
data to our customers.
9. How do we do it? DATA PLATFORM
...so we need to analyze vast amounts of data
Infrastructure + Big Data technologies = enbrite.ly data platform
10. Amazon Web Services (AWS)
● Most popular cloud service provider
● ~70 services, 13 geographical "regions"
● Amazon Big Data = Elastic Map Reduce
● BUT Do not trust the BIG guy (API problem)
https://aws.amazon.com/
11. Apache Hadoop
● de facto Big Data technology
● open source software
● distributed storage (HDFS) + data processing
(MapReduce)
● ecosystem: many additional software projects
http://hadoop.apache.org/ | https://github.com/apache/hadoop
12. Apache Spark
● large-scale data processing engine
● open source software (popular)
● modules: core, sql, streaming, graph, ML
● faster than Hadoop MapReduce
http://spark.apache.org/ | https://github.com/apache/spark
13. Data platform in numbers
20+ node cluster
16 services
110 servers
0.5-4 TB/day
100+ TB on S3
17. Real-world example
You have a simple idea for detecting bot traffic, one that will
save the world. Let’s implement it!
18. Real-world example
THE IDEA: Analyse events which are too hasty and deviate from
regular, humanlike profiles: too many clicks in a defined
timeframe.
INPUT: Collected events on Amazon S3
OUTPUT: Invalid sessions
19. How to solve it?
Step 1: sessionize events
Step 2: detect too many clicks
code: https://github.com/enbritely/startup-safary
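The two steps can be sketched in plain Python as follows. This is only a minimal illustration of the idea, not the code from the repository linked above. The Event fields (user_id, timestamp, kind), the 30-minute session gap, the 10-second window and the limit of 5 clicks are all assumptions made up for this example; a production version would more likely run as a Spark job over the events stored on S3.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

# All thresholds below are illustrative assumptions, not values from the talk.
SESSION_GAP_SECONDS = 30 * 60   # inactivity gap that starts a new session
WINDOW_SECONDS = 10             # the "defined timeframe"
MAX_CLICKS_PER_WINDOW = 5       # more clicks than this is treated as non-human


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real collected events may look different.
    user_id: str
    timestamp: float  # seconds since epoch
    kind: str         # e.g. "click" or "pageview"


def sessionize(events: Iterable[Event]) -> List[List[Event]]:
    """Step 1: group each user's events into sessions split by inactivity."""
    by_user: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        by_user[event.user_id].append(event)

    sessions: List[List[Event]] = []
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e.timestamp)
        current = [user_events[0]]
        for event in user_events[1:]:
            # A long pause closes the current session and opens a new one.
            if event.timestamp - current[-1].timestamp > SESSION_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(event)
        sessions.append(current)
    return sessions


def has_too_many_clicks(session: List[Event]) -> bool:
    """Step 2: sliding window over the click timestamps of one session."""
    clicks = [e.timestamp for e in session if e.kind == "click"]
    start = 0
    for end, ts in enumerate(clicks):
        # Move the left edge until the window spans at most WINDOW_SECONDS.
        while ts - clicks[start] > WINDOW_SECONDS:
            start += 1
        if end - start + 1 > MAX_CLICKS_PER_WINDOW:
            return True
    return False


def invalid_sessions(events: Iterable[Event]) -> List[List[Event]]:
    """INPUT: collected events. OUTPUT: sessions flagged as invalid."""
    return [s for s in sessionize(events) if has_too_many_clicks(s)]
```

With these assumed thresholds, a visitor clicking once per second for eight seconds is flagged, while one clicking once per minute is not.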