Apache Kafka at trivago
2017-01-25, Munich, Germany
Clemens Valiente
Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Senior Data Engineer
trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Collecting price information for BI
As a hotel price comparison engine, our most valuable information is hotel prices. They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence.
With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
The past: Data pipeline 2010 – 2015
[Diagram: Java Software Engineering, BI, Warehouse]
The past: Data pipeline 2010 – 2015
Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years
Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: The first price per key wins
Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
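The "insert ignore" restriction above can be illustrated with a small sketch. It is not the original implementation: the in-memory dictionary stands in for the BI table, and the key layout (hotel, website, arrival date, day of logging) and all names are assumptions made for illustration.

```python
from datetime import date

# Hypothetical key: (hotel_id, website_id, arrival_date, log_date).
PriceKey = tuple[int, int, date, date]


def insert_ignore(table: dict[PriceKey, float], key: PriceKey, price: float) -> bool:
    """Store the price only if the key is new; return True if it was stored."""
    if key in table:
        return False  # a price for this key already exists, the new one is dropped
    table[key] = price
    return True


# Hypothetical stand-in for the BI price table.
table: dict[PriceKey, float] = {}
key = (42, 7, date(2015, 3, 1), date(2015, 1, 10))
first = insert_ignore(table, key, 99.0)   # first price for the key is stored
second = insert_ignore(table, key, 89.0)  # later price for the same key is ignored
```

Under this rule, any price change within the same day for the same hotel, website and arrival date never reaches BI, which is one of the restrictions listed above.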
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • More prices: length of stay, room type, breakfast info, room category, domain
  • With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
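To make the "log everything" requirement concrete, here is a rough sketch of what a richer price log record could look like. The slides only name the dimensions and price components. Every field name, type and sample value below is an assumption made for illustration, not the actual trivago schema.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class PriceLogRecord:
    # Identity of the price (hypothetical field names).
    hotel_id: int
    website_id: int
    arrival_date: date
    # "More prices": additional dimensions named on the slide.
    length_of_stay: int
    room_type: str
    room_category: str
    breakfast_included: bool
    domain: str
    # "More information": price components named on the slide.
    net_price: float
    gross_price: float
    city_tax: float
    resort_fee: float
    affiliate_fee: float
    vat: float


# Sample values are invented purely to show the shape of a record.
record = PriceLogRecord(
    hotel_id=42, website_id=7, arrival_date=date(2017, 3, 1),
    length_of_stay=3, room_type="double", room_category="standard",
    breakfast_included=True, domain="de",
    net_price=100.0, gross_price=119.0, city_tax=2.5,
    resort_fee=0.0, affiliate_fee=5.0, vat=19.0,
)
fields = asdict(record)  # plain dict, ready for a serializer of choice
```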
Present data pipeline 2017 – ingestion
[Diagram locations: San Francisco, Düsseldorf, Hong Kong]
Present data pipeline 2017 – processing
Camus
Present data pipeline 2017 – results after two years in production
• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders very happy
• Faster results
• Better quality of results due to more data
• More detailed results
• => Shorter research phase, more and better stories
• => Fewer requests & less workload for BI
Present data pipeline 2017 – facts & figures
Kafka Cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing
Data Size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics
Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Pretty much continuously in 10 minute intervals
- To be replaced by Gobblin/Kafka Connect
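A back-of-the-envelope calculation puts the figures above into perspective. It assumes traffic is spread evenly over the day and that every price-log message passes through a Camus run. Neither assumption is stated on the slides, so the results are rough averages only.

```python
MESSAGES_PER_DAY = 10_000_000_000    # "10 billion messages/day"
TOTAL_MESSAGES = 4_000_000_000_000   # "over 4 trillion", used as a lower bound
MAPPERS = 15                         # Camus mappers running in parallel
RUN_INTERVAL_MINUTES = 10            # Camus run interval

SECONDS_PER_DAY = 24 * 60 * 60
RUNS_PER_DAY = 24 * 60 // RUN_INTERVAL_MINUTES  # 144 runs

# Average ingest rate, assuming uniform traffic (real traffic will have peaks).
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Average batch size per Camus run and per mapper, under the same assumption.
messages_per_run = MESSAGES_PER_DAY / RUNS_PER_DAY
messages_per_mapper_per_run = messages_per_run / MAPPERS

# Days of traffic at the current daily rate needed to reach the stated total.
days_at_current_rate = TOTAL_MESSAGES / MESSAGES_PER_DAY
```

Under these assumptions the price log averages roughly 116,000 messages per second, about 69 million messages per 10-minute run, and about 4.6 million messages per mapper per run. The stated total corresponds to at least 400 days of traffic at the current daily rate.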
Present data pipeline 2017 – use cases & status quo
Uses for price information
- Monitoring price parity in hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors
Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every Euro of revenue at some point was a message in Kafka
Status quo
- Our entire BI business logic runs on and through the Kafka – Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Ongoing Projects: Breaking up the Monolith
[Diagram locations: Düsseldorf, Leipzig, Palma]
Key challenges and learnings
● Settle on a common message format (Avro/Protobuf, not csv or json)
● A common message envelope is helpful (e.g. header with timestamp and sender)
● For stream processing repeat your key in your message value
● Monitor your consumer offsets with an audit log, especially across data centres
● Turn off auto creation of topics, but have a process in place for topic creation
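Two of the learnings above, the common envelope and the key repeated in the value, can be sketched as plain data structures. This is only an illustration of the message shape. The class names, field names and the `build_record` helper are hypothetical, and a real pipeline would serialize the value with Avro or Protobuf as recommended above.

```python
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Header:
    timestamp_ms: int  # when the message was produced
    sender: str        # which service produced it


@dataclass(frozen=True)
class PriceKey:
    hotel_id: int
    website_id: int


@dataclass(frozen=True)
class PriceMessage:
    header: Header   # common envelope shared by all message types
    key: PriceKey    # same fields as the message key, repeated in the value
    price: float


def build_record(sender: str, hotel_id: int, website_id: int, price: float,
                 now_ms: Optional[int] = None) -> tuple[PriceKey, PriceMessage]:
    """Return the (key, value) pair that would be handed to a producer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    key = PriceKey(hotel_id, website_id)
    value = PriceMessage(Header(now_ms, sender), key, price)
    return key, value


key, value = build_record("price-service", hotel_id=42, website_id=7,
                          price=99.0, now_ms=1000)
```

Because the value carries its own key, a stream processor that re-keys or aggregates the stream can still recover the original key from the value alone. For the last learning, the relevant Kafka broker setting is `auto.create.topics.enable=false`.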
Thank you! Questions and comments?
● Thanks to Jan Filipiak for his brainpower behind most projects
● Additional resources:
  ● https://github.com/trivago/gollum An n:m message multiplexer written in Go
  ● https://github.com/trivago/triava TriavaCache, JSR107 compliant cache

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Kafka at trivago

  • 1. Apache Kafka at trivago. 2017-01-25, Munich, Germany. Clemens Valiente
  • 2. Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician; studied at Uni Erlangen; at trivago for 5 years. Email: clemens.valiente@trivago.com, de.linkedin.com/in/clemensvaliente
  • 3. Collecting price information for BI: As a hotel price comparison engine, our most valuable information is hotel prices. They are not only shown to our visitors to support their hotel booking decisions, but also stored and later analyzed by Business Intelligence. With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
  • 4. The past: Data pipeline 2010 – 2015
  • 5. The past: Data pipeline 2010 – 2015 (diagram label: Java Software Engineering)
  • 6-7. The past: Data pipeline 2010 – 2015 (diagram labels: Java Software Engineering, BI Warehouse)
  • 8-10. The past: Data pipeline 2010 – 2015, facts & figures
      Price dimensions
      - Around one million hotels
      - 250 booking websites
      - Travellers search for up to 180 days in advance
      - Data collected over five years
      Restrictions
      - Only single night stays
      - Only prices from European visitors
      - Prices cached up to 30 minutes
      - One price per hotel, website and arrival date per day
      - “Insert ignore”: The first price per key wins
      Size of data
      - We collected a total of 56 billion prices in those five years
      - Towards the end of this pipeline, in early 2015, around 100 million prices per day on average were written to BI
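The "insert ignore" restriction can be illustrated with a small sketch. This is not the original pipeline code: the function name, the in-memory dictionary, and the key layout (hotel, website, arrival date, collection day) are assumptions chosen to mirror the "one price per hotel, website and arrival date per day" rule.

```python
from datetime import date
from typing import Dict, Tuple

# Hypothetical key layout: (hotel_id, website_id, arrival_date, collection_day)
PriceKey = Tuple[int, int, date, date]


def insert_ignore(store: Dict[PriceKey, float], key: PriceKey, price: float) -> bool:
    """Store the price only if the key is new; return True if it was stored."""
    if key in store:
        # A price for this key already exists, so the new one is dropped.
        return False
    store[key] = price
    return True


store: Dict[PriceKey, float] = {}
key = (1, 2, date(2015, 1, 10), date(2015, 1, 1))
first_stored = insert_ignore(store, key, 99.0)   # first price for the key is kept
second_stored = insert_ignore(store, key, 89.0)  # later price for the same key is ignored
```

With this rule, any price change later on the same day is lost, which is one reason the restriction limits what BI can analyze.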
  • 11-15. The past: Data pipeline 2010 – 2015 (diagram labels: Java Software Engineering, BI Warehouse)
  • 16. Refactoring the pipeline: Requirements
      - Scales with an arbitrary amount of data (future-proof)
      - Reliable and resilient
      - Low performance impact on the Java backend
      - Long-term storage of raw input data
      - Fast processing of filtered and aggregated data
      - Open source
      - We want to log everything:
        - more prices: length of stay, room type, breakfast info, room category, domain
        - with more information: net & gross price, city tax, resort fee, affiliate fee, VAT
  • 17-18. Present data pipeline 2017 – ingestion (diagram label: Düsseldorf)
  • 19. Present data pipeline 2017 – ingestion (diagram labels: San Francisco, Düsseldorf, Hong Kong)
  • 20. Present data pipeline 2017 – processing (diagram label: Camus)
  • 21. Present data pipeline 2017 – results after two years in production
      - Very reliable, barely any downtime or service interruptions of the system
      - Java team is very happy: less load on their system
      - BI team is very happy: more data, more resources to process it
      - Stakeholders are very happy:
        - Faster results
        - Better quality of results due to more data
        - More detailed results
        - => Shorter research phase, more and better stories
        - => Fewer requests and less workload for BI
  • 22-24. Present data pipeline 2017 – facts & figures
      Kafka cluster specifications
      - Cluster of 5 machines in each data centre for logs
      - An additional cluster of two machines in Düsseldorf for aggregation/stream processing
      Data size (price log)
      - Over 4 trillion messages collected so far
      - 10 billion messages/day
      - Over a hundred topics
      Camus
      - MapReduce application that writes prices to HDFS
      - 15 mappers running in parallel
      - Runs pretty much continuously, in 10-minute intervals
      - To be replaced by Gobblin/Kafka Connect
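For a sense of scale, the data-size figures from the slides can be turned into rough rates. The sketch below is back-of-the-envelope arithmetic only: it assumes load is spread evenly over the day, and it treats one message in the price log as roughly comparable to one price in the old pipeline, which the slides do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the slides
old_prices_per_day = 100_000_000        # old pipeline, early 2015
new_messages_per_day = 10_000_000_000   # Kafka price log, 2017
total_messages = 4_000_000_000_000      # "over 4 trillion", used as a lower bound

# Derived values (averages, assuming uniform load)
avg_messages_per_second = new_messages_per_day / SECONDS_PER_DAY
growth_factor = new_messages_per_day / old_prices_per_day
days_at_current_rate = total_messages / new_messages_per_day

print(f"{avg_messages_per_second:,.0f} messages/s on average")   # 115,741
print(f"{growth_factor:.0f}x the daily volume of the old pipeline")  # 100x
print(f"{days_at_current_rate:.0f} days of data at the current rate")  # 400
```

Real traffic peaks above the average, so the clusters have to handle more than this per-second figure.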
  • 25-27. Present data pipeline 2017 – use cases & status quo
      Uses for price information
      - Monitoring price parity in the hotel market
      - Anomaly and fraud detection
      - Price feed for online marketing
      - Display of price development and delivering price alerts to website visitors
      Other data sources and usage
      - Clicklog information from our website and mobile app
      - Used for marketing performance analysis, product tests, invoice generation etc.
      - Every Euro of revenue at some point was a message in Kafka
      Status quo
      - Our entire BI business logic runs on and through the Kafka-Hadoop pipeline
      - Almost all departments rely on data, insights and metrics delivered by Hadoop
      - Most of the company could not do their job without Hadoop data
  • 30. Key challenges and learnings
      - Settle on a common message format (Avro/Protobuf, not CSV or JSON)
      - A common message envelope is helpful (e.g. header with timestamp and sender)
      - For stream processing, repeat your key in your message value
      - Monitor your consumer offsets with an audit log, especially across data centres
      - Turn off auto creation of topics, but have a process in place for topic creation
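The envelope and "repeat your key in the value" learnings can be sketched as follows. All names here (`Envelope`, `build_record`, the field names, the sample key and sender) are hypothetical and not taken from trivago's code. Serialization is left out on purpose; in line with the first learning, the value would be encoded with Avro or Protobuf before being sent to Kafka.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Envelope:
    """Hypothetical common envelope: header fields plus the actual payload."""
    timestamp_ms: int         # header: when the message was produced
    sender: str               # header: which service produced it
    key: str                  # the message key, repeated inside the value
    payload: Dict[str, Any]   # the business data


def build_record(key: str, sender: str, payload: Dict[str, Any],
                 now_ms: Optional[int] = None) -> Tuple[str, Envelope]:
    """Return a (key, value) pair in which the value also carries the key."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return key, Envelope(timestamp_ms=now_ms, sender=sender, key=key, payload=payload)


# Example values are made up for illustration.
record_key, record_value = build_record(
    "hotel-42", "price-logger", {"price_eur": 99.0}, now_ms=1485302400000
)
```

Carrying the key inside the value means a stream processor that only sees or re-keys the value can still recover the original key.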
  • 31. Thank you! Questions and comments?
  • 32. Thanks to Jan Filipiak for his brainpower behind most projects
      Additional resources:
      - https://github.com/trivago/gollum: an n:m message multiplexer written in Go
      - https://github.com/trivago/triava: TriavaCache, a JSR107-compliant cache