This talk is about building Audi's big data platform, from a first Hadoop PoC to a multi-tenant enterprise platform. Why a big data platform at all? We outline the requirements that drove the platform's development and the decisions we had to make along the way.
While setting up our big data infrastructure, we often had to strike a balance between enterprise integration and speed: for instance, whether to use the existing Active Directory for both LDAP and the KDC, or to set up our own KDC. Using a shared enterprise service like Active Directory means following its naming conventions and accepting restricted access, whereas running our own KDC gives us much more flexibility but adds another component for us to maintain. We show the advantages and disadvantages of each option and explain why we chose the approach we did.
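The trade-off is visible even at the configuration level: a dedicated Hadoop KDC means every client carries a Kerberos configuration pointing at it, possibly alongside the existing Active Directory realm for a cross-realm trust. A minimal sketch of such a krb5.conf (realm and host names are invented for illustration, not Audi's actual setup):

```ini
# /etc/krb5.conf -- illustrative realms and hosts only
[libdefaults]
    default_realm = HADOOP.EXAMPLE.AUDI

[realms]
    # Dedicated KDC operated by the platform team
    HADOOP.EXAMPLE.AUDI = {
        kdc = kdc1.hadoop.example.audi
        admin_server = kdc1.hadoop.example.audi
    }
    # Existing Active Directory realm, e.g. for a one-way trust
    AD.EXAMPLE.AUDI = {
        kdc = dc1.ad.example.audi
    }

[domain_realm]
    .hadoop.example.audi = HADOOP.EXAMPLE.AUDI
```

Running the KDC yourself makes principal naming and keytab handling flexible, at the cost of operating one more security-critical service.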
For ingesting both batch and streaming data, we use Apache Kafka. We explain why we run Kafka on a cluster separate from our Hadoop platform, and we discuss the pros and cons of the Kafka binary protocol versus HTTP REST, not only from a technical perspective but also from an organisational one, since the source systems are required to push data into Kafka.
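To make the protocol choice concrete: with Confluent's REST Proxy, a source system only needs HTTP and a JSON envelope, whereas the binary protocol requires a full Kafka client with Kerberos setup and direct broker connectivity. A sketch of the REST-side payload (topic contents are invented; the envelope shape follows the Confluent REST Proxy v2 API):

```python
import json

# JSON envelope the Confluent REST Proxy (v2 API) expects when a
# source system POSTs records to /topics/<topic> over HTTP.
records = {
    "records": [
        {"key": "robot-42", "value": {"torque_nm": 3.2}},
    ]
}
body = json.dumps(records)

# With the binary protocol, the same write would instead need a Kafka
# client library, SASL/Kerberos configuration and broker connectivity.
print(body)
```

The HTTP route is easier to audit and to pass through firewalls, which matters when many source teams have to integrate.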
We give an overview of our current architecture, including how some use cases are implemented on it. Some run exclusively on our new big data stack, while others use it in conjunction with our data warehouse. The use cases cover all kinds of data, from sensor data of robots in our plants to click streams from web applications.
Building an enterprise platform involves not only technical tasks but also organisational ones: data ownership, authorisation to access certain data sets, and more commercial matters such as internal pricing and SLAs.
Although we have already achieved quite a lot, our journey has not yet ended. There are still open topics to address, such as providing a unified logging solution for applications spanning multiple platforms, finally offering a notebook solution like Zeppelin to our analysts, and meeting legal requirements like GDPR.
We will conclude our talk with a short glimpse into our ongoing extension of our on-premises platform into a hybrid cloud platform.
Speakers
Carsten Herbe, Big Data Architect, Audi AG
Matthias Graunitz, Big Data Architect, Audi AG
Audi's Big Data Platform - Building Audi's Enterprise Big Data Platform
1. DataWorks Summit 2018 - Berlin
Building Audi's enterprise big data platform
Matthias Graunitz (AUDI AG, Germany)
Carsten Herbe (Audi Business Innovation GmbH, Germany)
2. AUDI AG DataWorks Summit 2018 - Building Audi's enterprise big data platform
WHO ARE WE?
3.
Audi Group
Audi, Lamborghini, Ducati and Italdesign
4.
Vorsprung is our promise
Strategy 2025
5.
Audi Business Innovation GmbH
...is the development, establishment, sales and operation of
innovative concepts, products and services, as well the holding
of shares in the field of future mobility.
Audi mobility
innovations
Audi on demand
Audi balanced
technologies
Audi e-gas
Audi customer
IT solutions
6.
About us
Matthias Graunitz
AUDI AG
» Center of Competence Big Data & BI
» Big Data Architect
» 10+ years Data Warehousing & BI
Carsten Herbe
Audi Business Innovation GmbH
» Data Platform & Solution Architecture
» Hadoop since 2013
» 10+ years Data Warehousing & BI
7.
2 YEARS AGO…
STARTING BIG DATA
AT AUDI
8.
Analytical Capabilities by 2015
[Capability map: AAP – AUDI ANALYTIC PLATFORM]
» Data Domains: Finance, Purchase, Production, Quality, Sales, Car Data
» Consumers: Programs, Projects, Data Scientists
» Embed Analytics: Complex Event Processing, Analytical APIs
» Deliver Information: Dashboarding, Planning & Simulation, Visual Analytics, BI Report & OLAP
» Analyze Data: Statistical Methods, Analytical Script
» Store, Distribute and Process Data: Data Warehouse, Analytical Databases, ETL Framework, Batch Processing, Data Access / APIs
» Secure Data: Authentication, Data Encryption, Auditing
» Infrastructure & Services: On-Prem Platform, Application Deployment, Hardware/Network/OS, Monitoring, Lifecycle Mgmt
» Manage Information: Development Process & Methods, Master Data Mgmt, Data Lineage
» Further capability areas: Provision Data, Deliver Service, Design & Maintain Solutions
10.
Analytical Capabilities by 2015
Same capability map as on the previous slide, now extended with the new big data capabilities: a Cloud Platform, File Systems (HDFS), Stream Processing and Machine Learning.
11.
Our first Hadoop Cluster 2015 (DEV)

Hadoop          per node   sum
# data nodes       1         4
RAM              128 GB    0.5 TB
Cores             24        96
HDD*              40 TB    160 TB

* Raw capacity without replication and FS overhead!
12.
Our first attempt to walk with Big Data Technologies
» SCREWDRIVER ANALYSIS
» COMPANY CAR ANALYSIS
13.
ENTERPRISE INTEGRATION
VS
SPEED OF DELIVERY
14.
Securing the Cluster as a multi-tenant environment
Step by step towards our target architecture …
» Basic security: iptables + SSH tunneling
» Dedicated network: BI Zone
» User management: local OS users, later LDAP
» Authentication: LDAP for Hive, later Kerberos (protection from inside)
» Protection from outside: Knox
» Access control: file attributes, then ACLs, finally Ranger (access control & audit)
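As an illustration of that last step: Ranger replaces scattered file attributes and ACLs with centrally managed policies. A hedged sketch of what an HDFS path policy looks like in Ranger's JSON policy model (service, path and group names are invented for illustration):

```json
{
  "service": "hadoop_hdfs",
  "name": "project-x-landing",
  "resources": {
    "path": { "values": ["/data/project-x"], "isRecursive": true }
  },
  "policyItems": [
    {
      "groups": ["project-x-developers"],
      "accesses": [
        { "type": "read", "isAllowed": true },
        { "type": "write", "isAllowed": true },
        { "type": "execute", "isAllowed": true }
      ]
    }
  ]
}
```

Central policies plus Ranger's audit log are what make multi-tenant access control reviewable at all.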
15.
Password Hell
[Diagram: the separate credentials a user needs at each layer]
Legend: password required / no password required / next step
» Audi Active Directory (AD users): named user, technical Hive user – used for the Knox-protected services (Hive, WebHDFS, Spark UI, HDFS/YARN)
» OS level on edge, name and data nodes (local users): OS named user, technical Hive user, technical project user, Hadoop user – reached via SSH to an edge node
» Hadoop KDC (Kerberos principals, obtained via kinit on the edge node): named user, technical Hive user, technical project user, Hadoop user
16.
DATA
INGESTION
17.
Data ingestion:
technical requirements from projects, security and ops

INGESTION
» Streaming data
» Batch data
» Easy writing to HDFS/DWH

DECOUPLING
» Data sources should not be directly coupled to analytical backend jobs
» This allows adding new analytical jobs without changing the source

HA & BUFFERING
» Data ingestion must be available 24x7
» Data must be buffered (persisted) in case the backend or a backend job is not available

SECURITY
» Source systems must not connect directly to the data zone (Hadoop, DWH) – required by IT Sec
» Authentication + data-in-motion encryption (multi-tenancy)
» Protocol must be auditable
» Some data sources run in the cloud

SCALABILITY
» Amount of data will increase over time for most projects
» Number of projects will increase
18.
Solution: Kerberized Confluent Kafka Platform
[Architecture diagram]
» Data source networks #1 … #n push through firewalls into the AAP Messaging Zone; the firewall hops from the source networks are the main pain point
» Messaging Zone: Kafka brokers, Zookeeper and the Schema Registry, kerberized against a dedicated DataProxy KDC; the Schema Registry is reached via HTTP
» Producers: Kafka clients in the source networks push over the binary protocol, authenticated via Kerberos and encrypted with SSL
» Consumers in the BI Data Zone: the HDFS Connector and Spark Streaming pull over the binary protocol and authenticate against the Hadoop KDC to write into kerberized HDFS
Legend: encrypted (SSL) / not encrypted / authentication / protocol & direction
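On the client side, a kerberized setup like this typically boils down to a handful of standard Kafka client properties. A hedged sketch for a producer (hostnames, realm, port and keytab path are invented; the property keys are standard Kafka SASL/SSL settings):

```ini
# producer.properties -- illustrative values only
bootstrap.servers=broker1.msg.example.audi:9093
security.protocol=SASL_SSL
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
    useKeyTab=true storeKey=true \
    keyTab="/etc/security/keytabs/source-app.keytab" \
    principal="source-app@DATAPROXY.EXAMPLE.AUDI";
ssl.truststore.location=/etc/kafka/ssl/truststore.jks
ssl.truststore.password=changeit
```

Each source system gets its own principal and keytab, which is what makes per-tenant authentication and auditing possible.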
19.
Kafka Distributed Connector: unsecured REST API
[Diagram: why an unsecured Kafka Connect REST API is dangerous in a multi-tenant setup]
» User Bob runs a Connect worker (Java process) on an edge node holding his Kafka and HDFS keytabs; his HDFS sink config writes topic "Bob" into Bob's data in HDFS
» User Eve can POST her own sink and source configs to Bob's worker over unsecured HTTP
» Eve's file sink and HDFS source then run inside Bob's worker, under Bob's keytabs, and can read Bob's data
Legend: evil connection / good connection
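To make the risk concrete: the Connect REST API accepts any well-formed connector config from anyone who can reach it. A sketch of the request body an attacker could submit (topic, file path and connector class are illustrative; the class name is Kafka's bundled file sink, and in a real attack this JSON would be POSTed to the worker's REST port):

```python
import json

# Illustrative config an attacker could submit to an unsecured Connect
# worker. The worker, running with the victim's keytabs, executes it.
evil_connector = {
    "name": "eve-exfil",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "topics": "bob-topic",
        "file": "/tmp/eve/bobs-data.txt",  # a location the attacker can read
        "tasks.max": "1",
    },
}

request_body = json.dumps(evil_connector)
print(request_body)
```

This is why the Connect REST endpoint needs the same protection (network isolation, authentication) as the brokers themselves.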
20.
TODAY
CURRENT STATE
21.
Architecture & Network Zones – Data Ingestion
[Diagram]
» Source systems (on premise and a cloud app) push data into the Data Proxy in the Messaging Zone, SSL-encrypted
» From the Messaging Zone, the HDFS Connector and Spark Streaming move the data into the BI Data Zone and the Data Warehouse
Legend: encrypted (SSL) / not encrypted
22.
Architecture & Network Zones – User & Developer Access
[Diagram]
» Users on Audi laptops in the Audi Office LAN connect via Remote Desktop to the BI Application Zone (data mining and dashboarding tools)
» From the BI Application Zone they access the AAP Data Warehouse and the BI Data Zone; deployments flow through the Deployment Zone (PIPE)
Legend: encrypted (SSL) / not encrypted
23.
Hadoop Cluster Sizing Production 2017

Hadoop (PROD)    per node   sum
# data nodes        1        12
RAM               512 GB     6 TB
Cores              24       288
HDD*               96 TB    1,152 TB

Kafka (PROD)     per node   sum
# broker nodes      1         4
RAM                32 GB    128 GB
Cores               6        24
HDD*                4 TB    16 TB

* Raw capacity without replication and FS overhead!
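The footnote matters in practice: with HDFS's default replication factor of 3 (an assumption here, not stated on the slide) and ignoring further filesystem overhead, usable capacity is roughly a third of raw. A quick sanity check for the cluster figures:

```python
def usable_tb(raw_tb, replication=3):
    """Rough usable HDFS capacity: raw capacity divided by the
    replication factor (filesystem overhead ignored)."""
    return raw_tb / replication

# First DEV cluster (2015): 160 TB raw across 4 nodes
print(round(usable_tb(160), 1))      # 53.3

# 2017 production cluster: 96 TB raw per node, 12 nodes
print(round(usable_tb(96 * 12), 1))  # 384.0
```

Rules of thumb like this are why raw disk numbers on sizing slides always deserve the asterisk.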
24.
Current state
Organisational Tasks
25.
Organisational Tasks
» Data ownership & data governance (data domain model with clear responsibility in each domain)
» Lifecycle management for each shared service, in close collaboration with the projects and programs
» Defined SLAs for each shared service, based on general availability, data loss, confidentiality and verifiability
» Different development lifecycles of car and backend systems
» Use of open-source software vs. support requirements from IT continuity
» Balance between a multi-tenant environment and flexibility
» Very long lifecycle of cars (> 10 years) with various built-in software versions
26.
TOMORROW
WHAT’S UP NEXT
27.
Hybrid Approach for the AAP
[Architecture diagram: on-premises/private cloud plus public cloud]
» On premise / private cloud: entry zone (web gateway, RDP gateway), application zone, data zone (HDP, Data Warehouse) and messaging zone (Kafka), secured with Knox
» Business users connect with full or web clients (Tableau, BO, etc.)
» Public cloud, connected via Direct Cloud Connect: Swarm VPC (Kafka, data inventory, repositories, Ingestor 1*) and Analytical VPC (HDP, Ingestor, Knox)
» Data sources on the internet are ingested into the cloud side