Twitter's Data Platform is built from multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. The platform provides storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users with both batch and real-time analytics. It operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled the platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extended our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and we present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we explore in depth in this presentation.
3. Why Cloud?
- Provides a convenient way to test Hadoop changes at scale
- Ability to rapidly grow / shrink capacity temporarily
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow, etc.
4. Partly Cloudy
A project to extend Data Processing at Twitter
from an on-premises only model
to a hybrid on-premises and Cloud model
7. Design considerations
- User Experience: Consistency in user experience for on-premises & in-cloud data processing
- Scalability: Ability to scale out to handle all datasets & all users from day 1
- Onboarding: Seamless onboarding experience
- New Avenues: Data access in new processing tools in cloud
8. Design principles
- Authentication: Strong authentication for all user and service access to data
- Authorization: Explicit authorization for all user and service access to data; least privileged access
- Audit: Ability to easily determine who performed what actions on the data
9. Workstreams
● Various focus areas across the tech stack
○ Networking
○ GCP config
○ Replication
○ Data Processing Tools
○ Internal services
● Collaboration across teams within Twitter
● Collaboration with Google
11. Data Infrastructure for Analytics
[Diagram: Hadoop clusters in each data center, fronted by a Data Access Layer, each with its own Replication Service and Retention Service]
12. Extending Replication to GCS
[Diagram: Hadoop clusters (M, N, C, Z, L, X-1, X-2) across DataCenter 1 and DataCenter 2 replicating datasets to GCS]
● Same dataset available on GCS for users
● Unlocks Presto on GCP, Hadoop on GCP, BigQuery and other tools
19. Partly Cloudy Resource Hierarchy
[Diagram: TWITTER Org → DATA INFRA Folder → GCP Projects (twitter-product, twitter-revenue, twitter-infraeng)]
20. Project contents
[Diagram: Each GCP project contains a Dataset bucket, User bucket, Scratch bucket, and Scrubbed bucket on Google Cloud Storage. A Twitter-managed Hadoop cluster (Name Nodes, Resource Manager, Worker Nodes and tasks) reaches these buckets through a ViewFS filesystem layer and the Google Cloud Storage Connector for Hadoop, using shadow-account-based access for cluster worker tasks and user-account-based access for users (e.g. from Nest nodes)]
21. Replicators per project
[Diagram: In the Twitter DataCenter, a shared Copy Cluster runs Distcp jobs for Replicator X, Replicator Y, and Replicator Z; each replicator copies its dataset partitions (e.g. /gcs/dataX/2019/04/10/03) to Cloud Storage in its own GCP project (Project X, Project Y, Project Z)]
28. Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter's key distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
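To make the rotation scheme concrete, here is a minimal sketch (not Twitter's actual key distribution service; the epoch math, interval value, and names are assumptions) of how overlapping validity windows can be derived from the generation interval N:

import java.time.Duration;
import java.time.Instant;

// Minimal sketch of overlapping key validity: a new key is cut every N days
// and stays valid for 3N days (2N + N), so a valid key is always available
// even if distribution of the newest key is delayed. Names are hypothetical.
public class KeyRotationSketch {
    static final int N_DAYS = 7; // assumed generation interval

    // Key id for the N-day epoch that covers the given instant.
    static long keyIdFor(Instant now) {
        return now.getEpochSecond() / Duration.ofDays(N_DAYS).getSeconds();
    }

    // A key generated in epoch `keyId` is accepted while we are within
    // three generation intervals of its creation.
    static boolean isValid(long keyId, Instant now) {
        long ageInEpochs = keyIdFor(now) - keyId;
        return ageInEpochs >= 0 && ageInEpochs < 3;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long current = keyIdFor(now);
        System.out.println("current key id: " + current);
        System.out.println("previous key still valid: " + isValid(current - 1, now));
        System.out.println("3-epoch-old key valid: " + isValid(current - 3, now));
    }
}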
31. What are DemiGod services?
DemiGod is a group of services responsible for configuring GCP for Twitter's Data Platform.
They run in GCP.
32. Salient features of DemiGods
- Run asynchronously from each other
- Run with exactly-scoped, privileged Google service accounts
- Idempotent runs
- Puppet-like behavior: they will override any manual changes
- Modular in design
- Each kept as simple as possible
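The idempotent, Puppet-like behavior can be pictured as a reconciliation loop. The sketch below is purely illustrative (the StorageAdmin interface, method names, and the bucket-creation focus are assumptions, not Twitter's actual DemiGod code):

import java.util.List;
import java.util.Set;

// Illustrative reconciliation loop for a bucket-creation DemiGod: it compares
// the desired state (from config) with what actually exists in GCP and
// converges toward the desired state on every run, so re-running is a no-op
// and any manual drift gets overridden. All names here are hypothetical.
public class BucketCreationDemigod {

    // Thin abstraction over the GCS admin surface used by this sketch.
    interface StorageAdmin {
        Set<String> listBuckets(String projectId);
        void createBucket(String projectId, String bucketName);
        void applyBucketPolicy(String projectId, String bucketName);
    }

    private final StorageAdmin admin;

    BucketCreationDemigod(StorageAdmin admin) {
        this.admin = admin;
    }

    // One idempotent pass: create only what is missing, then (re)apply policies
    // unconditionally so manual policy changes are overridden.
    void reconcile(String projectId, List<String> desiredBuckets) {
        Set<String> existing = admin.listBuckets(projectId);
        for (String bucket : desiredBuckets) {
            if (!existing.contains(bucket)) {
                admin.createBucket(projectId, bucket);
            }
            admin.applyBucketPolicy(projectId, bucket);
        }
    }
}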
33. Deployment of DemiGods
[Diagram: DemiGods run in a Partly Cloudy Admin Project and configure the Twitter infra-eng, product, and user projects: bucket-creation for the infra-eng org (svc-acc-ie), bucket-creation for product (svc-acc-product), shadow-user-creation, policy-granting-ie, and key-rotation/creation, backed by a Key/Secrets store, LDAP/Google Groups, and a GCS config bucket]
34. What do the Data Processing Users at Twitter get
❏ Datasets replicated on GCS
❏ A shadow account to access GCS
❏ GCS buckets for their scratch & scrubbed data
❏ Access to a Twitter managed Hadoop cluster in GCP
❏ Access to a Twitter managed Presto cluster in GCP
❏ Exploring other Google offerings (such as BigQuery, DataProc & DataFlow)
35. Where are we today
● Copied tens of petabytes of data and keeping them in sync
● Tens of different projects with hundreds of buckets
● Complex set of VPC rules
● Hundreds of users using GCP
● Unlocked multiple use cases on GCP
The Copy cluster transfers data from on-premises to GCS; it runs only YARN for GCS transfers and holds no local data.
Security: only a minimal set of in-DC hosts connect to GCS.
Networking: dedicated high bandwidth; requires separate, dedicated configuration for routing to public endpoints.
Each worker node has two IP addresses: our DC address space is private (RFC 1918) and can't be used on the public Internet, so GCS traffic uses the public IP while internal traffic (reading from the source cluster, observability, Puppet, etc.) uses the internal IP.
Data is identified by a dataset name.
HDFS is the primary storage for analytics.
Users configure replication rules for different clusters.
Each dataset also has retention rules defined per cluster.
Datasets are always represented as fixed-interval partitions (hourly/daily).
Datasets are defined in a system called the Data Access Layer (DAL).
Data is made available at different destinations using the Replicator (a rough sketch of one pass follows below):
- A long-running daemon (on Mesos)
- The daemon checks the configuration and schedules a copy for each hourly partition
- Copy jobs are executed as Hadoop distcp jobs
- Jobs run on the destination cluster
- After each hourly copy, the partition is published to DAL
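As a rough illustration of the flow described above (not the actual Replicator implementation; the DAL client, path layout, and class names are assumptions), one hourly replication pass might look like this:

import java.io.IOException;

// Illustrative sketch of one replication pass: for a given dataset and hourly
// partition, run a Hadoop distcp from the source cluster to the destination,
// then publish the new partition to DAL. Names and paths are hypothetical.
public class ReplicatorPassSketch {

    // Hypothetical stand-in for Twitter's DAL client.
    interface DalClient {
        void publishPartition(String dataset, String partition);
    }

    void replicateHour(String dataset, String partition, DalClient dal)
            throws IOException, InterruptedException {
        String src  = "hdfs://source-cluster/data/" + dataset + "/" + partition;
        String dest = "hdfs://dest-cluster/data/" + dataset + "/" + partition;

        // Copy jobs are executed as Hadoop distcp jobs on the destination cluster.
        Process distcp = new ProcessBuilder("hadoop", "distcp", src, dest)
                .inheritIO()
                .start();
        if (distcp.waitFor() != 0) {
            throw new IOException("distcp failed for " + dataset + "/" + partition);
        }

        // After the hourly copy, publish the partition to DAL so consumers see it.
        dal.publishPartition(dataset, partition);
    }
}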
Some datasets are collected across multiple DataCenters:
- The Replicator kicks off multiple distcp jobs that copy to a tmp location
- The Replicator then merges the dataset into a single directory and does an atomic rename to the final destination
- Renames on HDFS are cheap and atomic, which makes this operation easy (see the sketch below)
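A minimal sketch of the publish step using the standard Hadoop FileSystem API; the paths and the assumption that all per-DC copies have already landed under the tmp directory are illustrative, not the actual Replicator code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

// Sketch: partial copies from each DC land under a tmp directory; once all of
// them are complete, a single HDFS rename atomically exposes the merged
// partition at its final path. Paths here are hypothetical.
public class AtomicPublishSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path tmp = new Path("/tmp/replicator/dataset/2019/04/10/03");
        Path finalDest = new Path("/data/dataset/2019/04/10/03");

        // HDFS rename is a cheap metadata operation and is atomic, so readers
        // either see the complete partition or nothing at all.
        if (!fs.rename(tmp, finalDest)) {
            throw new IOException("atomic rename failed: " + tmp + " -> " + finalDest);
        }
    }
}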
Use the same Replicator code to sync data to GCS:
- Utilize the ViewFileSystem abstraction to hide GCS
- /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03
- Use the Google Cloud Storage Connector for Hadoop to interact with GCS through Hadoop APIs
- Distcp jobs run on a dedicated Copy cluster
- Create a ViewFileSystem mount point on the Copy cluster so the GCS destination looks like a regular path (see the sketch below)
- Distcp tasks stream data from source HDFS to GCS (no local copy)
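A minimal sketch of what such a mapping could look like with the stock Hadoop ViewFS mount table and the GCS connector; the mount table name, bucket, and key file path are assumptions, and the connector property names should be checked against the connector version in use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: expose gs://dataset.bucket behind /gcs/dataset on a ViewFS-backed
// cluster, so the same replication code can write to GCS through Hadoop APIs.
// Mount table name, bucket name, and credentials path are hypothetical.
public class GcsViewFsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // ViewFS as the default filesystem, with one mount link for the dataset.
        conf.set("fs.defaultFS", "viewfs://copy-cluster/");
        conf.set("fs.viewfs.mounttable.copy-cluster.link./gcs/dataset",
                 "gs://dataset.bucket/");

        // Google Cloud Storage Connector for Hadoop (gs:// scheme).
        conf.set("fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
        conf.set("google.cloud.auth.service.account.json.keyfile",
                 "/var/lib/keys/shadow-account.json");

        // /gcs/dataset/2019/04/10/03 now resolves to
        // gs://dataset.bucket/2019/04/10/03 through the mount table.
        Path partition = new Path("/gcs/dataset/2019/04/10/03");
        FileSystem fs = partition.getFileSystem(conf);
        System.out.println("exists: " + fs.exists(partition));
    }
}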
Data for the same dataset is aggregated at multiple DataCenters (DC x and DC y):
- Replicators in each DC schedule individual distcp jobs
- Data from multiple DCs ends up under the same path on GCS
UI support via EagleEye to view all replication configurations
- Properties associated with a configuration: src, dest, owner, email, etc.
CLI support to manage replication configurations
- Load new or modify existing configurations
- List all configurations
- Mark configurations active/inactive
API support for clients and replicators
- Rich set of API access for all of the above operations (an illustrative configuration record is sketched below)
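As a hedged illustration of the kind of record those UI/CLI/API surfaces manage (the field names and operations are assumptions, not the actual DAL/EagleEye schema):

import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a replication configuration entry and the basic
// operations a CLI/API might expose over it (load/modify, list, activate).
public class ReplicationConfigSketch {

    record ReplicationConfig(String dataset, String src, String dest,
                             String owner, String email, boolean active) {}

    private final List<ReplicationConfig> store = new ArrayList<>();

    // Load a new configuration or replace an existing one for the same
    // dataset/destination pair.
    void loadOrModify(ReplicationConfig config) {
        store.removeIf(c -> c.dataset().equals(config.dataset())
                         && c.dest().equals(config.dest()));
        store.add(config);
    }

    // List all configurations.
    List<ReplicationConfig> listAll() {
        return List.copyOf(store);
    }

    // Mark a configuration active/inactive.
    void setActive(String dataset, String dest, boolean active) {
        List<ReplicationConfig> updated = new ArrayList<>();
        for (ReplicationConfig c : store) {
            if (c.dataset().equals(dataset) && c.dest().equals(dest)) {
                updated.add(new ReplicationConfig(c.dataset(), c.src(), c.dest(),
                                                  c.owner(), c.email(), active));
            } else {
                updated.add(c);
            }
        }
        store.clear();
        store.addAll(updated);
    }
}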
GCP projects are based on the organization structure:
- Deploy a separate Replicator, with its own credentials, per project
- Shared Copy cluster per DataCenter
- Enables independent updates and reduces the risk of errors
Logs vs user path resolution
- Projects and buckets follow a standard naming convention
  - Logs at gs://logs.<category name>.twttr.net/
  - User data at gs://user.<user name>.twttr.net/
- Access to these buckets is via standard paths
  - Logs at /gcs/logs/<category name>/
  - User data at /gcs/user/<user name>/
- Typically this requires a mapping of each path prefix to a bucket name in the Hadoop ViewFileSystem mount table (mounttable.xml)
- We modified ViewFileSystem to create the mount mapping dynamically on demand, since bucket names and path names are standard (see the sketch below)
- No configuration or updates needed
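A minimal sketch of the naming-convention-based resolution; the resolver class and helper are hypothetical (this is not the actual modified ViewFileSystem code), while the path layout and bucket naming mirror the examples above:

import java.net.URI;

// Sketch: because bucket names and /gcs paths follow a standard convention,
// a mount link can be derived from the path itself instead of being listed in
// a static mount table.
public class DynamicGcsMountResolver {

    // /gcs/logs/<category>/rest... -> gs://logs.<category>.twttr.net/rest...
    // /gcs/user/<user>/rest...     -> gs://user.<user>.twttr.net/rest...
    static URI resolve(String viewFsPath) {
        String[] parts = viewFsPath.split("/", 5); // "", "gcs", kind, name, rest
        if (parts.length < 4 || !"gcs".equals(parts[1])) {
            throw new IllegalArgumentException("not a /gcs path: " + viewFsPath);
        }
        String kind = parts[2];                       // "logs" or "user"
        String name = parts[3];                       // category or user name
        String rest = parts.length == 5 ? parts[4] : "";
        return URI.create("gs://" + kind + "." + name + ".twttr.net/" + rest);
    }

    public static void main(String[] args) {
        System.out.println(resolve("/gcs/logs/ads/2019/04/10/03"));
        System.out.println(resolve("/gcs/user/jdoe/scratch/output"));
    }
}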