3. Agenda
● Describe Twitter's Data Replicator architecture
● Present our solution to extend it to Google Cloud Storage while maintaining a consistent interface for users
● Tweet questions to #GoogleNext19Twitter
4. Twitter DataCenter
Data Infrastructure for Analytics
[Diagram: Log Pipeline and Micro Services feed a Real Time Cluster, Production Cluster, Ad hoc Cluster, Cold Storage, and Streaming systems]
● Data: generate > 1.5 trillion events every day
● Incoming storage: produces > 4 PB per day
● Production jobs: process hundreds of PB per day
● Ad hoc queries: execute tens of thousands of jobs per day
● Cold/Backup: hundreds of PBs of data
5. Data Infrastructure for Analytics
[Diagram: a Data Access Layer sits above multiple Hadoop clusters, each with its own Replication Service and Retention Service]
6. Data Access Layer
● A dataset has a logical name and one or more physical locations
● Users and tools such as Scalding, Presto, and Hive query DAL for available hourly partitions
● Datasets have hourly/daily partitions in DAL
● DAL also stores properties such as owner, schema, and location with each dataset
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
11. Need for Replication
[Diagram: Hadoop clusters M, N, C, Z, L, X-1, and X-2 spread across DataCenter 1 and DataCenter 2]
● Thousands of datasets configured for replication
● Across tens of different clusters
● Data kept in sync hourly/daily/snapshot
● Fault tolerant
12. Data Replicator
● One Replicator per destination
● 1 : 1 copy from a single source to a destination
● N : 1 copy + merge from multiple sources to a destination
● Publish to DAL upon completion
[Diagram: Copy: Source Cluster → Replicator → Destination Cluster. Copy + Merge: multiple Source Clusters → Replicator → Destination Cluster]
13. Replication setup
Dataset : partly-cloudy
Src Cluster : ClusterX
Src path : /logs/partly-cloudy
Dest Cluster : ClusterY
Dest path : /logs/partly-cloudy
Copy Since : 3 days
Owner : hadoop-team
[Diagram: the Replicator publishes both physical locations of dataset partly-cloudy to the Data Access Layer]
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
16. Extending Replication to GCS
[Diagram: Hadoop clusters in DataCenter 1 and DataCenter 2 replicating into Cloud Storage]
● Same dataset available on GCS for users
● Unlocks Presto on GCP, Hadoop on GCP, BigQuery, and other tools
17. Extending Replication to GCS
[Diagram: a Hadoop cluster in DataCenter 1 replicates to Cloud Storage, which feeds BigQuery and GCE VMs]
● Same dataset available on GCS for users
● Unlocks Presto on GCP, Hadoop on GCP, BigQuery, and other tools
18. View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy
19. View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy, mounted at connector path /logs/partly-cloudy
22. Twitter DataCenter : Network setup for copy
[Diagram: the GCS Replicator drives Distcp jobs on a Copy Cluster, which write to /gcs/logs/partly-cloudy/2019/04/10/03 on GCS; the Replicator reaches GCP through a proxy group, while data flows over Twitter & Google private peering (PNI)]
23. Merge same dataset on GCS (Multi Region Bucket)
[Diagram: Source ClusterX-1 in Twitter DataCenter X-1 (/ClusterX-1/logs/partly-cloudy/2019/04/10/03) and Source ClusterX-2 in Twitter DataCenter X-2 (/ClusterX-2/logs/partly-cloudy/2019/04/10/03) each run Distcp on their local Copy Cluster, writing /gcs/logs/partly-cloudy/2019/04/10/03 into a single Multi Region Bucket on Cloud Storage]
24. Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of data independently
● Each creates an individual _SUCCESS_<SRC> file
● Updates DAL when all _SUCCESS_<SRC> files are found
● Updates are idempotent
[Flow per Replicator: compare src and dest → if already copied, done; if a copy is needed, kick off a distcp job → on success, check for ALL success files → if all are present, update DAL; otherwise let another instance update DAL]
Each Replicator updates the partition independently, as sketched below.
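A minimal sketch of the marker check, assuming hypothetical source names and a placeholder publishToDal helper (the real source list comes from the replication config, and the DAL client is internal):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessMarkerSketch {
  // Hypothetical source names; the real list comes from the replication config.
  static final String[] SOURCES = {"ClusterX-1", "ClusterX-2"};

  // Publish the partition to DAL only once every source's marker has landed.
  static void maybePublish(FileSystem fs, Path partition) throws IOException {
    for (String src : SOURCES) {
      // Each replicator drops its own marker, e.g. _SUCCESS_ClusterX-1.
      if (!fs.exists(new Path(partition, "_SUCCESS_" + src))) {
        return; // Not all sources are done; another instance will publish.
      }
    }
    // Placeholder for the internal DAL client call; it must be idempotent,
    // since several replicators may reach this point for the same partition.
    publishToDal(partition);
  }

  static void publishToDal(Path partition) {}
}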
26. Dataset via EagleEye
● View different destinations for the same dataset
● GCS is just another destination
● Also shows the delay for each hourly partition
27. Query dataset
Find dataset by logical name:
$dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |
List all physical locations:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | gcs | gcs:///logs/partly-cloudy/yyyy/mm/dd/hh |
28. Query partitions of dataset
All partitions for dataset on GCS:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy --location-name gcs
2019-04-01T11:00:00Z 2019-04-01T12:00:00Z gcs:///logs/partly-cloudy/2019/04/01/11 HadoopLzop
2019-04-01T12:00:00Z 2019-04-01T13:00:00Z gcs:///logs/partly-cloudy/2019/04/01/12 HadoopLzop
2019-04-01T13:00:00Z 2019-04-01T14:00:00Z gcs:///logs/partly-cloudy/2019/04/01/13 HadoopLzop
2019-04-01T14:00:00Z 2019-04-01T15:00:00Z gcs:///logs/partly-cloudy/2019/04/01/14 HadoopLzop
2019-04-01T15:00:00Z 2019-04-01T16:00:00Z gcs:///logs/partly-cloudy/2019/04/01/15 HadoopLzop
2019-04-01T16:00:00Z 2019-04-01T17:00:00Z gcs:///logs/partly-cloudy/2019/04/01/16 HadoopLzop
29. Monitoring
● Rich monitoring for Replicators and replicator configs
● Uniform monitoring dashboard for onprem and cloud replicators
● Read/write bytes per destination
● Latency per destination
30. Alerting
● Fine-tuned alert configs per metric per replicator
● Pages the on-call for critical issues
● Uniform alert dashboard and config for onprem and cloud replicators
31. Replicators per project
[Diagram: Replicator X, Replicator Y, and Replicator Z each run Distcp jobs on a shared Copy Cluster in the Twitter DataCenter, writing /gcs/dataX/2019/04/10/03, /gcs/dataY/2019/04/10/03, and /gcs/dataZ/2019/04/10/03 to Cloud Storage in GCP Project X, GCP Project Y, and GCP Project Z respectively]
33. Where are we today
● Tens of instances of GCS Replicators
● Copied tens of petabytes of data
● Hundreds of thousands of copy jobs
● Unlocked multiple use cases on GCP
35. Google Storage Hadoop connector
● Checksum mismatch between Hadoop FileSystem and Google Cloud Storage
○ Composite checksum HDFS-13056
○ More details in blog post*
● Proxy configuration as path
● Per user credentials
● Lazy initialization to support View FileSystem
* https://cloud.google.com/blog/products/storage-data-transfer/new-file-checksum-feature-lets-you-validate-data-transfers-between-hdfs-and-cloud-storage
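As a sketch of how the two sides can be aligned (property names per HDFS-13056 and the linked blog post; exact values depend on the Hadoop and connector versions):

import org.apache.hadoop.conf.Configuration;

public class ChecksumConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // HDFS side: block-layout-independent composite CRC (HDFS-13056).
    conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");
    // GCS connector side: expose a comparable CRC32C-backed file checksum.
    conf.set("fs.gs.checksum.type", "CRC32C");
    // With both set, distcp -update can compare checksums across HDFS and GCS.
  }
}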
36. Performance and Consistency
● Performance optimizations uncovered while evaluating Presto on GCP
● Cooperative locking in Google Connector for atomic renames
○ https://github.com/GoogleCloudPlatform/bigdata-interop/tree/cooperative_locking
● Same version of connector (onprem and open source)
37. Summary
We described Twitter's Data Replicator architecture and presented our solution to extend it to Google Cloud Storage while maintaining a consistent interface for users.
40. Your Feedback is Greatly Appreciated!
Complete the session survey in the mobile app: 1-5 star rating system, open field for comments, Rate icon in the status bar.
Twitter’s Data Replicator for GCS at GoogleNext 2019. Lohit VijayaRenu, Twitter
Data is identified by a dataset name
HDFS is the primary storage for Analytics
Users configure replication rules for different clusters
Datasets also have retention rules defined per cluster
Datasets are always represented as fixed-interval partitions (hourly/daily)
Datasets are defined in a system called the Data Access Layer (DAL)*
Data is made available at different destinations using the Replicator
All systems rely on global filesystem paths
/cluster1/dataset-1/2019/04/10/03
/cluster3/user/larry/dataset-5/2019/04/10/03
Built on Hadoop ViewFileSystem*
Each path prefix is mapped to a specific cluster's configuration
Makes it very easy to discover data from a location
The Replicator uses this to resolve paths across clusters
Different FileSystem implementations can be hidden behind ViewFileSystem
E.g. /gcs/user/dataset-9/2019/04/10/03 can map to gs://user-dataset-1-bucket.twttr.net/2019/04/10/03
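A minimal sketch of such a mount with stock ViewFileSystem (Twitter's dynamic variant is described later); the mount-table name "copycluster" is illustrative, and the GCS connector must be on the classpath for the gs:// scheme:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Map the /gcs/user/dataset-9 prefix to its GCS bucket (names illustrative).
    conf.set("fs.viewfs.mounttable.copycluster.link./gcs/user/dataset-9",
        "gs://user-dataset-1-bucket.twttr.net/");
    FileSystem viewFs = FileSystem.get(URI.create("viewfs://copycluster/"), conf);
    // Callers keep using global paths; ViewFileSystem resolves the bucket.
    for (FileStatus s : viewFs.listStatus(new Path("/gcs/user/dataset-9/2019/04/10/03"))) {
      System.out.println(s.getPath());
    }
  }
}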
Run one Replicator per destination cluster
Always a pull model
Fault tolerant
1:1 or N:1 setup
Upon copy, publish to DAL
Users set up one replication entry per dataset, with properties:
Source and destination clusters
Copy since X days (optionally copy until Y days)
Owner, team, contact email
Copy job configuration
Different ways to specify configuration: yml, DAL, configdb
Configure contact email for alerts
Fault-tolerant copy to keep data in sync
Long-running daemon (on Mesos)
The daemon checks the configuration and schedules a copy per hourly partition
Copy jobs are executed as Hadoop distcp jobs
Jobs run on the destination cluster (see the sketch below)
After the hourly copy, publish the partition to DAL
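A condensed sketch of one iteration of that loop, launching the copy through Hadoop's programmatic DistCp tool; the paths and the alreadyCopied helper are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

public class HourlyCopySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String src = "viewfs://clusters/ClusterX/logs/partly-cloudy/2019/04/10/03";
    String dst = "viewfs://clusters/ClusterY/logs/partly-cloudy/2019/04/10/03";
    // Skip partitions that are already in sync (comparison logic elided).
    if (!alreadyCopied(new Path(src), new Path(dst))) {
      // Launch the copy as a plain distcp MapReduce job.
      int rc = ToolRunner.run(new DistCp(conf, null), new String[] {src, dst});
      if (rc == 0) {
        // Publish the partition to DAL here (internal client, not shown).
      }
    }
  }

  // Hypothetical helper: compare src and dest to decide if a copy is needed.
  static boolean alreadyCopied(Path src, Path dst) {
    return false;
  }
}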
Some datasets are collected across multiple DataCenters
The Replicator kicks off multiple DistCp jobs to copy to a tmp location
The Replicator then merges the dataset into a single directory and does an atomic rename to the final destination
Renames on HDFS are cheap and atomic, which makes this operation easy (sketched below)
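The copy-to-tmp-then-rename pattern with the FileSystem API, a minimal sketch with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicPublishSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Each source's distcp lands its files under a temp directory first.
    Path tmp = new Path("/logs/.tmp/partly-cloudy/2019/04/10/03");
    Path fin = new Path("/logs/partly-cloudy/2019/04/10/03");
    // The merge happens under tmp; a single atomic rename then exposes the
    // partition, so readers never observe a half-copied hour.
    if (!fs.rename(tmp, fin)) {
      throw new RuntimeException("rename failed; partition left in tmp");
    }
  }
}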
Use the same Replicator code to sync data to GCS
Utilize the ViewFileSystem abstraction to hide GCS
/gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03
Use the Google Hadoop Connector to interact with GCS using Hadoop APIs
Distcp jobs run on a dedicated Copy cluster
Create a ViewFileSystem mount point on the Copy cluster to fake the GCS destination
Distcp tasks stream data from source HDFS to GCS (no local copy)
The Replicator daemon uses a proxy, while the actual data flows directly from Twitter to GCP
PNI is set up between Twitter and Google
Data for the same dataset is aggregated at multiple DataCenters (DC x and DC y)
Replicators in each DC schedule individual DistCp jobs
Data from multiple DCs ends up under the same path on GCS
UI support via EagleEye to view all replication configurations
Properties associated with a configuration: src, dest, owner, email, etc.
CLI support to manage replication configurations
Load new or modify existing configurations
List all configurations
Mark configurations active/inactive
API support for clients and replicators
Rich API access for all of the above operations
Command line tools
The dal command line tool looks up datasets, destinations, and available partitions
API access to DAL
Scalding/Presto query DAL to check partitions for a time range (a hypothetical lookup is sketched below)
Jobs also link to a scheduler which can kick off jobs based on new partitions
UI access
EagleEye is the UI to view details about datasets and their available partitions
It can also show the delay per hourly partition
Uniform access on prem or cloud
The interface to dataset properties is the same on prem or in the cloud
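DAL's client API is internal and not public; purely as an illustration, a time-range partition lookup might look like this sketch, where every name is invented:

import java.time.Instant;
import java.util.List;

// Entirely hypothetical DAL client interface, for illustration only.
interface DalClient {
  List<String> physicalPartitions(String logicalName, String location,
      Instant start, Instant end);
}

class DalLookupSketch {
  static void example(DalClient dal) {
    // Mirrors the CLI: partitions of logs.partly-cloudy on gcs for a range.
    List<String> paths = dal.physicalPartitions("logs.partly-cloudy", "gcs",
        Instant.parse("2019-04-01T11:00:00Z"), Instant.parse("2019-04-01T17:00:00Z"));
    paths.forEach(System.out::println);
  }
}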
GCP projects are based on organization
Deploy a separate Replicator with its own credentials per project
Shared copy cluster per DataCenter
Enables independent updates and reduces the risk of errors
Logs vs user path resolution
Projects and buckets have a standard naming convention
Logs at: gs://logs.<category name>.twttr.net/
User data at: gs://user.<user name>.twttr.net/
Access to these buckets is via standard paths
Logs at /gcs/logs/<category name>/
User data at /gcs/user/<user name>/
Typically we would need a mapping of path prefix to bucket name in Hadoop ViewFileSystem's mountable.xml
We modified ViewFileSystem to dynamically create the mount mapping on demand, since bucket names and path names are standard (see the sketch below)
No configuration or update needed
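Because buckets and paths follow the convention, the mount target can be derived from the path itself. A toy sketch of that derivation (the real change lives inside Twitter's modified ViewFileSystem):

public class BucketResolverSketch {
  // Toy sketch: derive the GCS bucket from a conventionally named /gcs path,
  // instead of listing every prefix in ViewFileSystem's mount table.
  static String resolve(String path) {
    // e.g. ["", "gcs", "logs", "partly-cloudy", "2019/04/10/03"]
    String[] parts = path.split("/", 5);
    String bucket = parts[2] + "." + parts[3] + ".twttr.net";
    String rest = parts.length > 4 ? parts[4] : "";
    return "gs://" + bucket + "/" + rest;
  }

  public static void main(String[] args) {
    // /gcs/logs/partly-cloudy/... -> gs://logs.partly-cloudy.twttr.net/...
    System.out.println(resolve("/gcs/logs/partly-cloudy/2019/04/10/03"));
    // /gcs/user/larry/... -> gs://user.larry.twttr.net/...
    System.out.println(resolve("/gcs/user/larry/my_dataset"));
  }
}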
Google Cloud Storage connector to access GCS
Existing applications using Hadoop FileSystem APIs continue to work
Existing tools continue to work against GCS
For the most part, users do not know there is a separate tool/API to access GCS
Users use commands such as:
hadoop fs -ls /gcs/user/larry/my_dataset/2019/01/04
hadoop fs -du -s -h /gcs/user/larry/my_dataset
hadoop fs -get /gcs/user/larry/my_dataset/2019/01/04/file.txt ./file.txt
The Hadoop Cloud Storage connector is installed along with the Hadoop client on jump hosts and Hadoop nodes
Applications can also package the connector jar themselves