DA300: How Twitter Replicates Petabytes of Data to Google Cloud Storage
Lohit VijayaRenu, Twitter
@lohitvijayarenu
Agenda
Describe Twitter’s Data Replicator architecture, present our solution for extending it to Google Cloud Storage, and show how we maintain a consistent interface for users.
Tweet questions: #GoogleNext19Twitter
Twitter DataCenter: Data Infrastructure for Analytics
Log pipeline, microservices, and streaming systems feed a real time cluster, a production cluster, an ad hoc cluster, and cold storage:
● Data : > 1.5 trillion events generated every day
● Incoming storage : > 4 PB produced per day
● Production jobs : hundreds of PB processed per day
● Ad hoc queries : tens of thousands of jobs executed per day
● Cold/backup : hundreds of PBs of data
Data Infrastructure for Analytics
Each Hadoop cluster runs its own Replication Service and Retention Service, and all clusters sit behind a shared Data Access Layer.
Data Access Layer
● A dataset has a logical name and one or more physical locations
● Users and tools such as Scalding, Presto, and Hive query DAL for available hourly partitions
● A dataset has hourly/daily partitions in DAL
● DAL also stores properties such as owner, schema, and location with each dataset
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
FileSystem abstraction
Path on HDFS cluster : hdfs://cluster-X-nn:8020/logs/partly-cloudy
Path on Federated HDFS cluster : viewfs://cluster-X/logs/partly-cloudy
Path on Twitter’s HDFS Clusters* : /DataCenter-1/cluster-X/logs/partly-cloudy
Twitter’s View FileSystem spans every namespace of Cluster-X, Cluster-Y, and ClusterZ across DataCenter-1 and DataCenter-2, and the Replicator resolves paths across clusters through it.
* https://blog.twitter.com/engineering/en_us/a/2015/hadoop-filesystem-at-twitter.html
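A sketch of how such global paths can be wired up: a Hadoop ViewFs mount table maps each path prefix to a cluster. The mount-table name and namenode addresses below are illustrative, not Twitter’s actual configuration.

<!-- Hypothetical core-site.xml entries; table name and addresses are placeholders. -->
<property>
  <!-- Map the global prefix /DataCenter-1/cluster-X to that cluster's namenode. -->
  <name>fs.viewfs.mounttable.global.link./DataCenter-1/cluster-X</name>
  <value>hdfs://cluster-X-nn:8020</value>
</property>
<property>
  <!-- Each additional cluster mounts under its own prefix in the same table. -->
  <name>fs.viewfs.mounttable.global.link./DataCenter-2/cluster-Y</name>
  <value>hdfs://cluster-Y-nn:8020</value>
</property>

With this in place, /DataCenter-1/cluster-X/logs/partly-cloudy resolves to hdfs://cluster-X-nn:8020/logs/partly-cloudy without the client knowing which namenode serves it.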
Need for Replication
The Replicator keeps Hadoop clusters (ClusterM, ClusterN, ClusterC, ClusterL, ClusterZ, ClusterX-1, ClusterX-2) in sync across DataCenter 1 and DataCenter 2:
● Thousands of datasets configured for replication
● Across tens of different clusters
● Data kept in sync hourly/daily/snapshot
● Fault tolerant
Data Replicator
● One Replicator per destination
● Copy (1 : 1) : copy from a single source cluster to the destination cluster
● Copy + Merge (N : 1) : copy from multiple source clusters and merge at the destination cluster
● Publish to DAL upon completion
Replication setup
Users register a replication entry with the Replicator:

Dataset : partly-cloudy
Src Cluster : ClusterX
Src path : /logs/partly-cloudy
Dest Cluster : ClusterY
Dest path : /logs/partly-cloudy
Copy Since : 3 days
Owner : hadoop-team

The Data Access Layer then lists both physical locations for the dataset:

Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Data Replicator Copy
Replicator : ClusterY runs a Distcp job that copies partition 2019/04/10/03 from the source cluster path /ClusterX/logs/partly-cloudy/ to the destination cluster path /ClusterY/logs/partly-cloudy/, then publishes the partition to DAL:

Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
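Conceptually, each hourly copy reduces to a DistCp invocation over the global ViewFS paths; a minimal sketch, leaving out whatever job options Twitter actually tunes:

# Copy one hourly partition between clusters; ViewFS resolves both prefixes.
hadoop distcp \
  /ClusterX/logs/partly-cloudy/2019/04/10/03 \
  /ClusterY/logs/partly-cloudy/2019/04/10/03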
Data Replicator Copy + Merge
For a multiple-source dataset, Replicator : ClusterY runs one Distcp job per source, copying partition 2019/04/10/03 from both /ClusterX-1/logs/partly-cloudy/ and /ClusterX-2/logs/partly-cloudy/, merges the results, and publishes /ClusterY/logs/partly-cloudy/2019/04/10/03 to DAL (see the sketch after this block):

Dataset : partly-cloudy
Type : Multiple Src
/ClusterX-1/logs/partly-cloudy
/ClusterX-2/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
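The editor’s notes describe the merge as DistCp copies into a tmp location followed by an atomic rename, which HDFS makes cheap; a hedged sketch with an illustrative tmp layout:

# Copy each source partition into its own tmp directory (layout is illustrative).
hadoop distcp /ClusterX-1/logs/partly-cloudy/2019/04/10/03 /ClusterY/tmp/partly-cloudy-X-1/2019/04/10/03
hadoop distcp /ClusterX-2/logs/partly-cloudy/2019/04/10/03 /ClusterY/tmp/partly-cloudy-X-2/2019/04/10/03
# Merge both copies under one tmp partition.
hadoop fs -mkdir -p /ClusterY/tmp/partly-cloudy-merged/2019/04/10/03
hadoop fs -mv '/ClusterY/tmp/partly-cloudy-X-1/2019/04/10/03/*' '/ClusterY/tmp/partly-cloudy-X-2/2019/04/10/03/*' /ClusterY/tmp/partly-cloudy-merged/2019/04/10/03
# Atomically rename the merged partition into its final destination.
hadoop fs -mv /ClusterY/tmp/partly-cloudy-merged/2019/04/10/03 /ClusterY/logs/partly-cloudy/2019/04/10/03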
Extending Replication to GCS
The Hadoop clusters across DataCenter 1 and DataCenter 2 now also replicate into Cloud Storage:
● Same dataset available on GCS for users
● Unlocks Presto on GCP, Hadoop on GCP, BigQuery, and other tools
On the GCP side, Hadoop clusters, BigQuery, and GCE VMs all read the replicated data from Cloud Storage.
View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
The Cloud Storage Connector is mounted into Twitter’s View FileSystem alongside the on-prem namespaces (Cluster-X, Cluster-Y, and ClusterZ across DataCenter-1 and DataCenter-2), so the Replicator and users address GCS through the same /gcs path scheme.
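Because the connector implements the Hadoop FileSystem API, ordinary filesystem commands work unchanged against /gcs paths. The editor’s notes give examples along these lines (user and dataset names are illustrative):

hadoop fs -ls /gcs/user/larry/my_dataset/2019/01/04
hadoop fs -du -s -h /gcs/user/larry/my_dataset
hadoop fs -get /gcs/user/larry/my_dataset/2019/01/04/file.txt ./file.txt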
Architecture behind GCS replication
Replicator : GCS schedules a Distcp job on a dedicated copy cluster inside the Twitter DataCenter. The job reads partition 2019/04/10/03 from the source cluster path /ClusterX/logs/partly-cloudy/ and writes it to GCS as /gcs/logs/partly-cloudy/2019/04/10/03, then publishes to DAL:

Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/gcs/logs/partly-cloudy
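Thanks to the ViewFS mount, the GCS copy is the same kind of DistCp invocation as an on-prem copy; only the destination prefix changes (a sketch, with job options omitted):

# ViewFS resolves /gcs/logs/partly-cloudy to gs://logs.partly-cloudy.
hadoop distcp \
  /ClusterX/logs/partly-cloudy/2019/04/10/03 \
  /gcs/logs/partly-cloudy/2019/04/10/03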
Network setup for copy
The Replicator : GCS daemon reaches Google through a proxy group, while the Distcp tasks on the copy cluster stream data directly to GCS (writing /gcs/logs/partly-cloudy/2019/04/10/03) over Twitter & Google private peering (PNI).
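The Cloud Storage connector can route its API traffic through such a proxy; a hedged core-site.xml sketch using the connector’s proxy setting, with a placeholder address (Twitter’s actual setup may differ):

<property>
  <!-- Placeholder proxy address; connector control traffic goes via the proxy group. -->
  <name>fs.gs.proxy.address</name>
  <value>proxy.example.twttr.net:3128</value>
</property>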
Merge same dataset on GCS (Multi Region Bucket)
The same dataset is aggregated in Twitter DataCenter X-1 and Twitter DataCenter X-2. Each data center’s copy cluster runs its own Distcp job, reading /ClusterX-1/logs/partly-cloudy/2019/04/10/03 and /ClusterX-2/logs/partly-cloudy/2019/04/10/03 from its local source cluster and writing /gcs/logs/partly-cloudy/2019/04/10/03 into a single multi-region bucket on Cloud Storage.
Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of data independently
● Each creates its own individual _SUCCESS_<SRC> file
● DAL is updated when all _SUCCESS_<SRC> files are found
● Updates are idempotent
Each Replicator updates the partition independently, following the same loop: compare src and dest; if a copy is needed, kick off the distcp job (retrying on failure); on success, check for all success files; if all are present, update DAL, otherwise let another instance update DAL. A sketch of this marker protocol follows.
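A minimal shell sketch of the protocol for one partition, assuming two sources; the dal publish subcommand is hypothetical, since the deck does not show the real CLI verb for publishing:

PART=/gcs/logs/partly-cloudy/2019/04/10/03
# After its own DistCp succeeds, each Replicator drops a per-source marker.
hadoop fs -touchz "$PART/_SUCCESS_ClusterX-1"
# Update DAL only once markers from every source exist; re-running is harmless
# because DAL updates are idempotent.
if hadoop fs -test -e "$PART/_SUCCESS_ClusterX-1" && \
   hadoop fs -test -e "$PART/_SUCCESS_ClusterX-2"; then
  dal physical-dataset publish --role hadoop --name logs.partly-cloudy \
      --location-name gcs --partition 2019/04/10/03   # hypothetical subcommand
fi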
Uniform Access for Users
Dataset via EagleEye
● View the different destinations for the same dataset; GCS is just another destination
● Also shows the delay for each hourly partition
Query dataset
Find a dataset by logical name:
$ dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |

List all physical locations:
$ dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | gcs | gcs:///logs/partly-cloudy/yyyy/mm/dd/hh |
Query partitions of dataset
All partitions for the dataset on GCS:
$ dal physical-dataset list --role hadoop --name logs.partly-cloudy --location-name gcs
2019-04-01T11:00:00Z 2019-04-01T12:00:00Z gcs:///logs/partly-cloudy/2019/04/01/11 HadoopLzop
2019-04-01T12:00:00Z 2019-04-01T13:00:00Z gcs:///logs/partly-cloudy/2019/04/01/12 HadoopLzop
2019-04-01T13:00:00Z 2019-04-01T14:00:00Z gcs:///logs/partly-cloudy/2019/04/01/13 HadoopLzop
2019-04-01T14:00:00Z 2019-04-01T15:00:00Z gcs:///logs/partly-cloudy/2019/04/01/14 HadoopLzop
2019-04-01T15:00:00Z 2019-04-01T16:00:00Z gcs:///logs/partly-cloudy/2019/04/01/15 HadoopLzop
2019-04-01T16:00:00Z 2019-04-01T17:00:00Z gcs:///logs/partly-cloudy/2019/04/01/16 HadoopLzop
Monitoring
● Rich set of monitoring for the Replicator and replicator configs
● Uniform monitoring dashboards for on-prem and cloud replicators
Dashboards track read/write bytes per destination and latency per destination.
Alerting
● Fine-tuned alert configs per metric per replicator
● Pages on-call for critical issues
● Uniform alert dashboard and config for on-prem and cloud replicators
Replicators per project
A separate Replicator, with its own credentials, is deployed per GCP project (Replicator X, Y, and Z for GCP Projects X, Y, and Z), all sharing a copy cluster in the Twitter DataCenter. Each runs its own Distcp jobs, writing /gcs/dataX/2019/04/10/03, /gcs/dataY/2019/04/10/03, and /gcs/dataZ/2019/04/10/03 to its own project’s Cloud Storage.
RegEx based path resolution
Twitter ViewFS mounttable.xml:

<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
  <value>gs://logs.${dataset}</value>
</property>
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
  <value>gs://user.${userName}</value>
</property>

Twitter ViewFS Path                  ->  GCS bucket
/gcs/logs/partly-cloudy/2019/04/10   ->  gs://logs.partly-cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats         ->  gs://user.lohit/hadoop-stats
Where are we today
● Tens of instances of GCS Replicators
● Tens of petabytes of data copied
● Hundreds of thousands of copy jobs
● Multiple use cases unlocked on GCP
Made here together: Twitter + Google
Google Storage Hadoop connector
● Checksum mismatch between Hadoop FileSystem and Google Cloud Storage
○ Composite checksum HDFS-13056
○ More details in blog post*
● Proxy configuration as path
● Per user credentials
● Lazy initialization to support View FileSystem
* https://cloud.google.com/blog/products/storage-data-transfer/new-file-checksum-feature-lets-you-validate-data-transfers-between-hdfs-and-cloud-storage
Performance and Consistency
● Performance optimizations uncovered while evaluating Presto on GCP
● Cooperative locking in the Google connector for atomic renames
○ https://github.com/GoogleCloudPlatform/bigdata-interop/tree/cooperative_locking
● Same version of the connector on-prem and in open source
Summary
We described Twitter’s Data Replicator architecture, presented our solution for extending it to Google Cloud Storage, and showed how we maintain a consistent interface for users.
Acknowledgement
Ran Wang @RanWang18
Zhenzhao Wang @zhen____w
Joseph Boyd @sluicing
Joep Rottinghuis @joep
Hadoop Team @TwitterHadoop
https://cloud.google.com/twitter
Tweet to @TwitterEng
https://careers.twitter.com
Questions
Your Feedback is Greatly Appreciated!
Complete the session survey in the mobile app: 1-5 star rating system, open field for comments, Rate icon in the status bar.
Thank you
Editor's notes
  1. Twitter’s Data Replicator for GCS at GoogleNext 2019. Lohit VijayaRenu, Twitter
  2. Data is identified by a dataset name. HDFS is the primary storage for analytics. Users configure replication rules for different clusters, and a dataset also has retention rules defined per cluster. Datasets are always represented as fixed-interval partitions (hourly/daily). A dataset is defined in a system called the Data Access Layer (DAL)*. Data is made available at different destinations using the Replicator.
  3-6. All systems rely on global filesystem paths, e.g. /cluster1/dataset-1/2019/04/10/03 and /cluster3/user/larry/dataset-5/2019/04/10/03. Built on Hadoop ViewFileSystem*, each path prefix is mapped to a specific cluster's configuration, which makes it very easy to discover data from a location. The Replicator uses this to resolve paths across clusters, and different FileSystem implementations can be hidden behind ViewFileSystem, e.g. /gcs/user/dataset-9/2019/04/10/03 can map to gs://user-dataset-1-bucket.twttr.net/2019/04/10/03.
  7. Run one Replicator per destination cluster. Always a pull model. Fault tolerant. 1:1 or N:1 setup. Upon copy, publish to DAL.
  8. Users set up one replication entry per dataset, with properties: source and destination clusters; copy since X days (optionally copy until Y days); owner, team, contact email; copy job configuration. There are different ways to specify configuration: yml, DAL, configdb. A contact email is configured for alerts. Fault-tolerant copy keeps data in sync.
  9. A long-running daemon (on Mesos) checks configuration and schedules a copy per hourly partition. Copy jobs are executed as Hadoop distcp jobs on the destination cluster. After each hourly copy, the partition is published to DAL.
  10. Some datasets are collected across multiple data centers. The Replicator kicks off multiple DistCp jobs to copy to a tmp location, then merges the dataset into a single directory and does an atomic rename to the final destination. Renames on HDFS are cheap and atomic, which makes this operation easy.
  11. The same Replicator code syncs data to GCS. The ViewFileSystem abstraction hides GCS: /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03. The Google Hadoop Connector is used to interact with GCS through Hadoop APIs. Distcp jobs run on a dedicated copy cluster, with a ViewFileSystem mount point on the copy cluster standing in for the GCS destination. Distcp tasks stream data from the source HDFS to GCS (no local copy).
  12. The Replicator daemon uses a proxy, while the actual data flows directly to GCP from Twitter over the PNI set up between Twitter and Google.
  13. Data for the same dataset is aggregated at multiple data centers (DC x and DC y). Replicators in each DC schedule individual DistCp jobs, and data from multiple DCs ends up under the same path on GCS.
  14. UI support via EagleEye to view all replication configurations and the properties associated with each (src, dest, owner, email, etc.). CLI support to manage replication configurations: load new or modify existing configurations, list all configurations, mark configurations active/inactive. API support for clients and replicators, with rich API access for all of the above operations.
  15. Command line tools: the dal command line looks up datasets, destinations, and available partitions. API access to DAL: Scalding/Presto query DAL to check partitions for a time range, and jobs link to a scheduler which can kick off jobs based on new partitions. UI access: EagleEye shows details about datasets and available partitions, including the delay per hourly partition. Uniform access on-prem or cloud: the interface to dataset properties is the same in both.
  16. GCP projects are based on organization. A separate Replicator with its own credentials is deployed per project, with a shared copy cluster per data center. This enables independent updates and reduces the risk of errors.
  17. Logs vs. user path resolution: projects and buckets have a standard naming convention, with logs at gs://logs.<category name>.twttr.net/ and user data at gs://user.<user name>.twttr.net/. Access to these buckets is via standard paths: logs at /gcs/logs/<category name>/ and user data at /gcs/user/<user name>/. Typically a mapping of path prefix to bucket name would be needed in the Hadoop ViewFileSystem mounttable.xml; we modified ViewFileSystem to create the mounttable mapping dynamically on demand, since bucket names and path names are standard, so no configuration or update is needed.
  18. The Google Cloud Storage connector is used to access GCS. Existing applications using Hadoop FileSystem APIs continue to work, and existing tools continue to work against GCS; for the most part users do not know there is a separate tool/API to access GCS. Users run commands such as hadoop fs -ls /gcs/user/larry/my_dataset/2019/01/04, hadoop fs -du -s -h /gcs/user/larry/my_dataset, and hadoop fs -get /gcs/user/larry/my_dataset/2019/01/04/file.txt ./file.txt. The Hadoop Cloud Storage connector is installed along with the Hadoop client on jump hosts and Hadoop nodes, and applications can also package the connector jar.
  19. Google supports data at petabyte scale, securely, with our best-in-class analytics and machine learning capabilities to inform real-time decisions and coordinate response on the roads.