Tools and approaches for migrating big datasets to the cloud

DW4

Agenda
Cloud migration plan
Circus Train: A tool for dataset replication
Cross-account data sharing
Waggle Dance: A tool for query federation
Apiary: Data sharing pattern

2 PB
1000s of jobs
Hive + HDFS
Many tools

Migration plan
Jobs Datasets Users
Replication

Years
Incremental
Continuous
data replication

Dataset replication problem
Requirements
• Data
• Metadata
• Co-ordinated
• Consistent

Finding a solution
• Considered multiple open source and proprietary solutions, but…
– Minimal/no support for replicating metadata
– Lacking support for data consistency
– Required “root” access to Hive metastore
• Reluctantly created our own to unblock cloud migration

Circus Train
[Diagram: Circus Train replicating from SOURCE (on-prem) to REPLICA (AWS) in five numbered steps, labelled READ META, READ DATA, WRITE SNAPSHOT, TRANSFORM META and NOTIFY]

source-catalog:
  hive-metastore-uris: thrift://on-prem-metastore:9083
replica-catalog:
  hive-metastore-uris: thrift://aws-metastore:9083
table-replications:
  - source-table:
      database-name: etl
      table-name: clickstream
      partition-filter: ldate >= '#{#nowUTC().minusDays(1).toString("yyyy-MM-dd")}'
    replica-table:
      table-location: s3://hcom-main/hive/etl/clickstream/
sns-event-listener:
  topic: arn:aws:sns:us-west-2:circus-train-events

Circus Train: Optimised Copiers
Source file store   Target: HDFS   Target: S3          Target: GCS
HDFS                distcp         HFS → S3 copier     distcp
S3                  distcp         S3 → S3 copier *    distcp
GCS                 distcp         HFS → S3 copier     distcp
GBQ                 -              GBQ → S3 copier     -
* Replication cluster not required
HFS: Hadoop FileSystem implementations
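
The copier matrix on this slide can be read as a lookup keyed on source and target file store. The following is a minimal Python sketch of that reading only; it is not how Circus Train selects copiers internally, and `COPIERS` and `choose_copier` are hypothetical names introduced for illustration.

```python
# Hypothetical lookup mirroring the copier matrix on the slide.
# Keys are (source, target) file stores; None means the slide lists no copier.
COPIERS = {
    ("HDFS", "HDFS"): "distcp",
    ("HDFS", "S3"): "HFS -> S3 copier",
    ("HDFS", "GCS"): "distcp",
    ("S3", "HDFS"): "distcp",
    ("S3", "S3"): "S3 -> S3 copier",  # per the slide: no replication cluster required
    ("S3", "GCS"): "distcp",
    ("GCS", "HDFS"): "distcp",
    ("GCS", "S3"): "HFS -> S3 copier",
    ("GCS", "GCS"): "distcp",
    ("GBQ", "HDFS"): None,
    ("GBQ", "S3"): "GBQ -> S3 copier",
    ("GBQ", "GCS"): None,
}


def choose_copier(source, target):
    """Return the copier the slide lists for a (source, target) pair."""
    key = (source.upper(), target.upper())
    if key not in COPIERS:
        raise ValueError(f"unknown file store combination: {key}")
    copier = COPIERS[key]
    if copier is None:
        raise ValueError(f"no copier listed for {key[0]} -> {key[1]}")
    return copier


print(choose_copier("hdfs", "s3"))
```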

Circus Train: Selective Data Replication
Partition filter expressions (SpEL)
ldate >= '#{#nowUTC().minusDays(1).toString("yyyy-MM-dd")}'
Hive Diff
Detects changes in partition and file metadata
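
To make the partition filter concrete: the SpEL expression evaluates to yesterday's date in UTC, formatted as yyyy-MM-dd, and only partitions whose `ldate` is on or after that date are selected. The Python sketch below reproduces that arithmetic under the assumption that `ldate` values are ISO date strings; `yesterday_utc` and `select_partitions` are hypothetical helper names, not part of Circus Train.

```python
from datetime import datetime, timedelta, timezone


def yesterday_utc(now=None):
    """Equivalent of nowUTC().minusDays(1).toString("yyyy-MM-dd").

    `now` must be a timezone-aware datetime; it defaults to the current time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return (now - timedelta(days=1)).strftime("%Y-%m-%d")


def select_partitions(ldates, now=None):
    """Keep partitions with ldate >= yesterday (UTC).

    ISO-formatted dates sort the same way as strings and as dates,
    so a plain string comparison is sufficient here.
    """
    cutoff = yesterday_utc(now)
    return [ldate for ldate in ldates if ldate >= cutoff]


fixed_now = datetime(2019, 3, 20, 0, 30, tzinfo=timezone.utc)
print(select_partitions(["2019-03-18", "2019-03-19", "2019-03-20"], fixed_now))
```

Run on a schedule, a filter of this shape keeps each replication incremental: only the most recent partitions are considered for copying.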

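Circus Train: Snapshot Isolation
[Diagram: each replication writes a new snapshot folder (s3://…/snapshot_1, /snapshot_2, /snapshot_3); the table Location points at one snapshot, which ETL / query / job workloads read; Housekeeping removes superseded snapshots such as s3://…/snapshot_1]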
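
The snapshot idea on this slide can be sketched in a few lines: every replication writes into a fresh folder, the table location is repointed only once the copy has finished, and housekeeping later deletes folders that are no longer referenced. The Python below is an illustrative sketch on a local temporary directory, not Circus Train's implementation; `ReplicaTable` and its methods are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


class ReplicaTable:
    """Toy model of a replica table whose location points at one snapshot."""

    def __init__(self, base):
        self.base = Path(base)
        self.location = None  # snapshot currently visible to readers
        self._counter = 0

    def replicate(self, files):
        """Write `files` (name -> content) to a new snapshot, then repoint."""
        self._counter += 1
        snapshot = self.base / f"snapshot_{self._counter}"
        snapshot.mkdir(parents=True)
        for name, content in files.items():
            (snapshot / name).write_text(content)
        # Readers only see the new data after the copy has completed.
        self.location = snapshot
        return snapshot

    def housekeeping(self):
        """Delete every snapshot folder that is no longer the table location."""
        removed = []
        for path in sorted(self.base.iterdir()):
            if path.is_dir() and path != self.location:
                shutil.rmtree(path)
                removed.append(path)
        return removed


table = ReplicaTable(tempfile.mkdtemp())
first = table.replicate({"part-0": "old rows"})
second = table.replicate({"part-0": "new rows"})
print(table.location.name, [p.name for p in table.housekeeping()])
```

In a real deployment, housekeeping would presumably run after a delay so that queries still reading an older snapshot can finish first; this sketch deletes immediately for brevity.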

Circus Train: Other features…
Plugin Architecture
Notifications
Pluggable Metrics
Copying modes
Views
Transforms
SSH Tunnelling
Key stores

Circus Train: Now
• Core end user data sets constantly replicated in cloud
• In production for over 2 years
• Well over 1PB of data replicated
• Widely used across Expedia
• Good progress moving analysts and data scientists over to cloud
• Engineering teams now have more time to migrate their jobs
• Contributed to open source: goo.gl/byPXNp

The Cloud
A platform for new problems

Multi-account Hive
HCOM (MAIN)
DATA_SCI
EXPEDIA
→ Data silos

Consolidate accounts and/or infrastructure?
MONOLITH

Replicate between accounts?
HCOM (MAIN)
DATA_SCI
EXPEDIA

Federation: Autonomy + Collaboration
HCOM (MAIN)
DATA_SCI
EXPEDIA

Waggle Dance: Overview
[Diagram: workloads in the DATA_SCI and MAIN accounts (US_WEST_2) talk to Waggle Dance over the Thrift API; Waggle Dance fronts a “primary” metastore and one or more federated metastores]

Metadata silos: User perspective
Main account
hive> show databases;
default
etl
Data Science account
hive> show databases;
default
ml

hive> describe formatted
    > etl.hotel;
# col_name data_type
id int
name string
# Location: s3://hcom-main/etl/hotel/
hive> describe formatted
    > ml.hero_image;
# col_name data_type
id int
hotel_id int
img_name string
# Location: s3://hcom-datasci/ml/hero/

Metadata federation across accounts
Data Science account (with federated HMS)
hive> set hive.metastore.uris=thrift://waggle-dance:48869;
hive> show databases;
default
etl
ml

Hive join across multiple accounts
Data Science account (with federated HMS)
hive> select h.id, h.name, i.img_name
    > from etl.hotel h
    > join ml.hero_image i
    > where h.id = i.hotel_id
    > and h.name like "Estrel%";
h.id h.name i.img_name
2314 Estrel Hotel 0a673kZVrt832.png
Fetched: 1 row(s)
hive>

Our journey to this point
• Solutions:
– Users operating in the cloud
– Methods for migrating datasets
– Ability to share datasets
• Problems:
– Ad hoc deployments of Circus Train and Waggle Dance
– Operationally complex: networking, security

Creating a pattern
Adding respectability to workarounds

[Diagram: an APIARY ACCOUNT containing a DB, an R/W metastore service and an R/O metastore service; labelled interactions are local workloads, cross-account sharing and cross-region replication]

Data Sharing Pattern: Applied
[Diagram: the HCOM_DATA_SCI, HCOM_MAIN and EXPEDIA accounts across the US_WEST_2 and US_EAST_1 regions; links are labelled Federate and Replicate]

Data Sharing Pattern: Retrospective
By embracing the topology of our platform, and the use of our tools, we were
free to consider the opportunities.
                                Then                    Now
Inter-region data replication   Operational burden      Disaster Recovery
Segmented accounts              Inconvenient boundary   Architectural primitive

Your journey
Our problems may not be yours
• Simple solutions can work well
– Hive CTAS (Replication / Migration)
– Monolithic metastore (Data sharing and discovery)
• Platform consistency allows the adoption of standard solutions
– Unified account
– Curated toolsets
– Standard patterns and conventions

https://github.com/HotelsDotCom/circus-train
https://github.com/HotelsDotCom/waggle-dance

Problem: Overloading Database Namespace
• Databases are the only unit of federation
• Every metastore has a default database
• Common database names: etl, lz
• How to combine homonymous databases?
• How to circumvent the problem entirely?

Database Namespace: Manual mode
• Default behaviour
• Federated databases curated by white-list expressions:
[ml,…],[hcom_*,ean_*],[*]
• Works well combined with a global database naming standard:
${brand}_${account}_${dbname} → hcom_datasci_ml
• Limitation: Name overloads
– On starting: Fail fast
– When running: Ignore/mask
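
Manual mode, as described on this slide, amounts to filtering each federated metastore's databases through a white-list and refusing to start when two metastores expose the same name. The Python sketch below illustrates that behaviour only; it assumes glob-style patterns as written on the slide and that every primary database is exposed, and `visible_databases` and `build_mapping` are hypothetical names rather than Waggle Dance APIs.

```python
from fnmatch import fnmatchcase


def visible_databases(databases, whitelist):
    """Keep only the databases matching at least one white-list pattern."""
    return [db for db in databases if any(fnmatchcase(db, p) for p in whitelist)]


def build_mapping(primary, federated, fail_fast=True):
    """Map each exposed database name to the metastore that owns it.

    primary:   (metastore_name, [databases]); assumed to be fully exposed.
    federated: list of (metastore_name, [databases], [whitelist patterns]).
    fail_fast: True models start-up (raise on a name overload);
               False models run time (the earlier mapping masks the later one).
    """
    primary_name, primary_dbs = primary
    mapping = {db: primary_name for db in primary_dbs}
    for name, databases, whitelist in federated:
        for db in visible_databases(databases, whitelist):
            if db in mapping:
                if fail_fast:
                    raise ValueError(
                        f"database '{db}' is exposed by both {mapping[db]} and {name}"
                    )
                continue  # ignore/mask the overloaded name
            mapping[db] = name
    return mapping


primary = ("main", ["default", "etl"])
print(build_mapping(primary, [("datasci", ["default", "ml"], ["ml"])]))
```

With the white-list `["ml"]` the Data Science `default` database is never exposed, so no overload occurs; widening it to `["*"]` would trigger the fail-fast path.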

Database Namespace: Prefixed mode
• Prefix string provided for each federated metastore (not primary)
• Prefix applied to all databases in a given metastore
• Exploration: Provides access to all databases, no overloads
• Limitation: Scripts are not portable across accounts
Metastore   Prefix   Database   Mapped name
primary     (none)   etl        etl
datasci     ds_      etl        ds_etl
analytics   al_      default    al_default
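
The prefix table on this slide can be expressed as a pair of small functions: one that maps a metastore's database to its federated name, and one that routes a federated name back to its owning metastore. This is a hedged Python sketch using the prefixes from the slide; `PREFIXES`, `mapped_name` and `resolve` are hypothetical names and do not reflect Waggle Dance's own code.

```python
# Prefixes taken from the slide; the primary metastore has none.
PREFIXES = {"datasci": "ds_", "analytics": "al_"}


def mapped_name(metastore, database, prefixes=PREFIXES):
    """Name under which a metastore's database appears to federated clients."""
    return prefixes.get(metastore, "") + database


def resolve(mapped, prefixes=PREFIXES, primary="primary"):
    """Route a federated database name to (metastore, local database name).

    Anything without a known prefix is assumed to live in the primary.
    """
    for metastore, prefix in prefixes.items():
        if mapped.startswith(prefix):
            return metastore, mapped[len(prefix):]
    return primary, mapped


print(mapped_name("datasci", "etl"), resolve("al_default"))
```

One consequence of this simple scheme is that a primary database whose name happens to start with a federated prefix would be routed to the wrong metastore, so prefixes need to be chosen with the primary's naming in mind. It also shows why scripts are not portable: `etl` in the Data Science account becomes `ds_etl` everywhere else.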

Data access rules
                Local Account   Cross Account   Local Region   Cross Region
Process Write   ✓               x               ✓              x
Process Read    ✓               ✓               ✓              x
Replicate       x               x               x              ✓
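
The data access rules can be encoded directly as a lookup table, which makes them easy to check in tooling or tests. The Python sketch below is a literal transcription of the table's cells; `RULES` and `is_allowed` are hypothetical names, and the sketch deliberately does not guess how the account and region columns combine.

```python
# One entry per table cell: RULES[operation][scope] -> allowed?
RULES = {
    "process_write": {
        "local_account": True,
        "cross_account": False,
        "local_region": True,
        "cross_region": False,
    },
    "process_read": {
        "local_account": True,
        "cross_account": True,
        "local_region": True,
        "cross_region": False,
    },
    "replicate": {
        "local_account": False,
        "cross_account": False,
        "local_region": False,
        "cross_region": True,
    },
}


def is_allowed(operation, scope):
    """Look up a single cell of the data access rules table."""
    try:
        return RULES[operation][scope]
    except KeyError:
        raise ValueError(f"unknown operation/scope: {operation}/{scope}") from None


print(is_allowed("process_read", "cross_account"))
```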

Evolution
• The needs of our platform are always changing
• Keen to explore other approaches
– Iceberg (https://github.com/Netflix/iceberg)
– Amazon Glue (https://aws.amazon.com/glue/)

hive> set hive.metastore.uris;
thrift://hcom-main-metastore:9083
thrift://hcom-datasci-metastore:9083

Data Sharing Pattern: Problems
• Security model: unwieldy, indiscriminate
– Enforced at networking and account layers
• Tool integration issues: not everything goes via Hive Thrift API
– Amazon Glue (uses MetaStoreClient API)
– Qubole UI 'Explore' pane (uses JDBC → Metastore DB)
