This presentation describes the journey taken by the Hotels.com big data platform team when tasked with migrating big data sets and pipelines from on-premises clusters to cloud-based platforms. We present two open source tools that we built to overcome the unexpected challenges we faced.
The first of these is Circus Train, a dataset replication tool that copies Hive tables between clusters and clouds. We will also discuss other options for dataset replication and the features unique to Circus Train. The second tool is Waggle Dance, a federated Hive query service that enables querying of data stored across multiple Hive metastores. We will demonstrate the differences between Waggle Dance and existing federated SQL query engines, and the use cases it enables. Using real-world examples, we will describe how we've used these tools to build a petabyte-scale platform that is now also used by other brands within the Expedia organisation. We focus on actual problems and solutions that have arisen in a huge, organically grown corporation, rather than on idealised architectures.
Speakers
Adrian Woodhead, Principal Engineer, Hotels.com
Elliot West, Senior Engineer, Hotels.com
2. Agenda
Cloud migration plan
Circus Train: A tool for data set replication
Cross account data sharing
Waggle Dance: A tool for query federation
Apiary: Data sharing pattern
9. Finding a solution
• Considered multiple open source and proprietary solutions, but…
– Minimal/no support for replicating metadata
– Lacking support for data consistency
– Required “root” access to Hive metastore
• Reluctantly created our own to unblock cloud migration
16. Circus Train: Now
• Core end user data sets constantly replicated in cloud
• In production for over 2 years
• Well over 1PB of data replicated
• Widely used across Expedia
• Good progress moving analysts and data scientists over to cloud
• Engineering teams now have more time to migrate their jobs
• Contributed to open source: goo.gl/byPXNp
23. Metadata silos: User perspective
hive> show databases;
Main account    Data Science account
default         default
etl             ml
24. Metadata silos: User perspective
Main account:
hive> describe formatted etl.hotel;
# col_name   data_type
id           int
name         string
# Location: s3://hcom-main/etl/hotel/
Data Science account:
hive> describe formatted ml.hero_image;
# col_name   data_type
id           int
hotel_id     int
img_name     string
# Location: s3://hcom-datasci/ml/hero/
25. Metadata federation across accounts
Data Science account (with federated HMS):
hive> set hive.metastore.uris=thrift://waggle-dance:48869;
hive> show databases;
default
etl
ml
27. Our journey to this point
• Solutions:
– Users operating in the cloud
– Methods for migrating datasets
– Ability to share datasets
• Problems:
– Ad hoc deployments of Circus Train and Waggle Dance
– Operationally complex: networking, security
31. Data Sharing Pattern: Retrospective
By embracing the topology of our platform, and the use of our tools, we were free to consider the opportunities.
                               Then                   Now
Inter-region data replication  Operational burden     Disaster Recovery
Segmented accounts             Inconvenient boundary  Architectural primitive
32. Your journey
Our problems may not be yours
• Simple solutions can work well
– Hive CTAS (Replication / Migration)
– Monolithic metastore (Data sharing and discovery)
• Platform consistency allows the adoption of standard solutions
– Unified account
– Curated toolsets
– Standard patterns and conventions
35. Problem: Overloading Database Namespace
• Databases are the only unit of federation
• Every metastore has a default database
• Common database names: etl, lz
• How to combine homonymous databases?
• How to circumvent the problem entirely?
36. Database Namespace: Manual mode
• Default behaviour
• Federated databases curated by white-list expressions:
[ml,…],[hcom_*,ean_*],[*]
• Works well combined with a global database naming standard:
${brand}_${account}_${dbname} → hcom_datasci_ml
• Limitation: Name overloads
– On starting: Fail fast
– When running: Ignore/mask
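The manual-mode behaviour on this slide can be illustrated with a minimal Python sketch. It is not Waggle Dance's implementation: the function name `federate_manual`, the use of `fnmatch`-style globbing for white-list expressions, and the rule that the first metastore to expose a name wins when an overload is masked are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def federate_manual(primary_dbs, federated, fail_fast=True):
    """Map each visible database name to the metastore that serves it.

    primary_dbs: database names in the primary metastore (all visible).
    federated:   {metastore_name: (database_names, whitelist_patterns)}.
    fail_fast:   True models "on starting", False models "when running".
    """
    visible = {db: "primary" for db in primary_dbs}
    for metastore, (dbs, patterns) in federated.items():
        for db in dbs:
            # Only white-listed databases are ever exposed (glob matching
            # is an assumption about the expression syntax).
            if not any(fnmatchcase(db, p) for p in patterns):
                continue
            if db in visible:
                if fail_fast:
                    raise ValueError(
                        f"database '{db}' is overloaded by "
                        f"{visible[db]} and {metastore}"
                    )
                continue  # assumed masking rule: keep the earlier mapping
            visible[db] = metastore
    return visible


# Hypothetical example: a Data Science primary federating the main account.
print(federate_manual(
    ["default", "ml"],
    {"main": (["default", "etl", "hcom_main_etl"], ["hcom_*", "ean_*"])},
))
```

With a global naming standard such as `hcom_datasci_ml`, the white-list patterns never match a name that another metastore already exposes, which is why the slide notes that the two work well together.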
37. Database Namespace: Prefixed mode
• Prefix string provided for each federated metastore (not primary)
• Prefix applied to all databases in a given metastore
• Exploration: Provides access to all databases, no overloads
• Limitation: Scripts are not portable across accounts
Metastore  Prefix  Database  Mapped name
primary    (none)  etl       etl
datasci    ds_     etl       ds_etl
analytics  al_     default   al_default
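The prefix mapping in the table above can be sketched in a few lines of Python. The prefixes and database names come from the slide; the function names `to_mapped_name` and `resolve`, and the way a mapped name is routed back to its metastore, are hypothetical and only illustrate the idea.

```python
# Prefixes from the slide; the primary metastore has no prefix.
PREFIXES = {"datasci": "ds_", "analytics": "al_"}


def to_mapped_name(metastore, database, prefixes=PREFIXES):
    """Return the name under which a database is exposed to users."""
    return prefixes.get(metastore, "") + database


def resolve(mapped_name, prefixes=PREFIXES, primary="primary"):
    """Route a mapped name back to (metastore, original database name).

    Assumption: a name that starts with a known prefix always belongs to
    that federated metastore, so a primary database that happened to be
    called 'ds_x' would be misrouted in this simplified sketch.
    """
    for metastore, prefix in prefixes.items():
        if mapped_name.startswith(prefix):
            return metastore, mapped_name[len(prefix):]
    return primary, mapped_name


print(to_mapped_name("datasci", "etl"))  # ds_etl
print(resolve("al_default"))             # ('analytics', 'default')
```

Because the mapped name depends on which metastore is primary, a script that references `ds_etl` from one account would need to reference `etl` when run inside the Data Science account, which is the portability limitation the slide mentions.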
38. Data access rules
               Local Account  Cross Account  Local Region  Cross Region
Process Write  ✓              x              ✓             x
Process Read   ✓              ✓              ✓             x
Replicate      x              x              x             ✓
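As a rough illustration, the data access rules on this slide can be encoded as a lookup table and checked programmatically. The values are copied from the slide; the identifiers (`RULES`, `is_allowed`, and the snake_case keys) are hypothetical, and the sketch checks one scope at a time because the slide does not say how account and region scopes combine.

```python
# Rule matrix transcribed from the slide: operation -> scope -> allowed?
RULES = {
    "process_write": {
        "local_account": True, "cross_account": False,
        "local_region": True, "cross_region": False,
    },
    "process_read": {
        "local_account": True, "cross_account": True,
        "local_region": True, "cross_region": False,
    },
    "replicate": {
        "local_account": False, "cross_account": False,
        "local_region": False, "cross_region": True,
    },
}


def is_allowed(operation, scope):
    """Return True if the operation is permitted for the given scope."""
    return RULES[operation][scope]


print(is_allowed("process_read", "cross_account"))  # True
print(is_allowed("process_write", "cross_region"))  # False
```

Read this way, the matrix says that processing stays within a region, reads may cross accounts, and replication is the only operation that crosses regions.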
39. Evolution
• The needs of our platform are always changing
• Keen to explore other approaches
– Iceberg (https://github.com/Netflix/iceberg)
– Amazon Glue (https://aws.amazon.com/glue/)
40. Metadata silos: User perspective
hive> set hive.metastore.uris;
Main account:          thrift://hcom-main-metastore:9083
Data Science account:  thrift://hcom-datasci-metastore:9083
41. Data Sharing Pattern: Problems
• Security model: unwieldy, indiscriminate
– Enforced at networking and account layers
• Tool integration issues: not everything goes via Hive Thrift API
– Amazon Glue (uses MetaStoreClient API)
– Qubole UI ’Explore’ pane (uses JDBC → Metastore DB)