Gap Inc Direct, the online division of Gap Inc., uses HBase to serve, in real time, the apparel catalog for all of its brands' and markets' websites. This case study reviews the business case as well as key decisions regarding schema selection and cluster configuration, and discusses implementation challenges and lessons learned.
2. Suraj Varma
Director of Technology Implementation
Gap Inc Direct (GID), San Francisco, CA
IRC: svarma
Gupta Gogula
Director-IT & Domain Architect of Catalog Management & Distribution
Gap Inc Direct (GID), San Francisco, CA
4. [Diagram: 2005-2010 timeline of brand and market launches (Piperlime, Athleta, CA site launch, Universality, EU markets), with incoming traffic fanning out to a growing set of application servers and databases per brand/market]
5. Evolution of the GID Apparel Catalog
2005 – three independent brands in the US
2010 – five integrated brands in the US, CA, EU
Rapid expansion of the apparel catalog
However, each brand/market combination necessitated a separate logical catalog database
6. Single catalog store for all brands/markets
Horizontally scalable over time
Cross-brand business features
Access the data store directly
to take advantage of item inventory awareness
Minimal caching – only for optimization
keeping caches in sync is a problem
Highly available
7. Sharded RDBMS, Memcached, etc.
Significant effort was required
Still had scalability limits
Non-relational alternatives considered
HBase POC (early 2010)
Promising results – decided to move ahead
8. Strong Consistency Model
Server-Side Filters
Automatic Sharding, Distribution, Failover
Hadoop Integration out of the box
General Purpose
Other use cases outside of Catalog
Strong Community!
9. [Diagram: incoming requests for catalog data hit the backend services, which issue requests to the HBase cluster; mutations flow in from near-real-time inventory updates, pricing updates, and item updates]
10. Read Mostly: Website Traffic
Write/Delete Bursts: Catalog Publish
Sync MR jobs on the live cluster
Phasing out to near-real-time updates from originating systems
Continuous Writes: Inventory Updates
11. Hierarchical Data (Primarily)
SKU -> Style lookups (child -> parent)
Cross-brand sell (sibling <-> sibling)
Rows: 100 KB avg size, 1000-5000 cols, sparse
Data Access Patterns:
Full product graph in one read
Single path of the graph, from root to leaf node
Search – secondary indices
Large feed files
13. Built a custom "bean to schema mapper"
POJO graph <-> HBase qualifiers
Flexibility to shorten column qualifiers
Flexibility to change schema qualifiers (per environment/developer)
<…>
<association>one-to-many</association>
<prefix>SC</prefix>
<uniqueId>colorCd</uniqueId>
<beanName>styleColorBean</beanName>
<…>
14. <PP>_<id1>_QQ_<id2>_RR_<id3>_name
Where PP is parent, QQ is child, RR is grandchild
Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME
cf1:VAR_1_SC_0012_colorCd
cf2:VAR_1_SC_0012_SCIMG_10_path
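A minimal Java sketch of this pattern; the table name, row key, and values are assumptions for illustration, not from the deck:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AncestorQualifiers {
    // Builds <PP>_<id1>_QQ_<id2>_..._name, e.g. qualifier("colorCd", "VAR", "1", "SC", "0012")
    // returns "VAR_1_SC_0012_colorCd".
    static byte[] qualifier(String name, String... prefixIdPairs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < prefixIdPairs.length; i += 2) {
            sb.append(prefixIdPairs[i]).append('_').append(prefixIdPairs[i + 1]).append('_');
        }
        return Bytes.toBytes(sb.append(name).toString());
    }

    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical table
        Put put = new Put(Bytes.toBytes("STYLE_12345"));                   // hypothetical row key
        // cf1:VAR_1_SC_0012_colorCd and cf2:VAR_1_SC_0012_SCIMG_10_path, as on the slide.
        put.add(Bytes.toBytes("cf1"), qualifier("colorCd", "VAR", "1", "SC", "0012"),
                Bytes.toBytes("Indigo"));
        put.add(Bytes.toBytes("cf2"), qualifier("path", "VAR", "1", "SC", "0012", "SCIMG", "10"),
                Bytes.toBytes("/img/style/12345_0012_10.jpg"));
        table.put(put);
        table.close();
    }
}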
15. Secondary Index
<id3> => RR ; QQ ; PP
FilterList with the (RR, QQ, PP) ids to get a thin-slice path
Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS
KEY_5555 => 4444 ; 333 ; 22 ; 1
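A hedged sketch of the lookup: read the ancestor chain from a hypothetical index table, then build a FilterList of qualifier prefixes so only the thin slice of the product graph comes back. All table names, families, and the separator are assumptions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.util.Bytes;

public class ThinSliceLookup {
    public static void main(String[] args) throws Exception {
        HTable index = new HTable(HBaseConfiguration.create(), "catalog_idx"); // hypothetical
        HTable catalog = new HTable(HBaseConfiguration.create(), "catalog");   // hypothetical

        // Index row KEY_5555 holds the ancestor chain "4444;333;22;1" (leaf -> root).
        Result idx = index.get(new Get(Bytes.toBytes("KEY_5555")));
        String[] ids = Bytes.toString(
                idx.getValue(Bytes.toBytes("cf"), Bytes.toBytes("ancestors"))).split(";");

        // A qualifier-prefix filter built from the (RR, QQ, PP) ids keeps only
        // the single root->leaf slice instead of the full row.
        FilterList slice = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        slice.addFilter(new ColumnPrefixFilter(
                Bytes.toBytes("PP_" + ids[2] + "_QQ_" + ids[1] + "_RR_" + ids[0])));
        Get get = new Get(Bytes.toBytes("KEY_" + ids[3]));   // hypothetical row-key scheme
        get.setFilter(slice);
        Result path = catalog.get(get);
        index.close();
        catalog.close();
    }
}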
16. "Publish at Midnight"
Future-dated PUTs
Get/Scan with a time range
Large Feed Files
Sharded into smaller chunks, < 2 MB per cell
Pattern: SHARDED CHUNKS
KEY_nnnn => S_1 | S_2 | S_3 | S_4
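A sketch of both patterns under assumed names: a future-dated PUT that becomes visible at midnight because readers cap their time range at "now", plus a feed value sharded into < 2 MB chunk cells:

import java.util.Arrays;
import java.util.Calendar;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MidnightPublish {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical table

        // Future-dated PUT: stamp the cell with tonight's midnight.
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DAY_OF_MONTH, 1);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        long midnight = cal.getTimeInMillis();

        Put put = new Put(Bytes.toBytes("STYLE_12345"));   // hypothetical row key
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("price"), midnight, Bytes.toBytes("49.95"));

        // Sharded chunks: a large feed value split across qualifiers S_1..S_n, < 2 MB each.
        byte[] feed = new byte[5 * 1024 * 1024];           // stand-in for a real feed file
        int chunk = 2 * 1024 * 1024;
        for (int off = 0, n = 1; off < feed.length; off += chunk, n++) {
            put.add(Bytes.toBytes("cf1"), Bytes.toBytes("S_" + n),
                    Arrays.copyOfRange(feed, off, Math.min(feed.length, off + chunk)));
        }
        table.put(put);

        // Readers cap the time range at "now"; the future-dated price becomes
        // visible automatically once midnight passes.
        Get get = new Get(Bytes.toBytes("STYLE_12345"));
        get.setTimeRange(0, System.currentTimeMillis());
        Result visible = table.get(get);
        table.close();
    }
}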
19. Quick Recovery on Node Failure
Default timeouts are too large (illustrative settings below)
Region Server: zookeeper.session.timeout, hbase.rpc.timeout
Data Node: dfs.heartbeat.recheck.interval
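Illustrative settings only; the deck gives no values, so these are assumptions to be tuned against your own failover SLAs:

<!-- hbase-site.xml -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- detect a dead region server in seconds, not minutes -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>30000</value>
</property>
<!-- hdfs-site.xml -->
<property>
  <name>dfs.heartbeat.recheck.interval</name>
  <value>45000</value> <!-- shortens the window before a data node is declared dead -->
</property>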
20. Block Cache Size Tuning
Block cache churn / hot-row scenarios
Perf tests & phased rollouts
Hot region issues
Perf tests & pre-split regions (see the sketch after this list)
Filters
CPU intensive – profiling needed
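A minimal sketch of pre-splitting at table-creation time so known-hot key ranges start out spread across region servers; the table name and split points are assumptions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("catalog");   // hypothetical table
        desc.addFamily(new HColumnDescriptor("cf1"));
        // Illustrative split points; real ones come from the key distribution
        // observed in perf tests.
        byte[][] splits = {
            Bytes.toBytes("STYLE_2"), Bytes.toBytes("STYLE_4"),
            Bytes.toBytes("STYLE_6"), Bytes.toBytes("STYLE_8")
        };
        admin.createTable(desc, splits);   // table starts with five regions
        admin.close();
    }
}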
21. Monitoring is crucial
Layer by layer -> where is the bottleneck?
Metrics to target optimization & tuning
Troubleshooting
Non-uniform hardware
Sub-optimal region distribution
Hefty boxes lightly loaded
22. M/R jobs running on the live cluster
Have an impact – so they cannot run at full throttle
Go easy …
Feature enablement – phase it in
Don't turn on several features together
Easier identification of potential hot regions/rows, overloaded region servers, etc.
23. [Diagram: each newly enabled feature adds load on the HBase cluster via the backend services: feature "A" adds "N" req/sec, feature "B" adds "K" req/sec, on top of the ongoing inventory, pricing, and item updates]
Enable features individually to measure impact and tune the cluster accordingly
24. Search
No out-of-the-box secondary indexes
Custom solution with Solr
Transactions
Only row-level atomicity (see the sketch after this list)
But … can't pack everything into a single row
Atomic cross-row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+)
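A minimal sketch of what row-level atomicity buys, and what it does not; the table, family, and qualifier names are illustrative, not from the deck:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicity {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical
        // All cells in one Put against one row commit atomically ...
        Put put = new Put(Bytes.toBytes("STYLE_12345"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("price"), Bytes.toBytes("49.95"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("salePrice"), Bytes.toBytes("39.95"));
        table.put(put);
        // ... but two Puts to different rows carry no such guarantee, hence the
        // interest in atomic cross-row mutations (HBASE-5229).
        table.close();
    }
}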
25. Orthogonal access patterns
Optimize for the most frequently used pattern
Filters
May suffice, with early-out configurations (see the sketch after this list)
Impact CPU usage
Duplicating data for every access pattern
Too drastic
Effort to keep all copies in sync
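One way to get an "early out" is HBase's WhileMatchFilter, which ends the whole scan the first time its wrapped filter fails; whether this matches the deck's exact configuration is an assumption, and all names below are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.WhileMatchFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EarlyOutScan {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical
        // WhileMatchFilter stops the scan at the first row outside the prefix
        // instead of filtering (and burning CPU) all the way to the table's end.
        Scan scan = new Scan(Bytes.toBytes("STYLE_"));
        scan.setFilter(new WhileMatchFilter(new PrefixFilter(Bytes.toBytes("STYLE_"))));
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            // process row ...
        }
        scanner.close();
        table.close();
    }
}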
26. Rebuild from source data
Takes time … but no data loss
Export/import based backups
Faster … but stale
Another MR job on the live cluster (example commands below)
Better options in future releases …
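For reference, the export/import jobs ship with HBase as MapReduce drivers; the table name and output path here are illustrative:

hbase org.apache.hadoop.hbase.mapreduce.Export catalog /backups/catalog-20120501
hbase org.apache.hadoop.hbase.mapreduce.Import catalog /backups/catalog-20120501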