Gap Inc Direct, the online division of Gap Inc., uses HBase to serve, in real time, the apparel catalog for all of its brands' and markets' websites. This case study reviews the business case as well as key decisions regarding schema selection and cluster configuration, and discusses implementation challenges and lessons learned.
2. Suraj Varma
Director of Technology Implementation
Gap Inc Direct (GID), San Francisco, CA
IRC: svarma
Gupta Gogula
Director-IT & Domain Architect of Catalog Management & Distribution
Gap Inc Direct (GID), San Francisco, CA
4. [Diagram: 2005-2010 timeline of brand and market launches (Piperlime, Athleta, CA site launch, Universality, EU markets), with incoming traffic fanning out to a growing set of application servers and databases per brand/market]
5. Evolution of the GID Apparel Catalog
2005 – three independent brands in the US
2010 – five integrated brands in the US, CA, EU
Rapid expansion of the apparel catalog
However, each brand/market combination necessitated a separate logical catalog database
6. Single catalog store for all brands/markets
Horizontally scalable over time
Cross-brand business features
Access the data store directly
to take advantage of item inventory awareness
Minimal caching – only for optimization
keeping caches in sync is a problem
Highly available
7. Sharded RDBMS, Memcached, etc.
Significant effort was required
Still had scalability limits
Non-relational alternatives considered
HBase POC (early 2010)
Promising results – decided to move ahead
8. Strong Consistency Model
Server-Side Filters
Automatic Sharding, Distribution, Failover
Hadoop Integration out of the box
General Purpose
Other use cases outside of Catalog
Strong Community!
9. [Diagram: incoming requests for catalog data hit the backend services, which issue requests to the HBase cluster; mutations flow in from near-real-time inventory updates, pricing updates, and item updates]
10. Read Mostly: Website Traffic
Write/Delete Bursts: Catalog Publish
Sync MR jobs on the live cluster
Phasing out to near-real-time updates from originating systems
Continuous Writes: Inventory Updates
11. Hierarchical Data (Primarily)
SKU -> Style lookups (child -> parent)
Cross-brand sell (sibling <-> sibling)
Rows: 100 KB avg size, 1000-5000 cols, sparse
Data Access Patterns:
Full product graph in one read
Single path of the graph, from root to leaf node
Search – secondary indices
Large feed files
13. Built a custom "bean to schema mapper"
POJO graph <-> HBase qualifiers
Flexibility to shorten column qualifiers
Flexibility to change schema qualifiers (per environment/developer)
<…>
<association>one-to-many</association>
<prefix>SC</prefix>
<uniqueId>colorCd</uniqueId>
<beanName>styleColorBean</beanName>
<…>
14. <PP>_<id1>_QQ_<id2>_RR_<id3>_name
Where PP is parent, QQ is child, RR is grandchild
Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME
cf1:VAR_1_SC_0012_colorCd
cf2:VAR_1_SC_0012_SCIMG_10_path
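A minimal Java sketch of this pattern; the table name, row key, and values are assumptions for illustration, not from the deck:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AncestorQualifiers {
    // Builds <PP>_<id1>_QQ_<id2>_..._name, e.g. qualifier("colorCd", "VAR", "1", "SC", "0012")
    // returns "VAR_1_SC_0012_colorCd".
    static byte[] qualifier(String name, String... prefixIdPairs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < prefixIdPairs.length; i += 2) {
            sb.append(prefixIdPairs[i]).append('_').append(prefixIdPairs[i + 1]).append('_');
        }
        return Bytes.toBytes(sb.append(name).toString());
    }

    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical table
        Put put = new Put(Bytes.toBytes("STYLE_12345"));                   // hypothetical row key
        // cf1:VAR_1_SC_0012_colorCd and cf2:VAR_1_SC_0012_SCIMG_10_path, as on the slide.
        put.add(Bytes.toBytes("cf1"), qualifier("colorCd", "VAR", "1", "SC", "0012"),
                Bytes.toBytes("Indigo"));
        put.add(Bytes.toBytes("cf2"), qualifier("path", "VAR", "1", "SC", "0012", "SCIMG", "10"),
                Bytes.toBytes("/img/style/12345_0012_10.jpg"));
        table.put(put);
        table.close();
    }
}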
15. Secondary Index
<id3> => RR ; QQ ; PP
FilterList with the (RR, QQ, PP) ids to get a thin-slice path
Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS
KEY_5555 => 4444 ; 333 ; 22 ; 1
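A hedged sketch of the lookup: read the ancestor chain from a hypothetical index table, then build a FilterList of qualifier prefixes so only the thin slice of the product graph comes back. All table names, families, and the separator are assumptions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.util.Bytes;

public class ThinSliceLookup {
    public static void main(String[] args) throws Exception {
        HTable index = new HTable(HBaseConfiguration.create(), "catalog_idx"); // hypothetical
        HTable catalog = new HTable(HBaseConfiguration.create(), "catalog");   // hypothetical

        // Index row KEY_5555 holds the ancestor chain "4444;333;22;1" (leaf -> root).
        Result idx = index.get(new Get(Bytes.toBytes("KEY_5555")));
        String[] ids = Bytes.toString(
                idx.getValue(Bytes.toBytes("cf"), Bytes.toBytes("ancestors"))).split(";");

        // A qualifier-prefix filter built from the (RR, QQ, PP) ids keeps only
        // the single root->leaf slice instead of the full row.
        FilterList slice = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        slice.addFilter(new ColumnPrefixFilter(
                Bytes.toBytes("PP_" + ids[2] + "_QQ_" + ids[1] + "_RR_" + ids[0])));
        Get get = new Get(Bytes.toBytes("KEY_" + ids[3]));   // hypothetical row-key scheme
        get.setFilter(slice);
        Result path = catalog.get(get);
        index.close();
        catalog.close();
    }
}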
16. "Publish at Midnight"
Future-dated PUTs
Get/Scan with a time range
Large Feed Files
Sharded into smaller chunks, < 2 MB per cell
Pattern: SHARDED CHUNKS
KEY_nnnn => S_1 | S_2 | S_3 | S_4
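A sketch of both patterns under assumed names: a future-dated PUT that becomes visible at midnight because readers cap their time range at "now", plus a feed value sharded into < 2 MB chunk cells:

import java.util.Arrays;
import java.util.Calendar;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MidnightPublish {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical table

        // Future-dated PUT: stamp the cell with tonight's midnight.
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DAY_OF_MONTH, 1);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        long midnight = cal.getTimeInMillis();

        Put put = new Put(Bytes.toBytes("STYLE_12345"));   // hypothetical row key
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("price"), midnight, Bytes.toBytes("49.95"));

        // Sharded chunks: a large feed value split across qualifiers S_1..S_n, < 2 MB each.
        byte[] feed = new byte[5 * 1024 * 1024];           // stand-in for a real feed file
        int chunk = 2 * 1024 * 1024;
        for (int off = 0, n = 1; off < feed.length; off += chunk, n++) {
            put.add(Bytes.toBytes("cf1"), Bytes.toBytes("S_" + n),
                    Arrays.copyOfRange(feed, off, Math.min(feed.length, off + chunk)));
        }
        table.put(put);

        // Readers cap the time range at "now"; the future-dated price becomes
        // visible automatically once midnight passes.
        Get get = new Get(Bytes.toBytes("STYLE_12345"));
        get.setTimeRange(0, System.currentTimeMillis());
        Result visible = table.get(get);
        table.close();
    }
}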
19. Quick Recovery on Node Failure
Default timeouts are too large (illustrative settings below)
Region Server: zookeeper.session.timeout, hbase.rpc.timeout
Data Node: dfs.heartbeat.recheck.interval
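Illustrative settings only; the deck gives no values, so these are assumptions to be tuned against your own failover SLAs:

<!-- hbase-site.xml -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- detect a dead region server in seconds, not minutes -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>30000</value>
</property>
<!-- hdfs-site.xml -->
<property>
  <name>dfs.heartbeat.recheck.interval</name>
  <value>45000</value> <!-- shortens the window before a data node is declared dead -->
</property>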
20. Block Cache Size Tuning
Block cache churn / hot-row scenarios
Perf tests & phased rollouts
Hot region issues
Perf tests & pre-split regions (see the sketch after this list)
Filters
CPU intensive – profiling needed
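A minimal sketch of pre-splitting at table-creation time so known-hot key ranges start out spread across region servers; the table name and split points are assumptions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("catalog");   // hypothetical table
        desc.addFamily(new HColumnDescriptor("cf1"));
        // Illustrative split points; real ones come from the key distribution
        // observed in perf tests.
        byte[][] splits = {
            Bytes.toBytes("STYLE_2"), Bytes.toBytes("STYLE_4"),
            Bytes.toBytes("STYLE_6"), Bytes.toBytes("STYLE_8")
        };
        admin.createTable(desc, splits);   // table starts with five regions
        admin.close();
    }
}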
21. Monitoring is crucial
Layer by layer -> where is the bottleneck?
Metrics to target optimization & tuning
Troubleshooting
Non-uniform hardware
Sub-optimal region distribution
Hefty boxes lightly loaded
22. M/R jobs running on the live cluster
Have an impact – so they cannot run at full throttle
Go easy …
Feature enablement – phase it in
Don't turn on several features together
Easier identification of potential hot regions/rows, overloaded region servers, etc.
23. [Diagram: each newly enabled feature adds load on the HBase cluster via the backend services: feature "A" adds "N" req/sec, feature "B" adds "K" req/sec, on top of the ongoing inventory, pricing, and item updates]
Enable features individually to measure impact and tune the cluster accordingly
24. Search
No out-of-the-box secondary indexes
Custom solution with Solr
Transactions
Only row-level atomicity (see the sketch after this list)
But … can't pack everything into a single row
Atomic cross-row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+)
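A minimal sketch of what row-level atomicity buys, and what it does not; the table, family, and qualifier names are illustrative, not from the deck:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicity {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical
        // All cells in one Put against one row commit atomically ...
        Put put = new Put(Bytes.toBytes("STYLE_12345"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("price"), Bytes.toBytes("49.95"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("salePrice"), Bytes.toBytes("39.95"));
        table.put(put);
        // ... but two Puts to different rows carry no such guarantee, hence the
        // interest in atomic cross-row mutations (HBASE-5229).
        table.close();
    }
}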
25. Orthogonal access patterns
Optimize for the most frequently used pattern
Filters
May suffice, with early-out configurations (see the sketch after this list)
Impact CPU usage
Duplicating data for every access pattern
Too drastic
Effort to keep all copies in sync
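One way to get an "early out" is HBase's WhileMatchFilter, which ends the whole scan the first time its wrapped filter fails; whether this matches the deck's exact configuration is an assumption, and all names below are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.WhileMatchFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EarlyOutScan {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "catalog"); // hypothetical
        // WhileMatchFilter stops the scan at the first row outside the prefix
        // instead of filtering (and burning CPU) all the way to the table's end.
        Scan scan = new Scan(Bytes.toBytes("STYLE_"));
        scan.setFilter(new WhileMatchFilter(new PrefixFilter(Bytes.toBytes("STYLE_"))));
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            // process row ...
        }
        scanner.close();
        table.close();
    }
}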
26. Rebuild from source data
Takes time … but no data loss
Export/import based backups
Faster … but stale
Another MR job on the live cluster (example commands below)
Better options in future releases …
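For reference, the export/import jobs ship with HBase as MapReduce drivers; the table name and output path here are illustrative:

hbase org.apache.hadoop.hbase.mapreduce.Export catalog /backups/catalog-20120501
hbase org.apache.hadoop.hbase.mapreduce.Import catalog /backups/catalog-20120501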