Presto Bangalore Meetup1 Repertoire@Myntra

Repertoire
Myntra’s Data Serving Platform
Deepak Batra
Nishant Sharma
Rijo Joseph

Myntra - What do we do?
● Started as a customisation company in 2007, Myntra is the
largest fashion e-tailer in India today 1, 2
● Acquired Jabong in 2016 to become India’s largest fashion
platform
● 60M+ app downloads for Myntra and Jabong apps on
Google Play store
● Myntra+Jabong list over 9M items for sale
● EORS (End of Reason Sale) is the flagship sale event, with over
2.8M orders fulfilled within 7 days
● Focus on innovation with AI, AR/VR and omni-channel
based products
1 Estimated from publicly available numbers and research reports for FY 2018
2 For the core fashion categories of apparel, accessories and footwear, FY 2018
[Figure: FY18 online share in fashion, see note 2]

Tech at Myntra
● Key tech focus areas for Myntra + Jabong
○ Storefront: apps and web platform
○ Supply chain: end-to-end inventory & order management
○ Data tech: powering data based insights and intelligent automation in all business
areas
● Data tech covers all the sources and consumers of data within Myntra+Jabong
● Data sources include
○ Streaming data from apps and IoT devices
○ Content and campaign management systems for storefront apps
○ Transactional data from supply chain systems like order management (OMS),
warehouse and logistics management (WMS and LMS)
● Data is processed and served in both realtime and batch modes
● Consumers of data include reports/dashboards, tech products and data science
models

Challenges
● Tiered SLAs
○ Low Latency Data serving
○ What to cache & How to cache
● Compute
○ Roll-ups and drill-downs on the fly
● Multi-modal
○ Support for Key-value, SQL type queries

Challenges
● Query Triaging
○ Execution based on SLAs
● Low Latency Ingestion
○ Low ingestion overhead for real time & batch data
● Fault tolerance and NFRs
○ Availability, Horizontally Scalable, Isolation

Open Source Solutions
● Apache Ignite
○ Pros:
■ Indexes
■ Disk backed Cache
○ Cons:
■ Batch Ingestion
■ Uncompressed Data in Cache
● Presto on S3
○ Pros:
■ Stability
■ Out of the Box
○ Cons:
■ No data co-locality
■ Movement to Azure
● Spark on Alluxio
○ Pros:
■ Data co-locality
■ In-memory Cache
○ Cons:
■ No fixed SLAs
■ Concurrency
● Presto on Alluxio
○ Pros:
■ Data co-locality
■ Consistent query SLAs
○ Cons:
■ No in-memory cache
■ Limited ML support

Reference Example
Distribution of sessions by Operating System (OS), City and Gender based on an
event type.
SQL Representation
SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
FROM events WHERE event_type = 'addToCart'
GROUP BY os, city, gender;
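As a point of comparison, the exact (non-approximate) version of this query can be sketched in a few lines of Python. This is only an illustration of what the query computes: the function name and the sample rows are hypothetical, and exact sets are used where the SQL uses HLL sketches.

```python
from collections import defaultdict

def sessions_by_dims(events, event_type):
    """Exact distinct-session counts per (os, city, gender) for one event type."""
    groups = defaultdict(set)
    for e in events:
        if e["event_type"] == event_type:
            # A set de-duplicates session ids, at the cost of O(N) memory per group.
            groups[(e["os"], e["city"], e["gender"])].add(e["session_id"])
    return {dims: len(sessions) for dims, sessions in groups.items()}

# Hypothetical sample rows, not taken from the talk.
events = [
    {"os": "android", "city": "Bangalore", "gender": "F", "session_id": "s1", "event_type": "addToCart"},
    {"os": "android", "city": "Bangalore", "gender": "F", "session_id": "s1", "event_type": "addToCart"},
    {"os": "android", "city": "Bangalore", "gender": "F", "session_id": "s2", "event_type": "addToCart"},
    {"os": "ios", "city": "Delhi", "gender": "M", "session_id": "s3", "event_type": "addToCart"},
    {"os": "ios", "city": "Delhi", "gender": "M", "session_id": "s4", "event_type": "search"},
]
result = sessions_by_dims(events, "addToCart")
```

The per-group sets are what make the exact approach expensive at scale, which is the cost the HLL-based query avoids.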

Metric Meta Store
Information about datasets and their storage
Constructs
● Namespaces
● Cubes
● Pre-fetcher
● Cache Manager
● Caching Policy

Prefetcher Service
Availability/Scheduling based dataset caching
● Extract smaller datasets and cache
Constructs
● Sources
● Transformations
● Fetch Frequency
● Cache Level
● Sinks
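One way to picture how these constructs might fit together is a small job description. The sketch below is purely hypothetical: the class, field names and values are assumptions made for illustration, and the talk does not show the actual Prefetcher configuration format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrefetchJob:
    """Hypothetical description of one prefetch job; all names are illustrative."""
    name: str
    sources: List[str]               # where the data is read from
    transformations: List[str]       # steps that shrink the dataset before caching
    fetch_frequency_minutes: int     # how often the job is scheduled
    cache_level: str                 # e.g. a storage tier label
    sinks: List[str] = field(default_factory=list)  # where the result is written

    def is_due(self, minutes_since_last_run: int) -> bool:
        # A job is due once its fetch interval has elapsed.
        return minutes_since_last_run >= self.fetch_frequency_minutes

job = PrefetchJob(
    name="add_to_cart_sessions",
    sources=["events"],
    transformations=["filter event_type = 'addToCart'",
                     "project os, city, gender, session_id"],
    fetch_frequency_minutes=60,
    cache_level="MEM",
    sinks=["alluxio"],
)
```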

Alluxio
● Open-source virtual distributed file system
● Memory-centric architecture

Alluxio
● Data Locality and short-circuit
● Tiered Storage
● Multiple caching policies: LRU, LRFU, FIFO
● Pluggable under storage
● Pin/unpin data
Performance tuning:
● Read location policy: DeterministicHashPolicy
● Disabled passive cache
● Write location policy: RoundRobinPolicy
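The idea behind the two location policies can be sketched conceptually. The code below is not Alluxio's implementation and uses no Alluxio API: the function, class and worker names are hypothetical. It only shows why a deterministic hash sends repeated reads of the same block to the same worker, while round robin spreads writes evenly.

```python
import hashlib
from itertools import cycle

def deterministic_hash_worker(block_id, workers):
    """Pick the same worker for the same block id on every call."""
    digest = hashlib.md5(str(block_id).encode("utf-8")).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]

class RoundRobinWriter:
    """Hand out workers in rotation so writes are spread evenly."""
    def __init__(self, workers):
        self._workers = cycle(workers)

    def next_worker(self):
        return next(self._workers)

workers = ["worker-1", "worker-2", "worker-3"]  # hypothetical worker names

# Two reads of the same block land on the same worker,
# so the block is cached once rather than on several workers.
first = deterministic_hash_worker(42, workers)
second = deterministic_hash_worker(42, workers)

writer = RoundRobinWriter(workers)
write_targets = [writer.next_worker() for _ in range(4)]
```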

HyperLogLogPlus
● Probabilistic cardinality estimation algorithm
● Why?
○ Approximate cardinality without O(N) memory
SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
FROM events WHERE event_type = 'addToCart'
GROUP BY os, city, gender;
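To make the merge-then-count pattern in the query concrete, here is a minimal sketch of plain HyperLogLog in Python. It is an assumption-laden illustration, not the UDAF used in the talk: the class name is hypothetical, and it omits the sparse mode and bias correction that HyperLogLogPlus adds.

```python
import hashlib
import math

class SimpleHLL:
    """Minimal plain HyperLogLog (no sparse mode, no HLL++ bias correction)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(str(value).encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

# Two partial sketches with overlapping session ids, merged before counting.
a, b = SimpleHLL(), SimpleHLL()
for i in range(0, 6000):
    a.add(f"session-{i}")
for i in range(4000, 10000):
    b.add(f"session-{i}")
a.merge(b)
estimate = a.cardinality()   # true distinct count is 10000
```

Each sketch holds a fixed number of small registers regardless of how many session ids it has seen, and merging is a register-wise maximum, which is why partial aggregates can be combined cheaply.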

HyperLogLogPlus
● Precision parameters
○ p: tunes accuracy in dense mode
○ sp: controls precision in sparse mode
● Relative accuracy: 1.054 / sqrt(2^p)
● Implemented as a UDAF for both Spark and Presto
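The accuracy formula above is easy to evaluate for a few values of p. The short sketch below does so, assuming the usual HyperLogLog layout of 2^p registers in dense mode; the function names are illustrative.

```python
import math

def relative_accuracy(p):
    """Approximate relative accuracy from the formula 1.054 / sqrt(2^p)."""
    return 1.054 / math.sqrt(2 ** p)

def dense_registers(p):
    """Number of registers in dense mode, assuming the usual 2^p layout."""
    return 2 ** p

for p in (10, 12, 14, 16):
    print(f"p={p}: registers={dense_registers(p)}, "
          f"relative accuracy ~{relative_accuracy(p):.4%}")
```

Each increase of p by 2 quadruples the register count and halves the error, so the choice of p is a direct memory versus accuracy trade-off.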

Event Agnostic Aggregates
● Read only required data/event(s)
● Partition by events?
○ Too many small files
● Global sort?
○ Too expensive
● Bloom filters?
○ Not supported by Presto
● Localize data and sort within partitions!

ORC Optimizations
● Sorting:
○ bin partitioner
○ sort within partition
● File size ~1 GB, to limit the number of files
● Stripe size ~64 MB
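The effect of binning and sorting can be sketched in Python. ORC keeps min/max statistics per stripe, so when rows are sorted by event type within a partition, a filter on one event type can skip stripes whose range excludes it. The code below is a simplified model under stated assumptions: the hash-based bin assignment, the row-count-based stripes and all function names are hypothetical, and the talk does not describe how its bin partitioner assigns bins.

```python
import zlib
from collections import defaultdict

def bin_of(key, num_bins):
    # Stable hash, so the same event type always lands in the same bin.
    return zlib.crc32(key.encode("utf-8")) % num_bins

def bin_and_sort(rows, num_bins, key="event_type"):
    """Group rows into bins, then sort within each bin (no global sort)."""
    bins = defaultdict(list)
    for row in rows:
        bins[bin_of(row[key], num_bins)].append(row)
    return {b: sorted(part, key=lambda r: r[key]) for b, part in bins.items()}

def stripes_with_stats(sorted_rows, stripe_rows, key="event_type"):
    """Cut a sorted partition into stripes and record min/max of the key."""
    stripes = []
    for i in range(0, len(sorted_rows), stripe_rows):
        chunk = sorted_rows[i:i + stripe_rows]
        stripes.append({"min": chunk[0][key], "max": chunk[-1][key], "rows": chunk})
    return stripes

def stripes_to_read(stripes, wanted):
    """Keep only stripes whose [min, max] range can contain the wanted key."""
    return [s for s in stripes if s["min"] <= wanted <= s["max"]]

# Hypothetical interleaved input: 4 rows for each of 3 event types.
rows = [{"event_type": t, "session_id": f"s{i}"}
        for i in range(4) for t in ("view", "addToCart", "search")]
partitions = bin_and_sort(rows, num_bins=1)
stripes = stripes_with_stats(partitions[0], stripe_rows=4)
hits = stripes_to_read(stripes, "search")
```

In this toy input, sorting leaves each stripe holding a single event type, so a query for one event type touches one stripe out of three.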

Funnel Analysis
Funnel Aggregate
def funnel(funnel_def, events_list) => [1, 1, 0]
device_id  session_id  dim1  dim2  events
d1         s1          v1    v2    [e1,e2,e3,e4...]
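The slide gives only the signature and a sample output for the funnel aggregate. A minimal sketch consistent with that output, assuming an ordered funnel where each step counts only after the previous step has been reached, might look like this:

```python
def funnel(funnel_def, events_list):
    """Return one 0/1 flag per funnel step, in order.

    Assumes ordered semantics: step k is marked only if steps 0..k-1
    were matched earlier in the session's event list.
    """
    reached = [0] * len(funnel_def)
    step = 0
    for event in events_list:
        if step == len(funnel_def):
            break                      # every step already reached
        if event == funnel_def[step]:
            reached[step] = 1
            step += 1
    return reached

# Hypothetical funnel e1 -> e2 -> e5 applied to the session row shown above.
result = funnel(["e1", "e2", "e5"], ["e1", "e2", "e3", "e4"])
```

Because the per-session event list is stored as a single array column, a function of this shape can be evaluated row by row, and the flags can then be summed per dimension to get step-wise funnel counts.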

Some Benchmarks - Benchto
● Input rows: 27.4M
● Query runtime improved by 30-35%

Query Complexity            Presto (with Alluxio)  Presto (with S3)
Light (sum)                 23 sec                 37 sec
Medium (HLL on one field)   44 sec                 63 sec
Heavy (HLL on multi field)  49 sec                 72 sec
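The per-query improvement can be recomputed from the table. The sketch below assumes "improved" means the runtime reduction relative to the S3 run, which the slide does not state explicitly.

```python
# (Presto with Alluxio, Presto with S3) runtimes in seconds, from the table above.
benchmarks = {
    "Light (sum)": (23, 37),
    "Medium (HLL on one field)": (44, 63),
    "Heavy (HLL on multi field)": (49, 72),
}

def improvement_pct(alluxio_sec, s3_sec):
    """Runtime reduction relative to the S3 baseline, in percent."""
    return 100.0 * (s3_sec - alluxio_sec) / s3_sec

improvements = {name: round(improvement_pct(a, s), 1)
                for name, (a, s) in benchmarks.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct}% faster")
```

Under this definition the medium and heavy queries fall inside the quoted 30-35% range, and the light query comes out slightly above it.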

Learnings
Presto
● Network bottlenecks: addressed by using a 10 Gbps line
● Enabling disk spill
ORC Optimizations
● Binning and Sorting data
● Limiting number of files
● Stripe Size adherence
Alluxio
● Deterministic Hash Policy for reads from UnderFS
● Disabling passive cache
● Round Robin Policy for writes

Inflight
● Cache Management
● Prefetch Enhancements
○ Different Sources/Sinks
● Query Triaging
● Apache Atlas Integration
● Dedicated Metric meta-store

Down the road
● Compute Engines
○ Hive on Spark
○ Spark
● Caching intelligently
● Different Key-Store evaluation
