2. Repertoire
Myntra - What do we do?
● Started as a customisation company in 2007; Myntra is now the
largest fashion e-tailer in India 1, 2
● Acquired Jabong in 2016 to become India’s largest fashion
platform
● 60M+ app downloads for Myntra and Jabong apps on
Google Play store
● Myntra+Jabong list over 9M items for sale
● EORS (End of Reason Sale) is the flagship sale event, with over
2.8M orders fulfilled within 7 days
● Focus on innovation with AI, AR/VR and omni-channel
products
1 Estimated based on publicly available numbers and research reports for FY 2018
2 For core Fashion categories of Apparel, Accessories & footwear for FY 2018
[Chart: FY18 online share in fashion (see note 2)]
3. Repertoire
Tech at Myntra
● Key tech focus areas for Myntra + Jabong
○ Storefront: apps and web platform
○ Supply chain: end-to-end inventory & order management
○ Data tech: powering data-based insights and intelligent automation in all business
areas
● Data tech covers all the sources and consumers of data within Myntra+Jabong
● Data sources include
○ Streaming data from apps and IoT devices
○ Content and campaign management systems for storefront apps
○ Transactional data from supply chain systems like order management (OMS),
warehouse and logistics management (WMS and LMS)
● Data is processed and served in both realtime and batch modes
● Consumers of data include reports/dashboards, tech products and data science
models
5. Repertoire
Challenges
● Tiered SLAs
○ Low Latency Data serving
○ What to cache & How to cache
● Compute
○ Roll-ups and drill-downs on the fly
● Multi-modal
○ Support for key-value and SQL-type queries
6. Repertoire
Challenges
● Query Triaging
○ Execution based on SLAs
● Low Latency Ingestion
○ Low ingestion overhead for real-time & batch data
● Fault tolerance and NFRs
○ Availability, horizontal scalability, isolation
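One way to picture "execution based on SLAs" is a queue that always releases the query whose deadline is closest. The sketch below is only an illustration under that assumption; the class names, query names and SLA values are hypothetical and do not describe Myntra's scheduler.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Query:
    # Only the deadline takes part in ordering; the name is payload.
    deadline: float
    name: str = field(compare=False)


class SlaQueue:
    """Hypothetical triage queue: earliest deadline is executed first."""

    def __init__(self):
        self._heap = []

    def submit(self, name, arrival_sec, sla_sec):
        # Deadline = arrival time + the SLA promised for this query tier.
        heapq.heappush(self._heap, Query(arrival_sec + sla_sec, name))

    def next_query(self):
        return heapq.heappop(self._heap).name if self._heap else None


queue = SlaQueue()
queue.submit("nightly_report", arrival_sec=0, sla_sec=3600)
queue.submit("dashboard_tile", arrival_sec=5, sla_sec=2)
queue.submit("adhoc_analysis", arrival_sec=1, sla_sec=60)

order = [queue.next_query() for _ in range(3)]
print(order)
```

The low-latency dashboard query is released first even though it arrived last, which is the behaviour tiered SLAs call for.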
7. Repertoire
Open Source Solutions
● Apache Ignite
○ Pros:
■ Indexes
■ Disk backed Cache
○ Cons:
■ Batch Ingestion
■ Uncompressed Data in Cache
● Presto on S3
○ Pros:
■ Stability
■ Out of the Box
○ Cons:
■ No data co-locality
■ Movement to Azure
● Spark on Alluxio
○ Pros:
■ Data co-locality
■ In-memory Cache
○ Cons:
■ No fixed SLAs
■ Concurrency
● Presto on Alluxio
○ Pros:
■ Data co-locality
■ Consistent query SLAs
○ Cons:
■ No in-memory cache
■ Limited ML support
10. Repertoire
Reference Example
Distribution of sessions by Operating System (OS), City and Gender based on an
event type.
SQL Representation
SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
FROM events WHERE event_type = 'addToCart'
GROUP BY os, city, gender;
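For comparison, the same distribution can be computed exactly by keeping one set of session IDs per (os, city, gender) group. The sketch below uses hypothetical sample events with the field names from the query; it shows why the exact approach needs memory proportional to the number of distinct sessions.

```python
from collections import defaultdict

# Hypothetical sample events; field names follow the SQL query.
events = [
    {"session_id": "s1", "event_type": "addToCart", "os": "Android", "city": "Bengaluru", "gender": "F"},
    {"session_id": "s1", "event_type": "addToCart", "os": "Android", "city": "Bengaluru", "gender": "F"},
    {"session_id": "s2", "event_type": "addToCart", "os": "Android", "city": "Bengaluru", "gender": "F"},
    {"session_id": "s3", "event_type": "addToCart", "os": "iOS", "city": "Mumbai", "gender": "M"},
    {"session_id": "s4", "event_type": "productView", "os": "iOS", "city": "Mumbai", "gender": "M"},
]


def exact_session_distribution(events, event_type):
    """Exact distinct-session count per (os, city, gender) for one event type."""
    groups = defaultdict(set)
    for e in events:
        if e["event_type"] == event_type:
            groups[(e["os"], e["city"], e["gender"])].add(e["session_id"])
    return {key: len(sessions) for key, sessions in groups.items()}


result = exact_session_distribution(events, "addToCart")
print(result)
```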
21. Repertoire
HyperLogLogPlus
● Probabilistic cardinality estimation algorithm
● Why?
○ Approx. cardinality without O(N) memory
SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
FROM events WHERE event_type = 'addToCart'
GROUP BY os, city, gender;
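To make the merge-then-count idea concrete, here is a minimal sketch of plain HyperLogLog (not the HyperLogLogPlus variant: it omits the sparse representation and bias correction). The class, its `merge` and `cardinality` methods, and the choice of SHA-1 as the hash are assumptions made for illustration; they only loosely mirror `hll_merge` and `hll_cardinality` in the query.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative plain HyperLogLog with 2**p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging sketches is an element-wise max over registers.
        assert self.p == other.p
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


# Two partial sketches with overlapping sessions, then a merged estimate.
part_a, part_b, union = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    part_a.add(f"session-{i}")
for i in range(4000, 10000):
    part_b.add(f"session-{i}")
for i in range(0, 10000):
    union.add(f"session-{i}")

merged = part_a.merge(part_b)
print(round(merged.cardinality()))
```

Memory stays fixed at 2**p small registers however many sessions are added, and partial sketches can be merged after the fact, which is what makes pre-aggregated roll-ups possible.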
23. Repertoire
Event Agnostic Aggregates
● Read only required data/event(s)
● Partition by events?
○ Too many small files
● Global sort?
○ Too expensive
● Bloom filters?
○ Not supported by Presto
● Localize data and sort within partitions!
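The benefit of sorting within a partition can be illustrated with a small simulation: if each stripe records the minimum and maximum event type it contains, a reader can skip stripes whose range cannot match the filter. This is a simplified model of min/max statistics, not ORC's actual reader; the event types, row counts and stripe size are hypothetical.

```python
import random


def build_stripes(rows, rows_per_stripe):
    """Split rows into fixed-size stripes and record min/max event_type for each."""
    stripes = []
    for i in range(0, len(rows), rows_per_stripe):
        chunk = rows[i:i + rows_per_stripe]
        types = [r[0] for r in chunk]
        stripes.append({"min": min(types), "max": max(types), "rows": chunk})
    return stripes


def stripes_to_read(stripes, event_type):
    """Count stripes whose [min, max] range could contain the event type."""
    return sum(1 for s in stripes if s["min"] <= event_type <= s["max"])


random.seed(7)
event_types = ["addToCart", "productView", "search", "wishlist"]
rows = [(random.choice(event_types), f"session-{i}") for i in range(4000)]

unsorted_stripes = build_stripes(rows, 500)
sorted_stripes = build_stripes(sorted(rows), 500)  # sort within the partition

unsorted_reads = stripes_to_read(unsorted_stripes, "search")
sorted_reads = stripes_to_read(sorted_stripes, "search")
print(unsorted_reads, sorted_reads)
```

With unsorted data every stripe spans the full range of event types, so none can be skipped; after sorting, only the stripes covering the requested event type need to be read.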
24. Repertoire
ORC Optimizations
● Sorting:
○ bin partitioner
○ sort within partition
● File size ~1GB (keeps the number of files small)
● Stripe size ~64MB
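The sizing targets translate into simple arithmetic. The helper below is a hypothetical sketch that assumes files are filled to roughly the 1GB target; the 50GB partition size is an invented example, not a figure from the talk.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def orc_layout(total_bytes, target_file_bytes=1 * GB, stripe_bytes=64 * MB):
    """Return (number of files, stripes per full file) for the given targets."""
    files = max(1, math.ceil(total_bytes / target_file_bytes))
    stripes_per_file = target_file_bytes // stripe_bytes
    return files, stripes_per_file


# Hypothetical 50GB partition written with ~1GB files and ~64MB stripes.
files, stripes_per_file = orc_layout(50 * GB)
print(files, stripes_per_file)
```

A ~1GB file with ~64MB stripes holds 16 stripes, so a reader that can skip stripes still has a useful unit of pruning inside each file.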
26. Repertoire
Some Benchmarks - Benchto
● Input rows: 27.4M
● Query runtime improved by 30-35%

Query Complexity            | Presto (with Alluxio) | Presto (with S3)
Light (sum)                 | 23 sec                | 37 sec
Medium (HLL on one field)   | 44 sec                | 63 sec
Heavy (HLL on multi field)  | 49 sec                | 72 sec
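The per-query improvement can be recomputed from the table as (S3 time - Alluxio time) / S3 time. The snippet below is plain arithmetic on the reported runtimes; the function name is only illustrative.

```python
# Runtimes in seconds from the benchmark table: (Presto on Alluxio, Presto on S3).
benchmarks = {
    "Light (sum)": (23, 37),
    "Medium (HLL on one field)": (44, 63),
    "Heavy (HLL on multi field)": (49, 72),
}


def improvement_pct(alluxio_sec, s3_sec):
    """Runtime reduction relative to the S3 baseline, in percent."""
    return round(100 * (s3_sec - alluxio_sec) / s3_sec, 1)


improvements = {name: improvement_pct(a, s) for name, (a, s) in benchmarks.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct}%")
```

This gives roughly 30% and 32% for the two HLL queries and about 38% for the light query, so the light query lands slightly above the summarized 30-35% range.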
27. Repertoire
Learnings
Presto
● Network bottlenecks: addressed by using a 10 Gbps line
● Enabling disk spills
ORC Optimizations
● Binning and Sorting data
● Limiting number of files
● Stripe Size adherence
Alluxio
● Deterministic Hash Policy for reads from UnderFS
● Disabling passive cache
● Round Robin Policy for writes
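The intuition behind the two Alluxio placement choices can be sketched as follows. This is not Alluxio's implementation or API; the worker names, block IDs and helper functions are hypothetical. A deterministic hash sends every read of a given block to the same worker, so the block is fetched from the under store and cached once, while round robin spreads written blocks evenly across workers.

```python
import hashlib
from itertools import cycle

WORKERS = ["worker-0", "worker-1", "worker-2"]  # hypothetical worker names


def deterministic_worker(block_id, workers=WORKERS):
    """Map a block to a worker using a stable hash of its ID."""
    digest = hashlib.md5(block_id.encode()).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]


def round_robin_assign(block_ids, workers=WORKERS):
    """Assign blocks to workers in rotation, as a write-placement sketch."""
    ring = cycle(workers)
    return {b: next(ring) for b in block_ids}


blocks = [f"events.orc#block-{i}" for i in range(6)]

# Two independent readers resolve the same block to the same worker.
read_plan_a = {b: deterministic_worker(b) for b in blocks}
read_plan_b = {b: deterministic_worker(b) for b in blocks}

# Writes are spread evenly: two blocks per worker here.
write_plan = round_robin_assign(blocks)
print(read_plan_a == read_plan_b, write_plan)
```

Because repeated reads of a block always resolve to the same worker, there is less reason for other workers to cache their own copies, which fits with disabling passive caching.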