Repertoire
Myntra’s Data Serving Platform
Deepak Batra
Nishant Sharma
Rijo Joseph
Myntra - What do we do?
● Started as a customisation company in 2007, Myntra is the largest fashion e-tailer in India today [1, 2]
● In 2016, Myntra acquired Jabong to become India's largest fashion platform
● 60M+ app downloads for the Myntra and Jabong apps on the Google Play store
● Myntra+Jabong list over 9M items for sale
● EORS (End of Reason Sale) is the flagship sale event, with over 2.8M orders, which are fulfilled within 7 days
● Focus on innovation with AI, AR/VR and omni-channel products
[1] Estimated based on publicly available numbers and research reports for FY 2018
[2] For the core fashion categories of apparel, accessories and footwear, FY 2018
[Figure: FY18 online share in fashion [2]]
Tech at Myntra
● Key tech focus areas for Myntra + Jabong
○ Storefront: apps and web platform
○ Supply chain: end-to-end inventory & order management
○ Data tech: powering data-based insights and intelligent automation in all business areas
● Data tech covers all the sources and consumers of data within Myntra+Jabong
● Data sources include
○ Streaming data from apps and IoT devices
○ Content and campaign management systems for storefront apps
○ Transactional data from supply chain systems like order management (OMS),
warehouse and logistics management (WMS and LMS)
● Data is processed and served in both real-time and batch modes
● Consumers of data include reports/dashboards, tech products and data science
models
Data Tech @ Myntra
Challenges
● Tiered SLAs
○ Low Latency Data serving
○ What to cache & How to cache
● Compute
○ Roll-ups and drill-downs on the fly
● Multi-modal
○ Support for key-value and SQL-type queries
Challenges
● Query Triaging
○ Execution based on SLAs
● Low Latency Ingestion
○ Low ingestion overhead for real time & batch data
● Fault tolerance and non-functional requirements (NFRs)
○ Availability, horizontal scalability, isolation
Open Source Solutions
● Apache Ignite
○ Pros:
■ Indexes
■ Disk backed Cache
○ Cons:
■ Batch Ingestion
■ Uncompressed Data in Cache
● Presto on S3
○ Pros:
■ Stability
■ Out of the Box
○ Cons:
■ No data co-locality
■ Movement to Azure
● Spark on Alluxio
○ Pros:
■ Data co-locality
■ In-memory Cache
○ Cons:
■ No fixed SLAs
■ Concurrency
● Presto on Alluxio
○ Pros:
■ Data co-locality
■ Consistent query SLAs
○ Cons:
■ No in-memory cache
■ Limited ML support
Architecture
[Figure: architecture diagram]
Reference Example
Distribution of sessions by Operating System (OS), City and Gender based on an
event type.
SQL Representation
SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
FROM events WHERE event_type = 'addToCart'
GROUP BY os, city, gender;
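To make the semantics of the query concrete, here is a minimal Python sketch that computes the same grouping exactly, using one set per group instead of a HyperLogLog sketch. The event dictionaries, the sample values and the function name `sessions_by_dims` are illustrative assumptions and are not part of Repertoire.

```python
from collections import defaultdict

def sessions_by_dims(events, event_type):
    """Exact distinct-session count per (os, city, gender) for one event type."""
    groups = defaultdict(set)
    for e in events:
        if e["event_type"] == event_type:
            groups[(e["os"], e["city"], e["gender"])].add(e["session_id"])
    return {dims: len(sessions) for dims, sessions in groups.items()}

# Hypothetical sample rows, for illustration only.
sample_events = [
    {"os": "android", "city": "Bengaluru", "gender": "F", "session_id": "s1", "event_type": "addToCart"},
    {"os": "android", "city": "Bengaluru", "gender": "F", "session_id": "s1", "event_type": "addToCart"},
    {"os": "android", "city": "Bengaluru", "gender": "F", "session_id": "s2", "event_type": "addToCart"},
    {"os": "ios", "city": "Mumbai", "gender": "M", "session_id": "s3", "event_type": "addToCart"},
    {"os": "ios", "city": "Mumbai", "gender": "M", "session_id": "s4", "event_type": "search"},
]

counts = sessions_by_dims(sample_events, "addToCart")
```

The exact version keeps every session id in memory, which is the cost the approximate `hll_merge` / `hll_cardinality` aggregates in the query are meant to avoid.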
Metric Meta Store
Information about datasets and their storage
Constructs
● Namespaces
● Cubes
● Pre-fetcher
● Cache Manager
● Caching Policy
Prefetcher Service
Availability- or schedule-based dataset caching
● Extract smaller datasets and cache them
Constructs
● Sources
● Transformations
● Fetch Frequency
● Cache Level
● Sinks
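The constructs listed above could be grouped into a single job definition. The sketch below is purely hypothetical: the class name `PrefetchJob`, its field names, the default values and the `is_due` helper are assumptions made for illustration, since the slides do not show the actual Prefetcher schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrefetchJob:
    """Hypothetical description of one prefetch job (not the real schema)."""
    source: str                                   # where the data is read from
    transformations: List[str] = field(default_factory=list)  # steps that shrink the dataset
    fetch_frequency_minutes: int = 60             # how often the job should run
    cache_level: str = "MEM"                      # assumed label for the target cache tier
    sinks: List[str] = field(default_factory=list)  # where the extracted dataset is written

    def is_due(self, minutes_since_last_run: int) -> bool:
        """Return True when the job should run again under its fetch frequency."""
        return minutes_since_last_run >= self.fetch_frequency_minutes

job = PrefetchJob(
    source="events",
    transformations=["filter event_type = 'addToCart'", "aggregate by os, city, gender"],
    fetch_frequency_minutes=30,
    cache_level="MEM",
    sinks=["alluxio"],
)
```

A scheduler would then only need to call `job.is_due(...)` to decide when to refresh the cached extract.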
Query Flow
[Figure: query flow diagram]
Prefetch Flow
[Figure: prefetch flow diagram]
Alluxio
● Open-source virtual distributed file system
● Memory-centric architecture
Alluxio
● Data Locality and short-circuit
● Tiered Storage
● Multiple caching policies: LRU, LRFU, FIFO
● Pluggable under storage
● Pin/unpin data
Performance tuning:
● Read location policy: DeterministicHashPolicy
● Disabled passive cache
● Write location policy: RoundRobinPolicy
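The idea behind a deterministic hash read policy can be sketched in a few lines: every client maps a given block to the same worker, so a cold read from under storage populates one cache copy instead of one copy per reader. This is an illustration of the concept only, not Alluxio's implementation; the function name `pick_worker`, the worker names and the block id format are assumptions.

```python
import hashlib

def pick_worker(block_id, workers):
    """Map a block to one worker deterministically by hashing its id."""
    digest = hashlib.md5(str(block_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]

workers = ["worker-1", "worker-2", "worker-3"]  # hypothetical worker names

# Two independent readers asking for the same block resolve to the same worker.
first = pick_worker("events/part-0001.orc#0", workers)
second = pick_worker("events/part-0001.orc#0", workers)
```

A round-robin write policy, by contrast, spreads newly written blocks evenly across workers regardless of their identity.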
Repertoire
HyperLogLogPlus
● Probabilistic cardinality estimation algorithm
● Why?
○ Approximate cardinality without O(N) memory
SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
FROM events WHERE event_type = 'addToCart'
GROUP BY os, city, gender;
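The merge-then-estimate pattern used by `hll_merge` and `hll_cardinality` can be illustrated with a small dense HyperLogLog in Python. This is a simplified sketch of the plain algorithm, assuming a SHA-1 based 64-bit hash and a default precision of 12. It omits the sparse representation and bias correction that HyperLogLogPlus adds, and the class name `HLL` is hypothetical rather than the UDAF code used in Repertoire.

```python
import hashlib
import math

class HLL:
    """Minimal dense HyperLogLog (no sparse mode, no bias correction)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: top p bits choose the register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(value).encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of two sketches is the register-wise maximum.
        assert self.p == other.p
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)              # valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)            # small-range correction
        return raw

# Two partial sketches over overlapping session ids.
left, right = HLL(), HLL()
for i in range(6000):
    left.add(f"session-{i}")
for i in range(4000, 10000):
    right.add(f"session-{i}")

merged = left.merge(right)
estimate = merged.cardinality()   # true distinct count is 10000
```

Because merging is a register-wise maximum, partial sketches built per file or per partition can be combined in any order, which is what makes the aggregate suitable for roll-ups.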
HyperLogLogPlus
● Precision parameters
○ p: tunes accuracy in dense mode
○ sp: controls precision in sparse mode
● Relative accuracy: 1.054 / sqrt(2^p)
● Spark and Presto UDAFs
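As a worked example of the accuracy formula quoted above, the following sketch evaluates 1.054 / sqrt(2^p) for a few precision values. The choice of p values is arbitrary and only for illustration.

```python
import math

def relative_accuracy(p):
    """Relative accuracy for precision p, using the formula from the slide."""
    return 1.054 / math.sqrt(2 ** p)

def register_count(p):
    """Number of registers a dense sketch keeps at precision p."""
    return 2 ** p

# Error in percent for a few illustrative precisions.
error_pct = {p: round(relative_accuracy(p) * 100, 2) for p in (10, 12, 14, 16)}
```

For p = 14 this gives roughly 0.82% error with 16384 registers. Each increase of p by 2 halves the error and quadruples the register count.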
Event Agnostic Aggregates
● Read only required data/event(s)
● Partition by events?
○ Too many small files
● Global sort?
○ Too expensive
● Bloom filters?
○ Not supported by Presto
● Localize data and sort within partitions!
ORC Optimizations
● Sorting:
○ Bin partitioner
○ Sort within partition
● File size ~1 GB, which keeps the number of files low
● Stripe size ~64 MB
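The effect of binning and sorting within partitions can be sketched in plain Python: rows are hashed into a fixed number of bins (files), each bin is sorted by event type, and per-stripe min/max values let a reader skip stripes that cannot contain the requested event. This is a toy model under stated assumptions. The function names, the CRC32-based binning and the tiny stripe size are illustrative, and real ORC readers use stripe and row-group statistics rather than this in-memory structure.

```python
import zlib
from collections import defaultdict

def bin_and_sort(rows, num_bins, key="event_type"):
    """Hash rows into a fixed number of bins (files), then sort each bin by key."""
    bins = defaultdict(list)
    for row in rows:
        bins[zlib.crc32(row[key].encode("utf-8")) % num_bins].append(row)
    return [sorted(bin_rows, key=lambda r: r[key]) for bin_rows in bins.values()]

def to_stripes(sorted_rows, rows_per_stripe, key="event_type"):
    """Cut a sorted file into stripes and record min/max of the key per stripe."""
    stripes = []
    for start in range(0, len(sorted_rows), rows_per_stripe):
        chunk = sorted_rows[start:start + rows_per_stripe]
        stripes.append({"min": chunk[0][key], "max": chunk[-1][key], "rows": chunk})
    return stripes

def scan(files, wanted, key="event_type"):
    """Read only stripes whose [min, max] range can contain the wanted key."""
    read = skipped = 0
    hits = []
    for stripes in files:
        for stripe in stripes:
            if stripe["min"] <= wanted <= stripe["max"]:
                read += 1
                hits.extend(r for r in stripe["rows"] if r[key] == wanted)
            else:
                skipped += 1
    return hits, read, skipped

# Hypothetical data: 5 event types, 100 interleaved rows each.
event_types = ["addToCart", "search", "pdpView", "checkout", "login"]
rows = [{"event_type": t, "session_id": f"s{i}"} for i in range(100) for t in event_types]

files = [to_stripes(f, rows_per_stripe=50) for f in bin_and_sort(rows, num_bins=2)]
hits, read, skipped = scan(files, "addToCart")
```

With this data the scan reads 2 of the 10 stripes and skips the other 8, while the number of files stays fixed at the number of bins instead of growing with the number of event types.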
Funnel Analysis
Funnel Aggregate
def funnel(funnel_def, events_list) => [1, 1, 0]
device_id  session_id  dim1  dim2  events
d1         s1          v1    v2    [e1,e2,e3,e4...]
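One plausible reading of the `funnel` signature above is an ordered match: step i counts as reached only if steps 0 to i-1 were reached earlier in the session's event list. The sketch below implements that interpretation. The matching rule and the example funnel definition are assumptions, since the slide only shows the signature and a sample output.

```python
def funnel(funnel_def, events_list):
    """Return one 0/1 flag per funnel step, set when the step is reached in order."""
    flags = [0] * len(funnel_def)
    step = 0
    for event in events_list:
        # Advance only when the next expected step appears.
        if step < len(funnel_def) and event == funnel_def[step]:
            flags[step] = 1
            step += 1
    return flags

# Hypothetical funnel e1 -> e3 -> e9 applied to the sample session's events.
result = funnel(["e1", "e3", "e9"], ["e1", "e2", "e3", "e4"])  # [1, 1, 0]
```

Summing these flag vectors across sessions, grouped by the dimension columns, would give the count of sessions reaching each funnel step.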
Some Benchmarks - Benchto
● Input rows: 27.4 M
● Query runtime improved by 30-35%
Query complexity            Presto (with Alluxio)   Presto (with S3)
Light (sum)                 23 sec                  37 sec
Medium (HLL on one field)   44 sec                  63 sec
Heavy (HLL on multi field)  49 sec                  72 sec
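The per-query improvement can be recomputed from the table as the reduction in runtime relative to Presto on S3. The helper name `improvement_pct` is an illustrative choice. The timings are the ones reported above.

```python
# (Presto with Alluxio, Presto with S3) runtimes in seconds, from the table above.
benchmarks = {
    "Light (sum)": (23, 37),
    "Medium (HLL on one field)": (44, 63),
    "Heavy (HLL on multi field)": (49, 72),
}

def improvement_pct(alluxio_sec, s3_sec):
    """Runtime reduction of the Alluxio setup relative to S3, in percent."""
    return round((s3_sec - alluxio_sec) / s3_sec * 100, 1)

improvements = {name: improvement_pct(a, s) for name, (a, s) in benchmarks.items()}
```

This gives about 37.8% for the light query, 30.2% for the medium query and 31.9% for the heavy query, so the medium and heavy queries fall inside the quoted 30-35% range and the light query sits slightly above it.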
Learnings
Presto
● Network Bottlenecks: Using 10Gbps line
● Enabling Disk spills
ORC Optimizations
● Binning and Sorting data
● Limiting number of files
● Stripe Size adherence
Alluxio
● Deterministic Hash Policy for reads from UnderFS
● Disabling passive cache
● Round Robin Policy for writes
Inflight
● Cache Management
● Prefetch Enhancements
○ Different Sources/Sinks
● Query Triaging
● Apache Atlas Integration
● Dedicated Metric meta-store
Down the road
● Compute Engines
○ Hive on Spark
○ Spark
● Caching intelligently
● Evaluation of different key-value stores
Thank You

SHRMPro HRMS Software Solutions Presentation
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

Presto Bangalore Meetup1 Repertoire@Myntra

  • 1. Repertoire: Myntra’s Data Serving Platform
    Deepak Batra, Nishant Sharma, Rijo Joseph
  • 2. Myntra - What do we do?
    ● Started as a customisation company in 2007, Myntra is the largest fashion e-tailer in India today (1, 2)
    ● In 2016, acquired Jabong to become India’s largest fashion platform
    ● 60M+ app downloads for the Myntra and Jabong apps on the Google Play store
    ● Myntra+Jabong list over 9M items for sale
    ● EORS (End of Reason Sale) is the flagship sale event, with over 2.8M orders, which are fulfilled within 7 days
    ● Focus on innovation with AI, AR/VR and omni-channel based products
    (1) Estimated based on publicly available numbers and research reports for FY 2018
    (2) For core fashion categories of apparel, accessories & footwear for FY 2018
    Figure: FY18 online share in fashion (2)
  • 3. Tech at Myntra
    ● Key tech focus areas for Myntra + Jabong
      ○ Storefront: apps and web platform
      ○ Supply chain: end-to-end inventory & order management
      ○ Data tech: powering data-based insights and intelligent automation in all business areas
    ● Data tech covers all the sources and consumers of data within Myntra+Jabong
    ● Data sources include
      ○ Streaming data from apps and IoT devices
      ○ Content and campaign management systems for storefront apps
      ○ Transactional data from supply chain systems like order management (OMS), warehouse and logistics management (WMS and LMS)
    ● Data is processed and served in both realtime and batch modes
    ● Consumers of data include reports/dashboards, tech products and data science models
  • 5. Challenges
    ● Tiered SLAs
      ○ Low latency data serving
      ○ What to cache & how to cache
    ● Compute
      ○ Roll-ups and drill-downs on the fly
    ● Multi-modal
      ○ Support for key-value and SQL type queries
  • 6. Challenges
    ● Query Triaging
      ○ Execution based on SLAs
    ● Low Latency Ingestion
      ○ Low ingestion overhead for real time & batch data
    ● Fault tolerance and NFRs
      ○ Availability, horizontal scalability, isolation
  • 7. Open Source Solutions
    ● Apache Ignite
      ○ Pros:
        ■ Indexes
        ■ Disk backed Cache
      ○ Cons:
        ■ Batch Ingestion
        ■ Uncompressed Data in Cache
    ● Presto on S3
      ○ Pros:
        ■ Stability
        ■ Out of the Box
      ○ Cons:
        ■ No data co-locality
        ■ Movement to Azure
    ● Spark on Alluxio
      ○ Pros:
        ■ Data co-locality
        ■ In-memory Cache
      ○ Cons:
        ■ No fixed SLAs
        ■ Concurrency
    ● Presto on Alluxio
      ○ Pros:
        ■ Data co-locality
        ■ Consistent query SLAs
      ○ Cons:
        ■ No in-memory cache
        ■ Limited ML support
  • 10. Reference Example
    Distribution of sessions by Operating System (OS), City and Gender based on an event type.
    SQL Representation:
      SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
      FROM events
      WHERE event_type = 'addToCart'
      GROUP BY os, city, gender;
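To make the intent of the reference query concrete, here is a minimal Python sketch of the same grouping computed exactly, using a set of session ids per (os, city, gender) group. The function name and the sample rows are hypothetical and not part of the talk; the slide's query replaces the exact set with an approximate HyperLogLog sketch.

```python
from collections import defaultdict

def sessions_by_dimensions(events, event_type):
    """Exact distinct-session count per (os, city, gender) for one event type."""
    groups = defaultdict(set)
    for e in events:
        if e["event_type"] == event_type:
            groups[(e["os"], e["city"], e["gender"])].add(e["session_id"])
    return {dims: len(sessions) for dims, sessions in groups.items()}

# Hypothetical sample rows, shaped like the columns named in the query.
events = [
    {"os": "android", "city": "Bangalore", "gender": "F", "session_id": "s1", "event_type": "addToCart"},
    {"os": "android", "city": "Bangalore", "gender": "F", "session_id": "s1", "event_type": "addToCart"},
    {"os": "android", "city": "Bangalore", "gender": "F", "session_id": "s2", "event_type": "addToCart"},
    {"os": "ios", "city": "Delhi", "gender": "M", "session_id": "s3", "event_type": "addToCart"},
    {"os": "ios", "city": "Delhi", "gender": "M", "session_id": "s4", "event_type": "search"},
]

result = sessions_by_dimensions(events, "addToCart")
print(result)
```

The exact version keeps every session id in memory, which is the cost the HyperLogLog-based query is meant to avoid.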
  • 12. Metric Meta Store
    Information about datasets and their storage
    Constructs
    ● Namespaces
    ● Cubes
    ● Pre-fetcher
    ● Cache Manager
    ● Caching Policy
  • 14. Prefetcher Service
    Availability/scheduling based dataset caching
    ● Extract smaller datasets and cache
    Constructs
    ● Sources
    ● Transformations
    ● Fetch Frequency
    ● Cache Level
    ● Sinks
  • 18. Alluxio
    ● Open-source virtual distributed file system
    ● Memory-centric architecture
  • 20. Alluxio
    ● Data locality and short-circuit
    ● Tiered storage
    ● Multiple caching policies: LRU, LRFU, FIFO
    ● Pluggable under storage
    ● Pin/unpin data
    Performance tuning:
    ● Read location policy: DeterministicHashPolicy
    ● Disabled passive cache
    ● Write location policy: RoundRobinPolicy
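The two location policies named above can be illustrated with a small Python sketch. This is not Alluxio code and does not use Alluxio's API; the worker names and helper functions are hypothetical, and the sketch only shows the general idea of each policy: a deterministic hash sends every read of a given block to the same worker, while round robin spreads successive writes evenly.

```python
import zlib
from collections import Counter
from itertools import cycle

WORKERS = ["worker-0", "worker-1", "worker-2"]  # hypothetical worker names

def pick_read_worker(block_id: int, workers=WORKERS) -> str:
    """Deterministic hash: a given block id always maps to the same worker."""
    digest = zlib.crc32(str(block_id).encode("utf-8"))
    return workers[digest % len(workers)]

def round_robin_writer(workers=WORKERS):
    """Round robin: successive writes rotate through the workers in order."""
    return cycle(workers)

# Repeated reads of one block land on a single worker, so only that
# worker needs to fetch and cache the block from the under storage.
reads = [pick_read_worker(42) for _ in range(5)]
print(set(reads))

# Six writes over three workers end up evenly distributed.
writer = round_robin_writer()
writes = Counter(next(writer) for _ in range(6))
print(writes)
```

Under this reading, the deterministic read policy limits how many workers hold a copy of the same block, and round-robin writes keep storage use balanced across workers.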
  • 21. HyperLogLogPlus
    ● Probabilistic cardinality estimation algorithm
    ● Why?
      ○ Approximate cardinality without O(N) memory
    SELECT os, city, gender, hll_cardinality(hll_merge(session_id))
    FROM events
    WHERE event_type = 'addToCart'
    GROUP BY os, city, gender;
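As a rough illustration of what merging sketches and reading a cardinality involve, here is a minimal plain HyperLogLog in Python. It is a sketch under stated assumptions: it is the basic algorithm with a linear-counting correction for small counts, not the HyperLogLogPlus variant and not the UDAFs from the talk, and the class and method names are hypothetical.

```python
import hashlib
import math

class HyperLogLog:
    """Basic HyperLogLog with 2^p registers (illustrative, not HLL++)."""

    def __init__(self, p: int = 12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 in this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # Take a 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Merging is a register-wise max, so sketches combine without the raw data.
        if other.p != self.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting for small counts
        return raw

# Two partial sketches with overlapping session ids; the true union has 10,000 ids.
a, b = HyperLogLog(), HyperLogLog()
for i in range(6000):
    a.add(f"s{i}")
for i in range(4000, 10000):
    b.add(f"s{i}")
union = a.merge(b)
print(round(union.cardinality()))
```

Each sketch uses a fixed 2^p registers regardless of how many session ids it has seen, and overlapping ids are not double counted after a merge.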
  • 22. HyperLogLogPlus
    ● Precision parameters
      ○ P: tunes accuracy in dense mode
      ○ SP: controls sparse mode
    ● Relative accuracy: 1.054 / sqrt(2^P)
    ● Spark and Presto UDAF
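The relative accuracy formula on this slide can be evaluated directly to see how the dense-mode precision trades register count for error. A small Python sketch, where the function name is hypothetical and the constant 1.054 is taken from the slide as given:

```python
import math

def hll_relative_error(p: int) -> float:
    """Relative accuracy as quoted on the slide: 1.054 / sqrt(2^p)."""
    if p < 0:
        raise ValueError("p must be non-negative")
    return 1.054 / math.sqrt(2 ** p)

# Each +2 on p quadruples the register count and halves the error.
for p in (10, 12, 14, 16):
    print(f"p={p:2d}  registers={2 ** p:6d}  relative error={hll_relative_error(p):.3%}")
```

For example, p = 14 gives 16,384 registers and a relative error of roughly 0.8%.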
  • 23. Event Agnostic Aggregates
    ● Read only required data/event(s)
    ● Partition by events?
      ○ Too many small files
    ● Global sort?
      ○ Too expensive
    ● Bloom filters?
      ○ Not supported by Presto
    ● Localize data and sort within partitions!
  • 24. ORC Optimizations
    ● Sorting:
      ○ bin partitioner
      ○ sort within partition
    ● File size/no. of files: ~1GB
    ● Stripe size: ~64MB
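The "bin partitioner plus sort within partition" idea can be sketched in plain Python. This is an illustration under assumptions, not the talk's Spark job: the hash-based binning, the function names and the sample rows are all hypothetical. The point is that every row for a given event type lands in one bin and sits contiguously inside it, so a reader relying on ORC's min/max column statistics has a chance to skip unrelated stripes without a global sort.

```python
import zlib
from collections import defaultdict

def bin_of(event_type: str, num_bins: int) -> int:
    """Assign an event type to a fixed bin via a stable hash."""
    return zlib.crc32(event_type.encode("utf-8")) % num_bins

def bin_and_sort(rows, num_bins: int):
    """Group rows into bins by event type, then sort only within each bin."""
    bins = defaultdict(list)
    for row in rows:
        bins[bin_of(row["event_type"], num_bins)].append(row)
    return {b: sorted(rs, key=lambda r: r["event_type"]) for b, rs in bins.items()}

# Hypothetical events; a real job would write each bin out as one ORC file.
rows = [
    {"event_type": "addToCart", "session_id": "s1"},
    {"event_type": "search", "session_id": "s2"},
    {"event_type": "addToCart", "session_id": "s3"},
    {"event_type": "purchase", "session_id": "s4"},
    {"event_type": "search", "session_id": "s5"},
]

partitions = bin_and_sort(rows, num_bins=4)
for b, rs in sorted(partitions.items()):
    print(b, [r["event_type"] for r in rs])
```

Compared with partitioning by event type directly, the number of bins stays fixed, which avoids the "too many small files" problem listed above.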
  • 25. Funnel Analysis
    Funnel Aggregate
      def funnel(funnel_def, events_list) => [1, 1, 0]
    device_id | session_id | dim1 | dim2 | events
    d1        | s1         | v1   | v2   | [e1,e2,e3,e4...]
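The slide gives only the signature and a sample output for the funnel aggregate, so the following Python sketch is one plausible reading rather than the talk's implementation. It assumes a strictly ordered funnel: a step counts as reached only if it occurs after all earlier steps. The funnel definition `["e1", "e2", "e5"]` and the `funnel_totals` helper are hypothetical.

```python
from typing import List, Sequence

def funnel(funnel_def: Sequence[str], events_list: Sequence[str]) -> List[int]:
    """Return one 0/1 flag per funnel step, reached in order within the session."""
    flags = [0] * len(funnel_def)
    step = 0
    for event in events_list:
        if step == len(funnel_def):
            break  # every step has already been reached
        if event == funnel_def[step]:
            flags[step] = 1
            step += 1
    return flags

def funnel_totals(funnel_def: Sequence[str], sessions: Sequence[Sequence[str]]) -> List[int]:
    """Sum the per-session flags to get how many sessions reached each step."""
    totals = [0] * len(funnel_def)
    for events_list in sessions:
        for i, flag in enumerate(funnel(funnel_def, events_list)):
            totals[i] += flag
    return totals

print(funnel(["e1", "e2", "e5"], ["e1", "e2", "e3", "e4"]))
print(funnel_totals(["e1", "e2"], [["e1", "e2"], ["e1"], ["e2", "e1"]]))
```

Because each session row already carries its full event list, the per-row flags can be summed in a single aggregation pass per dimension.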
  • 26. Some Benchmarks - Benchto
    ● Input rows: 27.4M
    ● Query runtime improved by 30-35%
    Query Complexity           | Presto (with Alluxio) | Presto (with S3)
    Light (sum)                | 23 sec                | 37 sec
    Medium (HLL on one field)  | 44 sec                | 63 sec
    Heavy (HLL on multi field) | 49 sec                | 72 sec
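The improvement figure can be checked against the table with a few lines of Python. This assumes "improved" means the reduction in runtime relative to the S3 baseline, which the slide does not state explicitly; the function name is hypothetical and the timings are copied from the table.

```python
# (Presto with Alluxio, Presto with S3) runtimes in seconds, from the table.
benchmarks = {
    "Light (sum)": (23, 37),
    "Medium (HLL on one field)": (44, 63),
    "Heavy (HLL on multi field)": (49, 72),
}

def runtime_reduction_pct(alluxio_s: float, s3_s: float) -> float:
    """Percentage of the S3 runtime saved by reading through Alluxio."""
    return (s3_s - alluxio_s) / s3_s * 100

for name, (alluxio_s, s3_s) in benchmarks.items():
    print(f"{name}: {runtime_reduction_pct(alluxio_s, s3_s):.1f}%")
```

Under that definition the medium and heavy queries come out at about 30.2% and 31.9%, and the light query at about 37.8%, slightly above the quoted range.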
  • 27. Learnings
    Presto
    ● Network bottlenecks: using a 10Gbps line
    ● Enabling disk spills
    ORC Optimizations
    ● Binning and sorting data
    ● Limiting number of files
    ● Stripe size adherence
    Alluxio
    ● Deterministic Hash Policy for reads from UnderFS
    ● Disabling passive cache
    ● Round Robin Policy for writes
  • 28. Inflight
    ● Cache Management
    ● Prefetch Enhancements
      ○ Different Sources/Sinks
    ● Query Triaging
    ● Apache Atlas Integration
    ● Dedicated Metric meta-store
  • 29. Down the road
    ● Compute Engines
      ○ Hive on Spark
      ○ Spark
    ● Caching intelligently
    ● Different Key-Store evaluation