Building a Data Pipeline - Case studies
Amit Sharma, Director @NoBroker
Raam Baranidharan, Associate Director @Treebo
Jitendra Agrawal, VP Technology @ LendingKart
18th August 2018
Building data pipelines @NoBroker
- Amit Sharma, Director Engineering
Why Data Pipeline
• Business needs data.
• Analytics is computationally taxing.
• Data exists across multiple platforms:
• CRM
• Apps
• Calls
• 3rd-party webhooks
What is a Data Pipeline
• Moving, joining, and re-formatting data between systems.
• A data pipeline is the sum of all these steps.
• Its job is to ensure these steps happen reliably on all data.
Parts and Processes of a Data Pipeline
• Sources
• Joins
• Extractions
• Standardization/Corrections
• Loads
• Automation
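Putting these parts together, here is a minimal toy sketch in Python (an illustration only, not NoBroker's actual pipeline; the source names, fields and in-memory "warehouse" are all hypothetical):

```python
# Toy pipeline: extract from two hypothetical sources, standardize,
# join on a shared key, and load into a stand-in warehouse.

def extract():
    crm = [{"lead_id": 1, "city": "bangalore "}]      # CRM export
    calls = [{"lead_id": 1, "duration_sec": 42}]      # call-log export
    return crm, calls

def standardize(rows):
    # corrections: trim whitespace, normalize casing
    return [{**r, "city": r["city"].strip().title()} for r in rows]

def join(leads, calls):
    calls_by_id = {c["lead_id"]: c for c in calls}
    return [{**lead, **calls_by_id.get(lead["lead_id"], {})} for lead in leads]

def load(rows, sink):
    sink.extend(rows)                                 # stand-in for a warehouse write

warehouse = []
crm, calls = extract()
load(join(standardize(crm), calls), warehouse)
print(warehouse)  # [{'lead_id': 1, 'city': 'Bangalore', 'duration_sec': 42}]
```

In a real pipeline each step runs on a scheduler (the Automation part) and the sink is a warehouse, but the shape of the steps is the same.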
How We Use the Data Pipeline
[Architecture overview: NoBroker data pipeline]

Sources: Website, Mobile, CRM, Calls, User Footprint, Third-Party Data, Property Data

DATA HARVESTER + Prism (ML Engine) power:
- NB ESTIMATES: Rentometer, Prop Worth, Lifestyle Score, Commute Score
- SIA: Synthetic Language Generator
- OCULUS: Image Intelligence
- LITMUS: Text Profanity Engine
- MASS: Property Quality Estimates
- HURACAN: Lead Intelligence
- JARVIS: Automatic Speech Recognition Engine (NB Voice Asst.)
- TROUBLESHOOT: Sentiment Analysis Engine
- DEEP LEON: Large-Scale Deep Learning Framework

Inference systems: HOROSCOPE, QUICKSILVER, CLICK & EARN, SMART SALES, NB BOT, SMART FOLLOW-UPS, DEMAND-SUPPLY ANALYTICS, SMART GRIEVANCE REDRESSAL, VIGILANTE (work in progress)
Lessons from Building a Data Pipeline
• With analytics data, scale matters.
• One server is never enough.
• Once a data pipeline is the source of truth, reliability matters.
• Without enrichments, it’s hard to derive insights.
Any Questions?
Data World @ Treebo
A perspective & some musings
Sneh
Presentation flow
❖ GROUND 0 The initial 3 slides prepare the ground
❖ GET The following 3 slides set the broader context around -
➢ Problem Statement
➢ Some fundamentals around systems, storage & general considerations
❖ SET Intermediate 2 slides deep dive into -
➢ Different phases and the choices around AWS technology for them
❖ GO Concluding 2 slides talk about -
➢ The overall architecture put together
➢ Progressive thoughts
What is Data Strategy (DS)? .. ⅓
A set of techniques around the collection, storage and usage of data, such that the
data can serve not only the operations of a company, but also open up additional
monetisation avenues in the future.
A good DS has to be actionable and, at the same time, evolutionary to adjust to
disruptive market forces.
DS always has to be business-driven, never technology-driven.
Is DS worth it? .. ⅔
The data that we, as Treebo, own is a resource that has economic value and we expect
it to provide future benefit just like any other asset.
But, generally, data is an under-managed, underutilized asset because it doesn’t feature
in the company’s P&L book closing.
To look at it differently: just as we have a people-focused strategy to retain employees (our asset),
a data-focused strategy is required to retain good data (our asset, again)!
Without DS, we will be forced to deal with myriad data-related
initiatives taken up by various business groups.
High-level Framework for DS .. 3/3
❖ Planning and discovery.
Identify business objectives, key stakeholders & scope.
❖ Current state assessment.
Focus on business processes, data sources, technology stack & policies.
❖ Analysis, prioritization and roadmap.
Requirement analysis, criteria for prioritization & laying out the initiative roadmap.
❖ Change management.
Encompasses organizational change, cultural change, technology change, and changes in business processes.
Problem Statement .. GET ⅓
To design a highly scalable, highly available, low-latency data platform that can capture, move
and transform transactions with zero data loss, and that supports replay when required.
So, essentially, we need a system that is/has:
Highly Scalable; Highly Available; Low Latency; Zero data loss; Replay capability
Golden Rules .. GET ⅔
❖ Do not go distributed if the data size is small enough.
Any distributed system takes 10 years to debug; any database takes 10 years to debug; and
any distributed database is still being debugged!
❖ Do not go streaming if batches serve well.
The above two rules hold true for practically
all data initiatives.
Revisiting Some Fundamentals .. GET 3/3
❖ Processing: Scalability, Availability, Consistency, Latency, Durability, Disaster Recovery
❖ Source: RDBMSs & a touch-base with the above features
❖ Hosting: On-premise/Hybrid/Cloud
Different Phases .. SET ½
❖ Source
Logs, tools, open libraries, proprietary solutions
❖ Fetches/Transient storage
Ordering, delivery semantics, consistency guarantees, schema evolvability (see the sketch after this list)
❖ Processing
In-flight/at the destination, batch, (near) real time
❖ Destination
Hardware, SQL (No/New), columnar
❖ Cache
Optional!
❖ Visualisation
Various options suited to the use case
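To make the delivery-semantics point concrete, here is a minimal sketch using the kafka-python client (an assumed illustration, not Treebo's actual code; the topic, consumer group and handle() function are hypothetical). Committing offsets only after processing gives at-least-once delivery, so the destination must tolerate replays:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "booking-events",                 # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="warehouse-loader",      # hypothetical consumer group
    enable_auto_commit=False,         # we commit manually, below
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v),
)

def handle(event):
    ...                               # hypothetical downstream write

for message in consumer:
    handle(message.value)
    consumer.commit()                 # offset advances only after success:
                                      # a crash before this line means the
                                      # event is redelivered (at-least-once)
```

Auto-committing before processing would instead give at-most-once behaviour, trading potential data loss for no duplicates.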
Different phase choices .. SET 2/2
Treebo Architecture .. GO ½
Progressive Thoughts .. GO 2/2
❖ Append-only event logging for immutability (Kappa architecture)
❖ Ensure idempotency (sketch below)
❖ Custom checkpointing for better replay
❖ Specialised storage formats
❖ Data governance & workload management
❖ Transition higher up the analytics maturity value pyramid
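On the idempotency point, a self-contained toy sketch (stdlib sqlite3; the table and event shape are hypothetical, not Treebo's schema). With an append-only log plus idempotent writes, replaying the log is always safe:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (event_id TEXT PRIMARY KEY, status TEXT)")

def apply_event(event):
    # INSERT OR IGNORE makes the write idempotent: re-applying an
    # already-seen event_id is a no-op rather than a duplicate row.
    conn.execute(
        "INSERT OR IGNORE INTO bookings (event_id, status) VALUES (?, ?)",
        (event["id"], event["status"]),
    )
    conn.commit()

apply_event({"id": "evt-1", "status": "confirmed"})
apply_event({"id": "evt-1", "status": "confirmed"})   # replayed: no duplicate
print(conn.execute("SELECT COUNT(*) FROM bookings").fetchone())  # (1,)
```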
“I’m sure the highest-capacity storage device will not be enough to
record all our stories, because every time with you is very valuable data.”
Not really sure if this was said with reference to technology or a
love interest! :)
Q & A
Thank You
Data Pipeline @ LK
By Jitendra Agrawal
Types of data
Event stream
Basic Lambda Architecture
[Diagram: an event stream feeds a message queue, which fans out to a real-time processing (speed) layer and a batch processing layer; each layer serves queries and returns responses.]
Stream vs. batch
● Stream / speed layer (toy sketch after this list)
○ Processing - Apache Storm, Apache Spark, Apache Samza
○ Store - Elasticsearch, Druid, Spark SQL, other DBs
○ Usage
■ Live dashboards (potentially inaccurate)
● Counts, Averages
■ Rate limiting
■ Triggers for further action
● Batch
○ Immutable(?) store
■ HDFS
■ Cassandra
■ Event stream to S3
○ Batch processing and precomputation
○ Data warehouse - HBase, Hive, Redshift, Postgres
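A toy sketch of the split above (illustrative only, not LK's code): the speed layer maintains a fast, possibly double-counting view as events arrive, while the batch layer periodically recomputes exact counts from the immutable store:

```python
from collections import Counter

immutable_store = []        # append-only raw log (S3/HDFS in practice)
speed_view = Counter()      # live view: fast but may over-count on retries

def on_event(event):
    immutable_store.append(event)     # always persist the raw event
    speed_view[event["key"]] += 1     # incremental, potentially inaccurate

def batch_recompute():
    # exact precomputation: dedupe by event id, then count
    latest = {e["id"]: e for e in immutable_store}
    return Counter(e["key"] for e in latest.values())

on_event({"id": "e1", "key": "signup"})
on_event({"id": "e1", "key": "signup"})   # duplicate delivery
print(speed_view["signup"])               # 2 (inaccurate live count)
print(batch_recompute()["signup"])        # 1 (corrected by the batch layer)
```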
Database change logs
● MySQL
○ Row-level binlogs
○ Debezium -> Kafka (see the sketch below)
○ Before and after values
○ Handles database restarts / restreams data (duplicates)
● MongoDB
○ Oplog
○ Oplog reader -> Kafka
○ After values only
○ Handles database restarts
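As a sketch of consuming those change logs (kafka-python again; the topic name follows Debezium's server.database.table convention but is hypothetical here): Debezium's event envelope carries "before" and "after" row images plus an "op" code (c = create, u = update, d = delete), which is what makes audit trails possible:

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver1.app.loans",            # hypothetical Debezium topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for msg in consumer:
    if msg.value is None:             # tombstone emitted after a delete
        continue
    payload = msg.value.get("payload", msg.value)
    before = payload["before"]        # None for inserts
    after = payload["after"]          # None for deletes
    op = payload["op"]                # "c", "u", "d" (or "r" for snapshot reads)
    print(op, before, after)          # e.g. append to an audit-trail table
```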
Data @ LK
● Multiple self-hosted MySQL instances (Application)
● On-premise MySQL installation (Calls)
● MongoDB (Application)
● Mixpanel
● Facebook Ads
● Google Ads
● Mandrill
● A couple of terabytes and increasing rapidly
Motivation for considering a data warehouse
● Joins across multiple databases
● MySQL just can’t run some analytics queries
● Some of the ‘changes’ are not sent to Mixpanel as events
● A lot of questions are asked of the data retrospectively
Data warehouse inputs
● MySQL
○ Sync the current state of all databases to Redshift
○ Send all changes in tables to Kafka
■ Debezium
○ All before/after values for changes are stored in S3
○ S3 data is processed to create audit-trail tables
● MongoDB
○ Send all changes in collections to Kafka
■ Oplog reader
○ All changes are stored in S3
○ S3 data is processed to create a copy of MongoDB and an audit trail
● Store all changes
○ Filtering for duplicates can be done later (sketch below)
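A toy sketch of the "filter duplicates later" step (an assumed approach, not LK's actual job; the record shape is hypothetical): because restarts can restream changes, records landed in S3 are deduplicated by event id before audit-trail tables are built:

```python
def dedupe_changes(records):
    # the later occurrence of the same event_id wins; insertion order
    # of distinct ids is preserved (Python dicts keep insertion order)
    latest = {}
    for rec in records:               # records as read back from S3
        latest[rec["event_id"]] = rec
    return list(latest.values())

changes = [
    {"event_id": "e1", "op": "u", "after": {"status": "approved"}},
    {"event_id": "e1", "op": "u", "after": {"status": "approved"}},  # restreamed
    {"event_id": "e2", "op": "c", "after": {"status": "new"}},
]
assert len(dedupe_changes(changes)) == 2
```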
Lendingkart Architecture
Questions?