SlideShare a Scribd company logo
1 of 27
Download to read offline
Democratization of
Data
Why and how we built an internal data pipeline
platform @Indix
About me
Manoj Mahalingam
Principal Engineer @Indix
People
Documents Businesses
Places Products
Connected
Devices
Six Business Critical Indexes
Enabling businesses to build
location-aware software.
~3.6 million websites use Google maps
Enabling businesses to build
product-aware software.
Indix catalogs over 2.1 billion product offers
Indix - The “Google Maps” of Products
Crawling Pipeline
Data PipelineML
AggregateMatchStandardizeExtract AttributesClassifyDedupe
Parse
Crawl
Data
CrawlSeed
Brand & Retailer
Websites
Feeds Pipeline
Transform Clean Connect
Feed
Data
Brand & Retailer
Feeds
Indix Product
Catalog
Customizable
Feeds
Search &
Analytics
Index
Indexing PipelineReal Time
Index Analyze Derive Join
API
(Bulk &
Synchronous)
Product Data
Transformation
Service
Data Pipeline @Indix
Democratization
of Data
Enable everyone in the organization
to know what data is available, and
then understand and work with it.
At Indix, we have
and work with a lot
of data.
Scale of Data @ Indix
2.1
Billion
Product
URLs 8 TB
HTML Data
Crawled
Daily
1B
Unique
Products
7000
Categories
120 B
Price
Points
3000
Sites
● We have data in different
shapes and sizes.
● HTML pages, Thrift and
avro records.
● And also the usual suspects
- CSVs and plain text data.
● Datasets can be in TBs or a
few hundred KBs.
● Few billion records or a
couple of hundreds.
But...the data’s
potential couldn’t
be realized
Data wasn’t discoverable
● The biggest problem was in knowing what data exists and
where.
● Some of the data was in S3. Some in HDFS. Some in
Google sheets.
● There was no way to know how frequently and when the
data changed or updated.
The schema wasn’t readily
known
● The schema of the data, as expected, kept changing and it
was difficult to keep track of which version of data had
which schema.
● While Thrift and Avro alleviate this to an extent, access
to data wasn’t simple, especially for non-engineers.
Writing code limited scope
● We use Scalding and Spark for our MR jobs. Having to
code and tweak the jobs limited the scope of who can
write and run these jobs.
● “Readymade” jobs may not enable desired tweaks if
needed, affecting productivity and increasing
dependencies.
● Having to write code and ship jars hinders adhoc data
experimentation.
Cost control wasn’t trivial
● While data came in various sizes and shapes, what people
did with the data also varied - some use cases needed
sample of the data, while others wanted aggregations on
the entire data.
● It wasn’t trivial to handle all the different workloads while
minimizing costs.
● There was also the problem of adhoc jobs starving
production jobs in our existing Hadoop clusters.
Goals of Internal Data
Pipeline Platform
Enable easy discovery of
data.
Allow Schema to be
transparent and easy to
create while also allowing
introspection.
Minimal coding - have
prebuilt transformations for
common tasks and enable
SQL based workflow.
Goals of Internal Data
Pipeline Platform
UI and Wizard based
workflow to enable ANYONE
in the organization to run
pipelines and extract data.
Manage underlying clusters
and resources transparently
while optimizing for costs.
Support data
experimentations and also
production / customer use
cases.
MDA - Marketplace
of Datasets and
Algorithms
Tech Stack
MDA - DEMO!!!
MDA with our Data Pipeline
MatchAttributesBrandClassifyDedup
MDA with our Data Pipeline
MatchAttributesBrandClassifyDedup
Enrich Data Classify Brand
Feed data from
Customer
Feed output to
customer
MDA for ML Training Data
Filter Sample Preprocess
Training Data
Notebooks
//Setup the MDA client
import com.indix.holonet.core.client.SDKClient
val host = "holonet.force.io"
val port = 80
val client = SDKClient(host, port, spark)
//Create dataframe from any MDA dataset
val df = client.toDF("Indix", "PriceHistoryProfile")
df.show
Dec 2015
Start work on MDA
Mar 2016
First release
Lot more transforms
including sampling,
full Hive SQL support
and UX fixes
Late 2016
Performance
improvements, Spark
and infra upgrades.
June 2017
Ability to run pipelines
in customer’s cloud
infra
Jul 2016 Early 2017
Completely redesign
the UI based on over
year of feedback and
learnings. GraphQL for
the UI.
First closed preview of
MDA for a customer
Aug 2017
What does the future hold?
● We are far from done - things like automatic schema
inference, better caching are already planned.
● And as is the original vision, make it fully self-served for
our customers (internal and external.)
● Integration with other tools out there like Superset
● Open source as much as possible. First cut -
http://github.com/indix/sparkplug
Questions?
I blog at https://stacktoheap.com
Twitter and most other platforms @manojlds

More Related Content

What's hot

Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureLorenzo Nicora
 
JSON Data Modeling in Document Database
JSON Data Modeling in Document DatabaseJSON Data Modeling in Document Database
JSON Data Modeling in Document DatabaseDATAVERSITY
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityDATAVERSITY
 
Government GraphSummit: Leveraging Graphs for AI and ML
Government GraphSummit: Leveraging Graphs for AI and MLGovernment GraphSummit: Leveraging Graphs for AI and ML
Government GraphSummit: Leveraging Graphs for AI and MLNeo4j
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Denodo
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogDATAVERSITY
 
Agile & Data Modeling – How Can They Work Together?
Agile & Data Modeling – How Can They Work Together?Agile & Data Modeling – How Can They Work Together?
Agile & Data Modeling – How Can They Work Together?DATAVERSITY
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Denodo
 
Data Catalogues - Architecting for Collaboration & Self-Service
Data Catalogues - Architecting for Collaboration & Self-ServiceData Catalogues - Architecting for Collaboration & Self-Service
Data Catalogues - Architecting for Collaboration & Self-ServiceDATAVERSITY
 
Snowflake for Data Engineering
Snowflake for Data EngineeringSnowflake for Data Engineering
Snowflake for Data EngineeringHarald Erb
 
The Importance of Master Data Management
The Importance of Master Data ManagementThe Importance of Master Data Management
The Importance of Master Data ManagementDATAVERSITY
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
 
Why an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessWhy an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessInformatica
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesLars E Martinsson
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphNeo4j
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsDenodo
 
Data Virtualization: An Essential Component of a Cloud Data Lake
Data Virtualization: An Essential Component of a Cloud Data LakeData Virtualization: An Essential Component of a Cloud Data Lake
Data Virtualization: An Essential Component of a Cloud Data LakeDenodo
 
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceScaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceNeo4j
 

What's hot (20)

Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and Future
 
JSON Data Modeling in Document Database
JSON Data Modeling in Document DatabaseJSON Data Modeling in Document Database
JSON Data Modeling in Document Database
 
Data Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data QualityData Modeling, Data Governance, & Data Quality
Data Modeling, Data Governance, & Data Quality
 
Government GraphSummit: Leveraging Graphs for AI and ML
Government GraphSummit: Leveraging Graphs for AI and MLGovernment GraphSummit: Leveraging Graphs for AI and ML
Government GraphSummit: Leveraging Graphs for AI and ML
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Agile & Data Modeling – How Can They Work Together?
Agile & Data Modeling – How Can They Work Together?Agile & Data Modeling – How Can They Work Together?
Agile & Data Modeling – How Can They Work Together?
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)
 
Data Catalogues - Architecting for Collaboration & Self-Service
Data Catalogues - Architecting for Collaboration & Self-ServiceData Catalogues - Architecting for Collaboration & Self-Service
Data Catalogues - Architecting for Collaboration & Self-Service
 
Snowflake for Data Engineering
Snowflake for Data EngineeringSnowflake for Data Engineering
Snowflake for Data Engineering
 
The Importance of Master Data Management
The Importance of Master Data ManagementThe Importance of Master Data Management
The Importance of Master Data Management
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
 
Why an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessWhy an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business Success
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture Deliverables
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge Graph
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural Components
 
Data Virtualization: An Essential Component of a Cloud Data Lake
Data Virtualization: An Essential Component of a Cloud Data LakeData Virtualization: An Essential Component of a Cloud Data Lake
Data Virtualization: An Essential Component of a Cloud Data Lake
 
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceScaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
 

Viewers also liked

Disorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At ScaleDisorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At ScaleHelena Edelson
 
Oracle SQL Developer Tips & Tricks
Oracle SQL Developer Tips & TricksOracle SQL Developer Tips & Tricks
Oracle SQL Developer Tips & TricksJeff Smith
 
JustEnoughDevOpsForDataScientists
JustEnoughDevOpsForDataScientistsJustEnoughDevOpsForDataScientists
JustEnoughDevOpsForDataScientistsAnya Bida
 
Optimizing SlideShare for Twitter
Optimizing SlideShare for TwitterOptimizing SlideShare for Twitter
Optimizing SlideShare for TwitterKevin Baldacci
 
Business combinations
Business combinationsBusiness combinations
Business combinationsManish Kumar
 
3-D Leadership Model
3-D Leadership Model3-D Leadership Model
3-D Leadership ModelPaul Thornton
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
An overview of D2D in 3GPP LTE standard
An overview of D2D in 3GPP LTE standardAn overview of D2D in 3GPP LTE standard
An overview of D2D in 3GPP LTE standardssk
 
Mobile Network Sharing
Mobile Network SharingMobile Network Sharing
Mobile Network Sharing3G4G
 
Radio Frequency, Band and Spectrum
Radio Frequency, Band and SpectrumRadio Frequency, Band and Spectrum
Radio Frequency, Band and Spectrum3G4G
 
2G/3G Switch off Dates
2G/3G Switch off Dates2G/3G Switch off Dates
2G/3G Switch off Dates3G4G
 
Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...
Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...
Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...3G4G
 

Viewers also liked (13)

Disorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At ScaleDisorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At Scale
 
Oracle SQL Developer Tips & Tricks
Oracle SQL Developer Tips & TricksOracle SQL Developer Tips & Tricks
Oracle SQL Developer Tips & Tricks
 
JustEnoughDevOpsForDataScientists
JustEnoughDevOpsForDataScientistsJustEnoughDevOpsForDataScientists
JustEnoughDevOpsForDataScientists
 
Optimizing SlideShare for Twitter
Optimizing SlideShare for TwitterOptimizing SlideShare for Twitter
Optimizing SlideShare for Twitter
 
Business combinations
Business combinationsBusiness combinations
Business combinations
 
3-D Leadership Model
3-D Leadership Model3-D Leadership Model
3-D Leadership Model
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
An overview of D2D in 3GPP LTE standard
An overview of D2D in 3GPP LTE standardAn overview of D2D in 3GPP LTE standard
An overview of D2D in 3GPP LTE standard
 
The Future of Leadership Development
The Future of Leadership DevelopmentThe Future of Leadership Development
The Future of Leadership Development
 
Mobile Network Sharing
Mobile Network SharingMobile Network Sharing
Mobile Network Sharing
 
Radio Frequency, Band and Spectrum
Radio Frequency, Band and SpectrumRadio Frequency, Band and Spectrum
Radio Frequency, Band and Spectrum
 
2G/3G Switch off Dates
2G/3G Switch off Dates2G/3G Switch off Dates
2G/3G Switch off Dates
 
Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...
Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...
Building the foundations of Ultra-RELIABLE and Low-LATENCY Wireless Communica...
 

Similar to Democratization of Data @Indix

Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingCascading
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsStreamsets Inc.
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Etu Solution
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
 
Running Data Platforms Like Products
Running Data Platforms Like ProductsRunning Data Platforms Like Products
Running Data Platforms Like ProductsVMware Tanzu
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearnCascading
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningProvectus
 
SendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data WarehousingSendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data WarehousingAmazon Web Services
 
Big Data on Azure Tutorial
Big Data on Azure TutorialBig Data on Azure Tutorial
Big Data on Azure Tutorialrustd
 
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...marksimpsongw
 
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData Inc.
 
How pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architectureHow pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architectureKovid Academy
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
Oracle Data Integration - Overview
Oracle Data Integration - OverviewOracle Data Integration - Overview
Oracle Data Integration - OverviewJeffrey T. Pollock
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 editionDavid Talby
 
Informatica,Teradata,Oracle,SQL
Informatica,Teradata,Oracle,SQLInformatica,Teradata,Oracle,SQL
Informatica,Teradata,Oracle,SQLsivakumar s
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
 
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu BariApache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu Barijaxconf
 

Similar to Democratization of Data @Indix (20)

Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
Running Data Platforms Like Products
Running Data Platforms Like ProductsRunning Data Platforms Like Products
Running Data Platforms Like Products
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
SendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data WarehousingSendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data Warehousing
 
Big Data on Azure Tutorial
Big Data on Azure TutorialBig Data on Azure Tutorial
Big Data on Azure Tutorial
 
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
 
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
 
How pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architectureHow pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architecture
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Oracle Data Integration - Overview
Oracle Data Integration - OverviewOracle Data Integration - Overview
Oracle Data Integration - Overview
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
Informatica,Teradata,Oracle,SQL
Informatica,Teradata,Oracle,SQLInformatica,Teradata,Oracle,SQL
Informatica,Teradata,Oracle,SQL
 
SivakumarS
SivakumarSSivakumarS
SivakumarS
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu BariApache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Democratization of Data @Indix

  • 1. Democratization of Data Why and how we built an internal data pipeline platform @Indix
  • 4. Enabling businesses to build location-aware software. ~3.6 million websites use Google maps Enabling businesses to build product-aware software. Indix catalogs over 2.1 billion product offers Indix - The “Google Maps” of Products
  • 5. Crawling Pipeline Data PipelineML AggregateMatchStandardizeExtract AttributesClassifyDedupe Parse Crawl Data CrawlSeed Brand & Retailer Websites Feeds Pipeline Transform Clean Connect Feed Data Brand & Retailer Feeds Indix Product Catalog Customizable Feeds Search & Analytics Index Indexing PipelineReal Time Index Analyze Derive Join API (Bulk & Synchronous) Product Data Transformation Service Data Pipeline @Indix
  • 6. Democratization of Data Enable everyone in the organization to know what data is available, and then understand and work with it.
  • 7. At Indix, we have and work with a lot of data.
  • 8. Scale of Data @ Indix 2.1 Billion Product URLs 8 TB HTML Data Crawled Daily 1B Unique Products 7000 Categories 120 B Price Points 3000 Sites
  • 9. ● We have data in different shapes and sizes. ● HTML pages, Thrift and avro records. ● And also the usual suspects - CSVs and plain text data.
  • 10. ● Datasets can be in TBs or a few hundred KBs. ● Few billion records or a couple of hundreds.
  • 12. Data wasn’t discoverable ● The biggest problem was in knowing what data exists and where. ● Some of the data was in S3. Some in HDFS. Some in Google sheets. ● There was no way to know how frequently and when the data changed or updated.
  • 13. The schema wasn’t readily known ● The schema of the data, as expected, kept changing and it was difficult to keep track of which version of data had which schema. ● While Thrift and Avro alleviate this to an extent, access to data wasn’t simple, especially for non-engineers.
  • 14. Writing code limited scope ● We use Scalding and Spark for our MR jobs. Having to code and tweak the jobs limited the scope of who can write and run these jobs. ● “Readymade” jobs may not enable desired tweaks if needed, affecting productivity and increasing dependencies. ● Having to write code and ship jars hinders adhoc data experimentation.
  • 15. Cost control wasn’t trivial ● While data came in various sizes and shapes, what people did with the data also varied - some use cases needed sample of the data, while others wanted aggregations on the entire data. ● It wasn’t trivial to handle all the different workloads while minimizing costs. ● There was also the problem of adhoc jobs starving production jobs in our existing Hadoop clusters.
  • 16. Goals of Internal Data Pipeline Platform Enable easy discovery of data. Allow Schema to be transparent and easy to create while also allowing introspection. Minimal coding - have prebuilt transformations for common tasks and enable SQL based workflow.
  • 17. Goals of Internal Data Pipeline Platform UI and Wizard based workflow to enable ANYONE in the organization to run pipelines and extract data. Manage underlying clusters and resources transparently while optimizing for costs. Support data experimentations and also production / customer use cases.
  • 18. MDA - Marketplace of Datasets and Algorithms
  • 21. MDA with our Data Pipeline MatchAttributesBrandClassifyDedup
  • 22. MDA with our Data Pipeline MatchAttributesBrandClassifyDedup Enrich Data Classify Brand Feed data from Customer Feed output to customer
  • 23. MDA for ML Training Data Filter Sample Preprocess Training Data
  • 24. Notebooks //Setup the MDA client import com.indix.holonet.core.client.SDKClient val host = "holonet.force.io" val port = 80 val client = SDKClient(host, port, spark) //Create dataframe from any MDA dataset val df = client.toDF("Indix", "PriceHistoryProfile") df.show
  • 25. Dec 2015 Start work on MDA Mar 2016 First release Lot more transforms including sampling, full Hive SQL support and UX fixes Late 2016 Performance improvements, Spark and infra upgrades. June 2017 Ability to run pipelines in customer’s cloud infra Jul 2016 Early 2017 Completely redesign the UI based on over year of feedback and learnings. GraphQL for the UI. First closed preview of MDA for a customer Aug 2017
  • 26. What does the future hold? ● We are far from done - things like automatic schema inference, better caching are already planned. ● And as is the original vision, make it fully self-served for our customers (internal and external.) ● Integration with other tools out there like Superset ● Open source as much as possible. First cut - http://github.com/indix/sparkplug
  • 27. Questions? I blog at https://stacktoheap.com Twitter and most other platforms @manojlds