Handling the growth of Data
Piyush Katariya
@AhamPiyush
Growth of Data
What is it?
How to solve it?
Which metrics to consider?
Dive into design internals
Vertical Scaling
Let’s start with a small database, say Postgres
● Few RDBMS tables with relationships
● Average Stats
○ thousands of rows per table
○ OLAP - several hundreds of real time queries
○ OLTP - few hundreds of updates
● Optimizations
○ Single Node
○ Reasonable CPU frequencies
○ Indexes - Unique, Single BTree, Compound
○ Modern SSD
○ Buffer Cache
○ In memory ( $ )
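As a minimal sketch of the compound-index bullet above: the snippet uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption, chosen only so the example runs without a server). The table `orders`, its columns, and the index name are hypothetical. It builds a compound index and asks the planner whether a query that filters on both columns would use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "created_at TEXT, amount REAL)"
)
# Compound (multi-column) B-tree index: equality on customer_id, range on created_at.
conn.execute(
    "CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, created_at, amount) VALUES (?, ?, ?)",
    [(1, "2024-01-05", 10.0), (1, "2024-02-10", 25.0), (2, "2024-01-20", 7.5)],
)

query = (
    "SELECT id, amount FROM orders "
    "WHERE customer_id = ? AND created_at >= ? ORDER BY created_at"
)
params = (1, "2024-02-01")
rows = conn.execute(query, params).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is a text description of a step.
plan_text = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)
)
print(rows)
print(plan_text)
```

The column order in the index matters: the equality column comes first so the range condition on the second column can narrow the same index walk.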
Relatively larger database (1)
● Few or More RDBMS tables with relationships
● Average Stats
○ Few Millions of rows per table
○ Schemaless events data
○ OLAP - Several thousands of real time queries
○ OLTP - Few thousands of updates
● Optimizations
○ Master-slave replication with read replicas
○ JSON fields
○ Reasonably higher CPU frequencies
○ Advanced Indexes - Block range Index, BitMap index, Partial Index, Functional Index
○ Scheduled ReIndexing Jobs
○ Table Partitioning
○ Materialized views
○ Async commits
○ RAID 10
○ Caching solutions - View layer, Service layer, ORM layer, Database layer
○ In memory ( $$$ )
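One of the caching options listed above, a service-layer cache, can be sketched as a read-through cache with a time-to-live. This is an illustrative sketch only: `ReadThroughCache`, `load_user_from_db`, and the 30-second TTL are hypothetical names and values, and the clock is injected so the expiry behaviour can be shown without sleeping.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class ReadThroughCache:
    """Tiny service-layer cache: serve from memory, fall back to a loader."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the database is not touched
        value = self._loader(key)  # miss or expired entry: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, for example right after the row is updated."""
        self._entries.pop(key, None)


db_calls: List[str] = []

def load_user_from_db(user_id: str) -> dict:
    # Stand-in for a real query; records each call so hits and misses are visible.
    db_calls.append(user_id)
    return {"id": user_id, "name": "user-" + user_id}

fake_now = [0.0]
cache = ReadThroughCache(load_user_from_db, ttl_seconds=30, clock=lambda: fake_now[0])
cache.get("42")      # miss: loads from the database
cache.get("42")      # hit: served from memory
fake_now[0] = 31.0
cache.get("42")      # entry expired: loads again
print(len(db_calls))
```

The TTL bounds how stale a cached row can get; explicit invalidation on writes tightens that at the cost of more coordination.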
Relatively larger database (2)
● Few or More RDBMS tables with relationships
● Average Stats
○ Hundreds of millions of rows
○ Schemaless events, audits, analytics data, real time decisions based on events
○ Hundreds of thousands of real time queries
○ Hundreds of thousands of updates
● Optimization ???
○ Data just can’t fit in a SQL engine inexpensively
○ Sharding - Data can’t fit on a single node
○ Traditional tools fall short
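The sharding bullet above can be illustrated with a minimal hash-sharding sketch. The function name `shard_for`, the shard count, and the `user-N` keys are assumptions for illustration; a stable hash from hashlib is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 4
placement = Counter(shard_for(f"user-{i}", NUM_SHARDS) for i in range(10_000))
print(sorted(placement.items()))
```

Plain modulo placement like this moves most keys when the shard count changes; schemes such as consistent hashing exist to reduce that reshuffling.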
Horizontal Scaling
Research Papers by Google
Google File System (2003)
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
Map Reduce (2004)
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
Big Table (2006)
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Map Reduce
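The MapReduce programming model can be sketched in a single process with the classic word-count example: a map step emits (key, value) pairs, a shuffle step groups values by key, and a reduce step folds each group. This is an illustrative sketch, not a distributed implementation; all function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Fold the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


counts = word_count(["big data big ideas", "data grows"])
print(counts)
```

In a real cluster each map call runs near the data it reads and the shuffle moves data across the network, which is usually the expensive part.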
HDFS and Hadoop - Alternative to GFS
MapR-FS as Better Alternative
NoSQL Databases
● CAP Theorem compliant Distributed data structures
● Tunable Consistency and Availability at DB or Query level
● Don’t try to solve all problems but very specific ones
● Data model specific
○ Data distribution across machines/data centers
○ Data replication for reliability and fault tolerance
○ Data denormalization
○ Physical storage layouts
○ Data compression
○ Querying and Aggregation techniques
● Automatic failover
● Distributed clock synchronization
● Multi data center support
● Integration plugins with other databases
● Community
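The "tunable consistency" bullet above is often explained with the quorum rule used by Dynamo-style stores: with N replicas, a write acknowledged by W replicas and a read that contacts R replicas are guaranteed to overlap on at least one replica when R + W > N. The sketch below only encodes that arithmetic; the function name is hypothetical, and real systems add further caveats that are not modelled here.

```python
def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


for n, w, r in [(3, 1, 1), (3, 2, 2), (3, 3, 1)]:
    outcome = "overlap guaranteed" if read_overlaps_write(n, w, r) else "read may be stale"
    print(f"N={n} W={w} R={r}: {outcome}")
```

Lower W and R favour latency and availability; higher values favour consistency, which is the per-query dial the slide refers to.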
CAP - Choose any 2
Consistency - Every read sees a consistent, up-to-date view of the dataset
Availability - Read and write at any time
Partition tolerance - Keep working when the network between machines is split
BigTable based DB Design
Gossip Protocol (AP)
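A rough feel for how gossip spreads information can be had from a small push-style simulation: in each round, every node that already knows an update forwards it to a few randomly chosen peers. This is a simplified sketch under stated assumptions (no failures, synchronous rounds, a hypothetical `fanout` parameter), not the protocol of any particular database.

```python
import random


def gossip_rounds(num_nodes: int, fanout: int, seed: int = 0) -> int:
    """Count rounds until an update that starts at node 0 has reached every node."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        newly_informed = set()
        for _node in informed:
            # Each informed node pushes the update to `fanout` random peers.
            newly_informed.update(rng.sample(range(num_nodes), fanout))
        informed |= newly_informed
    return rounds


rounds_needed = gossip_rounds(num_nodes=100, fanout=2, seed=42)
print(rounds_needed)
```

Because the informed set can at most triple per round with a fanout of 2, the number of rounds grows roughly with the logarithm of the cluster size, which is why gossip scales to large clusters.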
RAFT Consensus (CP)
Key Value Databases
Key Value Database - Riak (AP)
Column Family Database - Cluster
Column Family Databases - Physical layout
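The slide's diagram is not preserved in this transcript, so the following is only a sketch of the layout described in the Bigtable paper: a sorted map indexed by row key, column key, and timestamp. The class name, the sample keys, and the in-memory list are assumptions for illustration; real stores keep this ordering in on-disk sorted files.

```python
import bisect
from typing import List, Optional, Tuple

# (row_key, "family:qualifier", negated_timestamp, value)
Cell = Tuple[str, str, int, str]


class SortedCellStore:
    """Cells kept sorted by row key, then column, then newest version first."""

    def __init__(self) -> None:
        self._cells: List[Cell] = []

    def put(self, row_key: str, column: str, timestamp: int, value: str) -> None:
        # Negating the timestamp makes newer versions sort before older ones.
        bisect.insort(self._cells, (row_key, column, -timestamp, value))

    def get_latest(self, row_key: str, column: str) -> Optional[str]:
        # A shorter tuple sorts before any longer tuple that shares its prefix.
        i = bisect.bisect_left(self._cells, (row_key, column))
        if i < len(self._cells) and self._cells[i][:2] == (row_key, column):
            return self._cells[i][3]
        return None

    def scan_row(self, row_key: str) -> List[Tuple[str, int, str]]:
        """Return every (column, timestamp, value) of one row as a contiguous slice."""
        i = bisect.bisect_left(self._cells, (row_key,))
        result = []
        while i < len(self._cells) and self._cells[i][0] == row_key:
            _, column, neg_ts, value = self._cells[i]
            result.append((column, -neg_ts, value))
            i += 1
        return result


store = SortedCellStore()
store.put("user#1", "profile:name", 1, "Ann")
store.put("user#1", "profile:name", 2, "Anna")
store.put("user#1", "stats:logins", 1, "7")
store.put("user#2", "profile:name", 1, "Bob")
print(store.get_latest("user#1", "profile:name"))
print(store.scan_row("user#1"))
```

Keeping cells sorted this way makes reads of one row, or of one column family within a row, a short sequential scan rather than scattered lookups.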
Column Family - Cassandra (AP)
Column Family - HBase (CP)
Document Database
Document Database - MongoDB (CP)
Graph Database
Graph Database - TitanDB layer (CP/CA/AP)
Search Engines and Logs
Distributed Queue/Log/Buffer
Computing Engines - HDFS and Spark
Computing Engines - MapR Stack
NewSQL
(F1 and) Google Spanner (2012)
http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Spanner : Becoming a SQL System (2017)
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf
“ There are two ways of constructing a software design. One way is to make it so simple that
there are obviously no deficiencies. And the other way is to make it so complicated that there
are no obvious deficiencies. ”
—C.A.R. Hoare
Open Source Spanner - CockroachDB (CP)
Conclusion ?
Ask Contextual Questions
Can you really not afford to host all of the data on a single machine?
Is your data highly connected or independent ?
Is your Primary workload OLTP (CP) or OLAP (CA) ?
Are your customers geographically distributed ?
Do you need to coordinate and scale business services without overwhelming the primary data store?
How much latency are you aiming for ? How much can you compromise on it ?
How much are you willing to spend on infrastructure cost ?
What is the skill level of the dev team?
What is your target Time to Market SLA for new or changing features ?
Accept Trade Offs
Connected (Graph) or Relational (SQL) Data VS Independent Data
Availability VS Consistency
Storage Space (Volatile or Persistent) VS Computation
Data Encryption VS Computation
Range Sharding VS Hash Sharding
Synchronous RPC VS Async and Reactive
Batch Processing VS Stream Processing
Embedded Computation Engine VS Separate Computation Engine
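The range-versus-hash sharding trade-off listed above can be made concrete with a small range-sharding sketch. The split points and function names are hypothetical. Keys are routed by binary search over sorted split points, so a range scan only touches the shards whose key ranges intersect it.

```python
import bisect
from typing import List

# Hypothetical split points: shard 0 holds keys below "g", shard 1 holds
# keys from "g" up to (not including) "n", shard 2 up to "t", shard 3 the rest.
SPLIT_POINTS = ["g", "n", "t"]


def range_shard_for(key: str) -> int:
    """Route a key to the shard whose key range contains it."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def shards_for_scan(start_key: str, end_key: str) -> List[int]:
    """Shards that a scan over [start_key, end_key] has to visit."""
    return list(range(range_shard_for(start_key), range_shard_for(end_key) + 1))


print(range_shard_for("apple"))      # lands in the first shard
print(shards_for_scan("ha", "mz"))   # narrow scan stays on one shard
print(shards_for_scan("a", "z"))     # full scan visits every shard
```

With hash sharding, adjacent keys are scattered, so the same narrow scan would generally fan out to all shards; in exchange, hashing spreads sequential or skewed keys more evenly and avoids hot ranges.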
My (Biased) Recommendation
MongoDB for medium-complexity loads and developer productivity
CockroachDB as Primary database
Large OLTP and OLAP loads - ScyllaDB (Hash Sharding) and MapR-DB (Range Sharding)
Druid for real time OLAP
Titan / JanusGraph and ScyllaDB for (highly connected) graph data
Redis HA Cluster for short lived distributed data structures
Kafka or Pulsar as Distributed queue/buffer
Prefer MapR as Hadoop platform
Thanks
@AhamPiyush

Handling the growth of data

  • 1. Handling the growth of Data Piyush Katariya @AhamPiyush
  • 2. Growth of Data
      ● What is it?
      ● How to solve it?
      ● Which metrics to consider?
      ● Dive into design internals
  • 4. Let’s start with a small database, say Postgres
      ● Few RDBMS tables with relationships
      ● Average Stats
          ○ Thousands of rows per table
          ○ OLAP - Several hundreds of real time queries
          ○ OLTP - Few hundreds of updates
      ● Optimizations
          ○ Single Node
          ○ Reasonable CPU frequencies
          ○ Indexes - Unique, Single BTree, Compound
          ○ Modern SSD
          ○ Buffer Cache
          ○ In memory ( $ )
  • 5. Relatively larger database (1)
      ● Few or More RDBMS tables with relationships
      ● Average Stats
          ○ Few Millions of rows per table
          ○ Schemaless events data
          ○ OLAP - Several thousands of real time queries
          ○ OLTP - Few thousands of updates
      ● Optimizations
          ○ Master Slave with read replica
          ○ JSON fields
          ○ Reasonably higher CPU frequencies
          ○ Advanced Indexes - Block Range Index, Bitmap Index, Partial Index, Functional Index
          ○ Scheduled ReIndexing Jobs
          ○ Table Partitioning
          ○ Materialized views
          ○ Async commits
          ○ RAID 10
          ○ Caching solutions - View layer, Service layer, ORM layer, Database layer
          ○ In memory ( $$$ )
  • 6. Relatively larger database (2)
      ● Few or More RDBMS tables with relationships
      ● Average Stats
          ○ Hundreds of Millions of rows
          ○ Schemaless events, audits, analytics data, real time decisions based on events
          ○ Hundreds of thousands of real time queries
          ○ Hundreds of thousands of updates
      ● Optimization ???
          ○ Data just can’t fit in a SQL engine in an inexpensive way
          ○ Sharding - Data can’t fit in a Single node
          ○ Traditional tools fall short
  • 8. Research Papers by Google
      ● Google File System (2003) http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
      ● Map Reduce (2004) http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
      ● Big Table (2006) http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
  • 10. HDFS and Hadoop - Alternative to GFS
  • 11. MapR-FS as Better Alternative
  • 12. NoSQL Databases
      ● CAP Theorem compliant Distributed data structures
      ● Tunable Consistency and Availability at DB or Query level
      ● Don’t try to solve all problems but very specific ones
      ● Data model specific
          ○ Data distribution across machines/data centers
          ○ Data replication for reliability and fault tolerance
          ○ Data denormalization
          ○ Physical storage layouts
          ○ Data compression
          ○ Querying and Aggregation techniques
      ● Automatic failover
      ● Distributed clock synchronization
      ● Multi data center support
      ● Integration plugins with other databases
      ● Community
  • 13. CAP - Choose any 2
      ● Consistency - Consistent view of the dataset
      ● Availability - Read and write at any time
      ● Partition tolerance - Keep operating when the network between machines is split
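
The "tunable consistency" idea from slide 12 is often explained with replica quorums. The sketch below is an illustration under that assumption and is not taken from the slides or from any specific database's API: with `n` replicas, a write acknowledged by `w` of them and a read that consults `r` of them are guaranteed to overlap on at least one replica when `r + w > n`. The function name and the labelled settings are hypothetical.

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """Return True when every read quorum shares a replica with every write quorum.

    n: number of replicas holding a key (assumed fixed)
    r: replicas consulted on a read
    w: replicas that must acknowledge a write
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    # Hypothetical per-query settings for a 3-replica cluster.
    settings = [("read 1 / write 1", 1, 1),
                ("read 2 / write 2", 2, 2),
                ("read 1 / write 3", 1, 3)]
    for label, r, w in settings:
        print(label, "->", quorum_overlaps(3, r, w))
```

Lower `r` and `w` values lean toward availability and latency, while higher values lean toward consistency, which is the per-query dial the slide refers to.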
  • 19. Key Value Database - Riak (AP)
  • 21. Column Family Databases - Physical layout
  • 22. Column Family - Cassandra (AP)
  • 23. Column Family - HBase (CP)
  • 25. Document Database - MongoDB (CP)
  • 27. Graph Database - TitanDB layer (CP/CA/AP)
  • 30. Computing Engines - HDFS and Spark
  • 31. Computing Engines - MapR Stack
  • 32. NewSQL
      ● (F1 and) Google Spanner (2012) http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
      ● Spanner: Becoming a SQL System (2017) https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf
      “There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.” (C.A.R. Hoare)
  • 33. Open Source Spanner - CockroachDB (CP)
  • 35. Ask Contextual Questions
      ● Can you really not afford to host all of the data on a Single Machine?
      ● Is your data highly connected or independent?
      ● Is your Primary workload OLTP (CP) or OLAP (CA)?
      ● Are your customers geographically distributed?
      ● Do you need to coordinate and scale business services without overwhelming the primary data store?
      ● How much latency are you aiming for? How much can you compromise on it?
      ● How much are you willing to spend on infrastructure cost?
      ● What’s the skill competency level of the dev team?
      ● What is your target Time to Market SLA for new or changing features?
  • 36. Accept Trade Offs
      ● Connected (Graph) or Relational (SQL) Data VS Independent Data
      ● Availability VS Consistency
      ● Storage Space (Volatile or Persistent) VS Computation
      ● Data Encryption VS Computation
      ● Range Sharding VS Hash Sharding
      ● Synchronous RPC VS Async and Reactive
      ● Batch Processing VS Stream Processing
      ● Embedded Computation Engine VS Separate Computation Engine
  • 37. My (Biased) Recommendation
      ● MongoDB for medium complex load and developer productivity
      ● CockroachDB as Primary database
      ● Large OLTP and OLAP loads - ScyllaDB (Hash Sharding) and MapR-DB (Range Sharding)
      ● Druid for real time OLAP
      ● Titan / Janus Graph and ScyllaDB for (highly connected) graph data
      ● Redis HA Cluster for short lived distributed data structures
      ● Kafka or Pulsar as Distributed queue/buffer
      ● Prefer MapR as Hadoop platform
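
The range versus hash sharding trade-off named in slides 36 and 37 can be illustrated with a minimal sketch. It is only an assumption-laden toy, not how ScyllaDB or MapR-DB route keys internally: the function names, the MD5-based hash, and the split points are all hypothetical choices made for the example.

```python
import bisect
import hashlib
from typing import Sequence


def hash_shard(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key.

    Spreads keys evenly, but neighbouring keys usually land on different
    shards, so a range scan has to touch every shard.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key: str, split_points: Sequence[str]) -> int:
    """Pick a shard by comparing the key with sorted split points.

    Keeps neighbouring keys together, which helps range scans, but a
    popular key range can overload a single shard.
    """
    return bisect.bisect_right(split_points, key)


# Hypothetical split points: shard 0 holds keys below "h",
# shard 1 holds "h" up to (not including) "p", shard 2 holds the rest.
SPLIT_POINTS = ["h", "p"]

if __name__ == "__main__":
    for key in ["apple", "apricot", "mango", "zebra"]:
        print(key,
              "range shard:", range_shard(key, SPLIT_POINTS),
              "hash shard:", hash_shard(key, 3))
```

In this sketch "apple" and "apricot" always share a range shard, whereas their hash shards depend only on the digest, which is the locality versus even distribution choice the slide asks the reader to accept.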