BUILDING MODERN DATA LAKES
Minio, Spark and Unified Data Architecture work in unison
By Ravi Shankar, October 2018
10/29/18
1
FIRST ORDER LOGIC
First-order logic, also known as first-order predicate calculus and predicate logic, is a collection
of formal systems used in mathematics, philosophy, linguistics, and computer science.
Married("Harry", "Sally", "12-Dec-1995").
IsMotherOf("Sally", "Peter").
IsFatherOf("Harry", "Peter").
The Relational Model says that in your
database this is how you think about and
represent all your data
There exists one or more X such that the
marriage happened in 1995
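As a rough illustration of the idea above, the three facts can be held as relations (sets of tuples) and the existential statement checked with a query. This is only a sketch; the helper names `exists_marriage_in` and `parents_of` are hypothetical and not from the slides.

```python
# Facts from the slide, stored as relations (sets of tuples).
married = {("Harry", "Sally", "12-Dec-1995")}
is_mother_of = {("Sally", "Peter")}
is_father_of = {("Harry", "Peter")}


def exists_marriage_in(year: int) -> bool:
    """True if there exists at least one tuple in `married` dated in `year`."""
    return any(date.endswith(str(year)) for _, _, date in married)


def parents_of(child: str) -> set:
    """Union of the mother and father relations, filtered by child."""
    mothers = {m for m, c in is_mother_of if c == child}
    fathers = {f for f, c in is_father_of if c == child}
    return mothers | fathers


print(exists_marriage_in(1995))
print(sorted(parents_of("Peter")))
```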
THE DATA MODELS.
subject-oriented, integrated, time-variant and non-volatile collection of data
integrating data marts into a dimensional model for consumption
PROBLEM STATEMENT.
Earlier New Digitalization Initiatives !!!
1. Change everything
2. Keep as is. Add new relations
3. Move to:
CTO GETS HADOOP IN.
1. Scale-out architecture
2. Shared nothing
3. Compute + storage together
4. Google-like!!
ALL WENT SMOOTHLY UNTIL...
A zip file was sent from a third-party vendor, containing one million JPEG files. We wrote a
MapReduce program to process it.
The file is 8 GB, split into 128 MB blocks: about 64 blocks. With 3x replication,
the total size is about 24 GB.
We executed the application. What might have happened?
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
SPLITTABILITY IMPORTANCE.
So, performance is not guaranteed in all scenarios with existing distributed technologies
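A small sketch of why splittability matters for the zip file scenario, assuming binary units (8 GiB file, 128 MiB blocks) and assuming that a non-splittable file is read by a single task. The function names are illustrative only.

```python
import math


def hdfs_layout(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored across all replicas)."""
    blocks = math.ceil(file_bytes / block_bytes)
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes


def max_parallel_tasks(blocks, splittable):
    """A splittable file can use one task per block; a non-splittable
    file (such as a zip archive) has to be read start to end by one task."""
    return blocks if splittable else 1


blocks, raw_bytes = hdfs_layout(8 * 1024**3)
print(blocks, raw_bytes / 1024**3)
print(max_parallel_tasks(blocks, splittable=True))
print(max_parallel_tasks(blocks, splittable=False))
```

Under these assumptions the data is spread over many blocks, yet the job runs with the parallelism of a single machine.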
THE SUBSEQUENT MONTHS.
1. We copied data from Netezza to Hive
2. We created reports from Tableau with Hive ODBC
3. We created a copy of Hive into HBase
4. We have HDP, but Cloudera supports Impala
5. MapReduce is slow
6. Not all data is in one place
7. Maybe some more tools are needed
8. We need a unified data architecture solution
9. Rebalancing took an entire week
10. Important file types are not splittable
11. 3 copies take too much space
12. Cost of maintenance is high
13. We may need to go to cloud
14. SLA not met
15. Too much operational work
WHAT MAKES ORGANIZATION FAMOUS?
CTO wants AI, but AI is different from IA!
AI: Autonomous systems that REPLACE the human cognitive thought process
IA: Autonomous systems that SUPPORT the human cognitive thought process
[Diagram: Algorithm, Input, Output ?]
Both need machine learning and deep learning. These are the means to do AI or IA.
[Diagram: Algorithm ?, Input, Output]
OUTPUT MAY BE NEEDED INSTANTLY, BUT LEARNING IT
MAY TAKE HOURS/DAYS/MONTHS
[Diagram: neural network with Inputs, "Hidden" Layer(s), and Output Layer]
FILE SYSTEMS.
• The problem is the file system. Traditional block-based file systems use
lookup tables to store file locations. They break each file up into small
blocks, generally 4 KB in size, and store the byte offset of each block in a
large table.
• This is fine for small volumes, but when you attempt to scale to the
petabyte range, these lookup tables become extremely large. It’s like
a database. The more rows you insert, the slower your queries
run. Eventually your performance degrades to the point where your
file system becomes unusable.
• When this happens, users are forced to split their data sets up into
multiple LUNs to maintain an acceptable level of performance. This
adds complexity and makes these systems difficult to manage.
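To get a feel for how the lookup table grows, the sketch below counts one table entry per 4 KB block. This is a simplification (real file systems use extents and other structures), and `block_table_entries` is a hypothetical helper.

```python
TB = 1024**4
PB = 1024**5


def block_table_entries(volume_bytes, block_bytes=4 * 1024):
    """Number of block entries needed if every block is tracked individually."""
    return volume_bytes // block_bytes


for label, size in [("1 TB", TB), ("1 PB", PB)]:
    print(f"{label}: {block_table_entries(size):,} entries")
```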
BLOCK BASED STORAGE SYSTEMS.
• To solve this problem, some organizations are deploying scale-out file
systems, like HDFS. This fixes the scalability problem, but keeping these
systems up and running is a labor-intensive process.
• Scale-out file systems are complex and require constant
maintenance. In addition, most of them rely on replication to protect
your data. The standard configuration is triple-replication, where you
store 3 copies of every file.
• This requires an extra 200% of raw disk capacity for
overhead! Everyone thinks that they’re saving money by using
commodity drives, but by the time you store three full copies of your
data set, the cost savings disappear. When we’re talking about
petabyte-scale applications, this is an expensive approach.
SOLUTION TO STORAGE.
• Object stores achieve their scalability by decoupling file management
from the low-level block management. Each disk is formatted with a
standard local file system, like ext4. Then a set of object storage
services is layered on top of it, combining everything into a single,
unified volume.
• Files are stored as “objects” in the object store rather than files on a
file system. By offloading the low-level block management onto the
local file systems, the object store only has to keep track of the high-
level details.
• This layer of separation keeps the file lookup tables at a manageable
size, allowing you to scale to hundreds of petabytes without
experiencing degraded performance.
SOLUTION TO STORAGE.
• To maximize usable space, object stores use a technique called
Erasure Coding to protect your data. You can think of it as the next
generation of RAID.
• In an erasure coded volume, files are divided into shards, with each
shard being placed on a different disk. Additional shards are added,
containing error correction information, which provide protection from
data corruption and disk failures. Only a subset of the shards is
required to retrieve each file, which means it can survive multiple disk
failures without the risk of data loss.
• Erasure coded volumes can survive more disk failures than RAID and
typically provide more than double the usable capacity of triple
replication, making them the ideal choice for petabyte-scale storage.
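The capacity comparison can be sketched with simple ratios. The shard layouts below (8 data + 8 parity, 12 data + 4 parity) are hypothetical examples, not configurations stated in the slides.

```python
def usable_fraction_replication(copies):
    """Fraction of raw capacity holding unique data with N full copies."""
    return 1 / copies


def usable_fraction_erasure(data_shards, parity_shards):
    """Fraction of raw capacity holding unique data with erasure coding."""
    return data_shards / (data_shards + parity_shards)


replication = usable_fraction_replication(3)
for data, parity in [(8, 8), (12, 4)]:
    ec = usable_fraction_erasure(data, parity)
    print(f"{data}+{parity}: usable {ec:.0%}, "
          f"{ec / replication:.2f}x triple replication, "
          f"survives {parity} lost shards")
```

How large the gain is depends on the data-to-parity ratio: more parity shards tolerate more failures but leave less usable capacity.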
MINIO - ERASURE CODING.
• EC is based on a technology called Forward Error Correction
(FEC), developed more than 50 years ago (Richard Hamming,
1950). It was originally used for controlling errors in data
transmission over noisy or unreliable telecommunication
channels. Reed-Solomon codes are a kind of EC, used widely in
CDs/DVDs, Blu-ray, satellite communication, etc.
• A message of k symbols can be transformed into a longer
message (a code word with parity) with n symbols such that
the original message can be recovered from a subset of
the n symbols. The special case n = k + 1 is called a
parity check.
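The n = k + 1 case can be shown with XOR parity: k data shards plus one parity shard, where any single lost shard can be rebuilt from the rest. This is a minimal sketch of the parity idea only; it is not Reed-Solomon and not MinIO's implementation, and the shard contents are made up.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length shards."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards):
    """Parity shard = XOR of all data shards (n = k + 1)."""
    return reduce(xor_bytes, shards)


def recover_missing(surviving_shards, parity):
    """XOR the parity with the surviving shards to rebuild the lost one."""
    return reduce(xor_bytes, surviving_shards, parity)


data = [b"ABCD", b"EFGH", b"IJKL"]   # k = 3 data shards
parity = make_parity(data)           # 1 parity shard, so n = 4

# Pretend the second shard's disk failed.
recovered = recover_missing([data[0], data[2]], parity)
print(recovered == data[1])
```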
MINIO - ERASURE CODING.
TOP 5 : COST.
• https://amzn.to/2Q7AWGo
• S3: 23 USD per TB per month (12.5 USD per TB for cold access).
• HDFS: Using d2.8xl instance types ($5.52/hr with 71% discount, 48TB
HDD), it costs 5.52 x 0.29 x 24 x 30 / 48 x 3 / 0.7 = $103/month for 1TB of
data. (Note that with reserved instances, it is possible to achieve lower
price on the d2 family.)
• S3 is 5X cheaper than HDFS.
• S3’s human cost is virtually zero, whereas it usually takes a team of
Hadoop engineers or vendor support to maintain HDFS. Once we
factor in human cost, S3 is 10X cheaper than HDFS clusters on EC2 with
comparable capacity.
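The HDFS cost arithmetic above can be reproduced step by step. The variable names are illustrative, and reading the 0.7 divisor as a disk utilization ceiling is an assumption; the slide does not label it.

```python
hourly_on_demand_usd = 5.52      # d2.8xl on-demand price from the slide
price_factor = 0.29              # paying 29% after the 71% discount
hours_per_month = 24 * 30
instance_capacity_tb = 48
replication = 3
utilization = 0.7                # assumed meaning of the 0.7 divisor

hdfs_usd_per_tb_month = (
    hourly_on_demand_usd * price_factor * hours_per_month
    / instance_capacity_tb * replication / utilization
)
s3_usd_per_tb_month = 23

print(round(hdfs_usd_per_tb_month))
print(round(hdfs_usd_per_tb_month / s3_usd_per_tb_month, 1))
```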
TOP 5 : ELASTICITY.
• From Databricks:
• 99.999999999% durability and 99.99% availability. Note that this is
higher than the vast majority of organizations’ in-house services.
• Majority of Hadoop clusters have availability lower than 99.9%, i.e. at
least 9 hours of downtime per year.
• With cross-AZ replication that automatically replicates across different
data centers, S3’s availability and durability is far superior to HDFS’.
• Hortonworks – Data Plane Services in 2019!
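The availability percentages above translate into yearly downtime as follows. A short sketch assuming a 365-day year; `downtime_hours_per_year` is a hypothetical helper.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent):
    """Expected unavailable hours per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99):
    print(f"{pct}% available: {downtime_hours_per_year(pct):.2f} hours down per year")
```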
TOP 5 : PERFORMANCE.
• When using HDFS and getting perfect data locality, it is possible to get
~3 GB/s per node local read throughput on some of the instance types (e.g.
i2.8xl, roughly 90 MB/s per core). Spark DBIO, a cloud I/O optimization
module, provides optimized connectors to S3 and can sustain
~600 MB/s read throughput on i2.8xl (roughly 20 MB/s per core).
• That is to say, on a per-node basis, HDFS can yield 6X higher read
throughput than S3. Thus, given that S3 is 10x cheaper than HDFS,
we find that S3 is almost 2x better than HDFS on performance
per dollar.
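The performance-per-dollar reasoning can be checked with the figures quoted above. This sketch treats "~3 GB" as 3000 MB/s per node, which is an assumption, and shows the result both for the raw throughput figures and for the rounded 6X figure.

```python
hdfs_read_mb_s = 3000      # ~3 GB/s per node, assumed to mean 3000 MB/s
s3_read_mb_s = 600         # ~600 MB/s per node
cost_advantage_s3 = 10     # S3 is quoted as 10x cheaper

throughput_ratio_raw = hdfs_read_mb_s / s3_read_mb_s
throughput_ratio_quoted = 6

# Performance per dollar of S3 relative to HDFS.
perf_per_dollar_raw = cost_advantage_s3 / throughput_ratio_raw
perf_per_dollar_quoted = cost_advantage_s3 / throughput_ratio_quoted

print(throughput_ratio_raw)
print(perf_per_dollar_raw)
print(round(perf_per_dollar_quoted, 2))
```

Either way, the cost advantage outweighs the per-node throughput gap, which is where the "almost 2x" figure comes from.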
TOP 5 :TRANSACTIONS.
• hadoop fs -mkdir -p sample/a/b/c/
• Now you put the file into a/b/c
• Buckets…not directories
• In a Minio server instance, a single RESTful PUT request will create an
object “a/b/c/data.txt” in “mybucket” without having to create
“a/b/c” in advance
• This happens because object stores support hierarchical naming and
operations without the need for directories.
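The "buckets, not directories" point can be modeled with a flat key-value map. `ToyBucket` below is a hypothetical in-memory stand-in, not the MinIO or S3 API: a single put creates the object under its full key, and "directories" exist only as key prefixes at listing time.

```python
class ToyBucket:
    """In-memory model of a bucket: a flat mapping from key to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # No parent "directory" has to exist; the key is just a string.
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # Hierarchy is emulated by filtering on a key prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


mybucket = ToyBucket()
mybucket.put("a/b/c/data.txt", b"hello")
print(mybucket.list("a/b/"))
print(mybucket.list("x/"))
```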
TOP 5 :TRANSACTIONS.
• Data Move is very interesting…
• What happens if a write in Spark (saveAsTextFile) fails for
a partition?
• Rename is atomic – the most critical part in Hadoop write flow
• Minio (or any object store) does not provide an atomic rename. In
fact, rename should be avoided in object storage altogether, since it
consists of two separate operations: copy and delete.
• Normal COPY is mapped to RESTful PUT request or RESTful COPY
request and triggers internal data movements between storage
nodes. The subsequent delete command maps to the RESTful DELETE
request, but usually relies on the bucket listing operation to identify
which data must be deleted. This makes a rename highly inefficient in
object stores, and the lack of atomicity may leave data in a
corrupted state.
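The non-atomic rename can be illustrated with a toy model in which the store is a plain dictionary and a crash is simulated between the two steps. The key names and the `fail_after_copy` flag are hypothetical; this is not how any particular object store is implemented.

```python
def rename(store, src, dst, fail_after_copy=False):
    """Rename as an object store does it: copy, then delete."""
    store[dst] = store[src]      # step 1: copy (a PUT or COPY request)
    if fail_after_copy:
        raise RuntimeError("simulated crash between copy and delete")
    del store[src]               # step 2: delete (a DELETE request)


store = {"tmp/part-0000": b"rows"}
try:
    rename(store, "tmp/part-0000", "out/part-0000", fail_after_copy=True)
except RuntimeError:
    pass

# After the crash both the temporary and the final key exist.
print(sorted(store))
```

A reader listing the bucket at this point cannot tell whether the output is complete, which is the inconsistency an atomic rename avoids.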
TOP 5 :TRANSACTIONS: PERFORMANCE.
• Hadoop ships two commit algorithms: version 1, which moves staged task
output files to their final locations at the end of the job, and version 2,
which moves files as individual job tasks complete.
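The difference between the two commit versions can be sketched with a toy simulation. This models only the timing of when files become visible; it is not Hadoop's committer code, and `run_job` and `fail_at_task` are hypothetical names.

```python
def run_job(task_outputs, version, fail_at_task=None):
    """Return the files visible in the final location when the job ends.

    version 1: task output is staged and moved at the end of the job.
    version 2: each task's output is moved as soon as the task completes.
    """
    staged, final = {}, {}
    for index, (name, data) in enumerate(task_outputs.items()):
        if fail_at_task == index:
            return final             # job aborted before its commit step
        if version == 1:
            staged[name] = data
        else:
            final[name] = data
    final.update(staged)             # job commit (only does work in version 1)
    return final


outputs = {"part-0": b"a", "part-1": b"b", "part-2": b"c"}
print(sorted(run_job(outputs, version=1, fail_at_task=1)))
print(sorted(run_job(outputs, version=2, fail_at_task=1)))
```

In this model version 2 does less work at the end of the job but can leave partial output visible after a failure, while version 1 leaves nothing visible until the final move.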
TOP 5 : DATA INTEGRITY - ELEGANT SOLUTION FROM SPARK.
• Version 2.1 : https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
SO WHAT WILL IT LOOK LIKE?
COMPARISON.
FEATURE          MINIO            HDFS                   MINIO VS HDFS
COST/TB/MONTH    $                $$                     10x
AVAILABILITY     99.99%           99.9%                  10x
DURABILITY       99.999999999%    99.9999% (Estimated)   10x
WRITES           DBIO             YES                    COMPARABLE
ELASTICITY       YES              NO                     MINIO IS ELASTIC
MINIO.
• High-performance distributed object storage server
• Simple, efficient, lightweight, with no learning curve
DEMO TIME
1) MINIO INTEROPERABILITY WITH HADOOP – PUTTING AND GETTING DATA
2) MINIO INTEROPERABILITY WITH HIVE
3) MINIO WITH UNIFIED DATA ARCHITECTURE – PRESTO
4) MINIO WITH SPARK - FILES
5) MINIO WITH SPARK – OBJECTS
6) MINIO WITH SEARCH
SUMMARY: EARN BY THIS ARCHITECTURE
THANK YOU!
Refer to:
https://blog.minio.io/modern-data-lake-with-minio-part-1-716a49499533
https://blog.minio.io/modern-data-lake-with-minio-part-2-f24fb5f82424
https://www.minio.io/
Apache Spark
Presto
QUESTIONS?

Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 

Building modern data lakes

  • 1. BUILDING MODERN DATA LAKES Minio, Spark and Unified Data Architecture work in unison By Ravi Shankar, October 2018 10/29/18 1
  • 2. FIRST ORDER LOGIC First-order logic, also known as first-order predicate calculus or predicate logic, is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science. Married("Harry", "Sally", "12-Dec-1995"). IsMotherOf("Sally", "Peter"). IsFatherOf("Harry", "Peter"). The Relational Model says that this is how you think about and represent all the data in your database. There exists one or more X such that the marriage happened in 1995.
  • 3. THE DATA MODELS. Subject-oriented, integrated, time-variant and non-volatile collection of data. Integrating data marts into a dimensional model for consumption.
  • 4. PROBLEM STATEMENT. Earlier New Digitalization Initiatives!!! 1. Change everything 2. Keep as is. Add new relations 3. Move to:
  • 5. CTO GETS HADOOP IN. 1. Scale-out architecture 2. Shared nothing 3. Compute + storage together 4. Google-like!!
  • 6. ALL WENT SMOOTHLY UNTIL... A zip file containing one million JPEG files was sent by a third-party vendor. We wrote a MapReduce program to process it. The file is 8 GB, split into 128 MB blocks: about 64 blocks. With 3x replication, the total size is about 24 GB. We executed the application. What might have happened? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
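Taking the 8 GB file size and 128 MB block size at face value, the block and replication arithmetic can be checked with a short sketch. Binary units (1 GB = 1024 MB) are an assumption, and `hdfs_footprint` is a hypothetical helper name, not a Hadoop API.

```python
import math

BLOCK_MB = 128     # HDFS block size quoted on the slide
REPLICATION = 3    # HDFS replication factor quoted on the slide

def hdfs_footprint(file_gb):
    """Return (block count, raw GB stored) for a file of file_gb gigabytes."""
    # Assumption: 1 GB = 1024 MB.
    blocks = math.ceil(file_gb * 1024 / BLOCK_MB)
    raw_gb = file_gb * REPLICATION
    return blocks, raw_gb

blocks, raw_gb = hdfs_footprint(8)
print(blocks, "blocks,", raw_gb, "GB of raw storage")
```

The block count matters less than it seems here: a zip archive cannot be split, so spreading its blocks across the cluster does not let many tasks read it in parallel.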
  • 7. SPLITTABILITY IMPORTANCE. So, performance is not guaranteed in all scenarios with existing distributed technologies
  • 8. THE SUBSEQUENT MONTHS. 1. We copied data from Netezza to HIVE 2. We created reports from Tableau with HIVE ODBC 3. We created a copy of HIVE into HBASE 4. We have HDP, but Cloudera supports Impala 5. MapReduce is slow 6. Not all data is in one place 7. Maybe some more tools are needed 8. We need a unified data architecture solution 9. Rebalancing took an entire week 10. Important file types are not splittable 11. 3 copies take too much space 12. Cost of maintenance is high 13. We may need to go to cloud 14. SLA not met 15. Too much operational work
  • 9. WHAT MAKES AN ORGANIZATION FAMOUS? CTO wants AI, but AI is different from IA!! AI: Autonomous systems that REPLACE the human cognitive thought process. AI (IA): Autonomous systems that SUPPORT the human cognitive thought process. [Diagram: Algorithm, Input, Output ?] Both need machine learning and deep learning. These are means to do AI or IA. [Diagram: Algorithm ?, Input, Output] OUTPUT MAY BE NEEDED INSTANTLY, BUT LEARNING IT MAY TAKE HOURS/DAYS/MONTHS
  • 11. FILE SYSTEMS. • The problem is the file system. Traditional block-based file systems use lookup tables to store file locations. They break each file up into small blocks, generally 4 KB in size, and store the byte offset of each block in a large table. • This is fine for small volumes, but when you attempt to scale to the petabyte range, these lookup tables become extremely large. It’s like a database: the more rows you insert, the slower your queries run. Eventually your performance degrades to the point where your file system becomes unusable. • When this happens, users are forced to split their data sets up into multiple LUNs to maintain an acceptable level of performance. This adds complexity and makes these systems difficult to manage.
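To get a feel for how large such a lookup table becomes, here is a rough sketch that counts 4 KB blocks per volume. The 8 bytes per table entry is an assumption made only for illustration; real file systems use different on-disk structures.

```python
BLOCK_BYTES = 4 * 1024    # 4 KB blocks, as on the slide
ENTRY_BYTES = 8           # assumption: one 8-byte offset per block

TIB = 1024 ** 4
PIB = 1024 ** 5

def block_entries(volume_bytes):
    """Number of block entries a flat lookup table would need."""
    return volume_bytes // BLOCK_BYTES

entries_1tib = block_entries(TIB)
entries_1pib = block_entries(PIB)
table_bytes_1pib = entries_1pib * ENTRY_BYTES

print(f"1 TiB volume: {entries_1tib:,} entries")
print(f"1 PiB volume: {entries_1pib:,} entries, "
      f"table of about {table_bytes_1pib / TIB:.0f} TiB")
```

Under these assumptions a 1 PiB volume needs roughly 275 billion entries, which is why the table itself becomes a scaling problem.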
  • 12. BLOCK BASED STORAGE SYSTEMS. • To solve this problem, some organizations are deploying scale-out file systems, like HDFS. This fixes the scalability problem, but keeping these systems up and running is a labor-intensive process. • Scale-out file systems are complex and require constant maintenance. In addition, most of them rely on replication to protect your data. The standard configuration is triple replication, where you store 3 copies of every file. • This requires an extra 200% of raw disk capacity for overhead! Everyone thinks that they’re saving money by using commodity drives, but by the time you store three full copies of your data set, the cost savings disappear. When we’re talking about petabyte-scale applications, this is an expensive approach.
  • 13. SOLUTION TO STORAGE. • Object stores achieve their scalability by decoupling file management from the low-level block management. Each disk is formatted with a standard local file system, like ext4. Then a set of object storage services is layered on top of it, combining everything into a single, unified volume. • Files are stored as “objects” in the object store rather than as files on a file system. By offloading the low-level block management onto the local file systems, the object store only has to keep track of the high-level details. • This layer of separation keeps the file lookup tables at a manageable size, allowing you to scale to hundreds of petabytes without experiencing degraded performance.
  • 14. SOLUTION TO STORAGE. • To maximize usable space, object stores use a technique called Erasure Coding to protect your data. You can think of it as the next generation of RAID. • In an erasure coded volume, files are divided into shards, with each shard being placed on a different disk. Additional shards are added, containing error correction information, which provide protection from data corruption and disk failures. Only a subset of the shards is required to retrieve each file, which means it can survive multiple disk failures without the risk of data loss. • Erasure coded volumes can survive more disk failures than RAID and typically provide more than double the usable capacity of triple replication, making them the ideal choice for petabyte-scale storage.
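The capacity claim can be made concrete with a small sketch comparing usable capacity under triple replication and under erasure coding. The 12 data + 4 parity layout is an illustrative assumption, not a Minio default; other layouts give different ratios.

```python
def usable_fraction_replication(copies):
    """Fraction of raw capacity holding unique data when every file is stored `copies` times."""
    return 1 / copies

def usable_fraction_erasure(data_shards, parity_shards):
    """Fraction of raw capacity holding unique data under erasure coding."""
    return data_shards / (data_shards + parity_shards)

replication = usable_fraction_replication(3)   # triple replication
erasure = usable_fraction_erasure(12, 4)       # assumed 12 data + 4 parity layout

print(f"triple replication: {replication:.1%} usable, survives 2 lost copies")
print(f"12+4 erasure coding: {erasure:.1%} usable, survives 4 lost shards")
print(f"capacity ratio: {erasure / replication:.2f}x")
```

With this layout the erasure coded volume exposes 75% of raw capacity against about 33% for triple replication, while tolerating more simultaneous disk failures.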
  • 15. MINIO - ERASURE CODING. • EC is based on a technology called Forward Error Correction (FEC), developed more than 50 years ago (Richard Hamming, around 1950). It was used originally for controlling errors in data transmission over noisy or unreliable telecommunication channels. Reed-Solomon codes are a kind of EC, used widely in CDs/DVDs, Blu-ray, satellite communication, etc. • A message of k symbols can be transformed into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols. If n = k + 1, this is the special case called a parity check.
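The n = k + 1 parity case is small enough to sketch directly: one XOR parity shard lets any single missing data shard be rebuilt. This is only the simplest illustration of the idea; Reed-Solomon codes generalize it to several parity shards, and the function names here are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(shards):
    """Build the single parity shard for k equal-length data shards."""
    return reduce(xor_bytes, shards)

def recover_missing(shards_with_gap, parity):
    """Rebuild the one shard marked None from the survivors and the parity shard."""
    survivors = [s for s in shards_with_gap if s is not None]
    return reduce(xor_bytes, survivors, parity)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data shards
parity = make_parity(data)            # n = k + 1 = 4 shards in total

damaged = [data[0], None, data[2]]    # the disk holding shard 1 has failed
restored = recover_missing(damaged, parity)
print(restored == data[1])
```

A single parity shard only covers one failure; tolerating more requires additional, independently computed parity shards.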
  • 16. MINIO - ERASURE CODING.
  • 17. TOP 5: COST. • https://amzn.to/2Q7AWGo • S3: 23 USD per TB per month (12.5 USD per TB for cold access). • HDFS: Using d2.8xl instance types ($5.52/hr with 71% discount, 48 TB HDD), it costs 5.52 x 0.29 x 24 x 30 / 48 x 3 / 0.7 = $103/month for 1 TB of data. (Note that with reserved instances, it is possible to achieve a lower price on the d2 family.) • S3 is 5X cheaper than HDFS. • S3’s human cost is virtually zero, whereas it usually takes a team of Hadoop engineers or vendor support to maintain HDFS. Once we factor in human cost, S3 is 10X cheaper than HDFS clusters on EC2 with comparable capacity.
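The HDFS cost formula can be reproduced step by step. The numbers are the ones on the slide; reading the final 0.7 divisor as a disk utilization factor is an assumption, since the slide does not label it.

```python
HOURLY_PRICE_USD = 5.52     # d2.8xl list price per hour
PRICE_FACTOR = 0.29         # what remains after the 71% discount
HOURS_PER_MONTH = 24 * 30
DISK_TB_PER_NODE = 48
REPLICATION = 3
UTILIZATION = 0.7           # assumption: the slide's unlabeled 0.7 divisor

node_cost_per_month = HOURLY_PRICE_USD * PRICE_FACTOR * HOURS_PER_MONTH
hdfs_usd_per_tb = node_cost_per_month / DISK_TB_PER_NODE * REPLICATION / UTILIZATION

S3_USD_PER_TB = 23
ratio = hdfs_usd_per_tb / S3_USD_PER_TB

print(f"HDFS: ${hdfs_usd_per_tb:.0f} per TB per month")
print(f"HDFS / S3 cost ratio: {ratio:.1f}x")
```

The ratio comes out a little under 5, which the slide rounds to "5X cheaper".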
  • 18. TOP 5: ELASTICITY. • From Databricks: • S3 offers 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations’ in-house services. • The majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year. • With cross-AZ replication that automatically replicates across different data centers, S3’s availability and durability are far superior to HDFS’. • Hortonworks – Data Plane Services in 2019!
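Availability percentages translate into yearly downtime with one line of arithmetic. This sketch assumes a 365-day year; `downtime_hours_per_year` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 365 * 24

def downtime_hours_per_year(availability_percent):
    """Expected yearly downtime, in hours, for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

hdfs_downtime = downtime_hours_per_year(99.9)
s3_downtime = downtime_hours_per_year(99.99)

print(f"99.9%  -> {hdfs_downtime:.2f} hours per year")
print(f"99.99% -> {s3_downtime * 60:.0f} minutes per year")
```

So 99.9% corresponds to the roughly 9 hours quoted on the slide, while 99.99% is under an hour per year.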
  • 19. TOP 5: PERFORMANCE. • When using HDFS and getting perfect data locality, it is possible to get ~3 GB/s per node of local read throughput on some of the instance types (e.g. i2.8xl, roughly 90 MB/s per core). Spark DBIO, a cloud I/O optimization module, provides optimized connectors to S3 and can sustain ~600 MB/s read throughput on i2.8xl (roughly 20 MB/s per core). • That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, given that S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar.
  • 20. TOP 5: TRANSACTIONS. • hadoop fs -mkdir -p sample/a/b/c/ • Now you put the file into a/b/c • Buckets, not directories • In a Minio server instance, a single RESTful PUT request will create an object “a/b/c/data.txt” in “mybucket” without having to create “a/b/c” in advance • This happens because object stores support hierarchical naming and operations without the need for directories.
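The "buckets, not directories" point can be illustrated with a toy in-memory model: keys are flat strings, and "a/b/c/" is just a prefix that a listing can filter on. `ToyObjectStore` is a hypothetical class written for this sketch; it does not talk to Minio or any real server.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: buckets map flat keys to bytes."""

    def __init__(self):
        self._buckets = {}

    def put(self, bucket, key, data):
        # No directory has to exist first; the slashes are part of the key.
        self._buckets.setdefault(bucket, {})[key] = data

    def list(self, bucket, prefix=""):
        # "Directory" listings are just prefix filters over the flat key space.
        keys = self._buckets.get(bucket, {})
        return sorted(k for k in keys if k.startswith(prefix))

store = ToyObjectStore()
store.put("mybucket", "a/b/c/data.txt", b"hello")
listing = store.list("mybucket", prefix="a/b/")
print(listing)
```

Against a real S3-compatible endpoint the equivalent write is a single PUT of the key "a/b/c/data.txt", for example through an S3 client library's put-object call.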
  • 21. TOP 5: TRANSACTIONS. • Moving data is very interesting… • What happens if a Spark write (saveAsTextFile) fails for one partition? • In HDFS, rename is atomic, and it is the most critical part of the Hadoop write flow • Minio (or any object store) does not provide an atomic rename. In fact, rename should be avoided in object storage altogether, since it consists of two separate operations: copy and delete. • A normal COPY is mapped to a RESTful PUT request or RESTful COPY request and triggers internal data movements between storage nodes. The subsequent delete command maps to the RESTful DELETE request, but usually relies on the bucket listing operation to identify which data must be deleted. This makes a rename highly inefficient in object stores, and the lack of atomicity may leave data in a corrupted state.
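Why copy-then-delete is not atomic can be shown with a toy simulation in which the process dies between the two steps. The dictionary stands in for a bucket and the key names are made up for illustration; no real object store is involved.

```python
class CrashAfterCopy(Exception):
    """Simulated failure between the copy step and the delete step."""

def rename(bucket, src, dst, crash_after_copy=False):
    """Object-store style rename: a copy followed by a separate delete."""
    bucket[dst] = bucket[src]        # step 1: copy
    if crash_after_copy:
        raise CrashAfterCopy(src)    # the delete below never runs
    del bucket[src]                  # step 2: delete

bucket = {"_temporary/part-00000": b"rows"}
try:
    rename(bucket, "_temporary/part-00000", "output/part-00000",
           crash_after_copy=True)
except CrashAfterCopy:
    pass

leftover_keys = sorted(bucket)
print(leftover_keys)
```

After the simulated crash both keys exist, so a reader cannot tell a finished move from a half-finished one, which is exactly what an atomic rename on HDFS rules out.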
  • 22. TOP 5: TRANSACTIONS: PERFORMANCE. • Spark ships with two Hadoop commit algorithms: version 1, which moves staged task output files to their final locations at the end of the job, and version 2, which moves files as individual job tasks complete.
  • 23. TOP 5: TRANSACTIONS: PERFORMANCE. • Spark ships with two Hadoop commit algorithms: version 1, which moves staged task output files to their final locations at the end of the job, and version 2, which moves files as individual job tasks complete.
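The commit algorithm is selected through a Hadoop property, which Spark accepts with a `spark.hadoop.` prefix. The sketch below only builds the property; `committer_conf` is a hypothetical helper, and how the setting behaves on a given Spark and Hadoop release should be checked against that release's documentation.

```python
def committer_conf(version):
    """Spark property selecting Hadoop FileOutputCommitter algorithm 1 or 2."""
    if version not in (1, 2):
        raise ValueError("commit algorithm version must be 1 or 2")
    key = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"
    return {key: str(version)}

conf = committer_conf(2)
for key, value in conf.items():
    # Each pair can be passed as `--conf key=value` to spark-submit,
    # or to SparkSession.builder.config(key, value) when pyspark is available.
    print(f"--conf {key}={value}")
```

Version 2 shortens the final commit because files move as tasks finish, at the price of partial output being visible if the job later fails.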
  • 24. TOP 5: DATA INTEGRITY - ELEGANT SOLUTION FROM SPARK. • Version 2.1: https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
  • 25. SO WHAT WILL IT LOOK LIKE?
  • 26. COMPARISON.
      FEATURE | COST/TB/MONTH | AVAILABILITY | DURABILITY | WRITES | ELASTICITY
      MINIO | $ | 99.99% | 99.999999999% | DBIO | YES
      HDFS | $$ | 99.9% | 99.9999% (estimated) | YES | NO
      MINIO VS HDFS | 10x | 10x | 10x | COMPARABLE | MINIO IS ELASTIC
  • 27. MINIO. • High performance distributed object storage server • Simple, efficient and lightweight, with no learning curve
  • 28. DEMO TIME 1) MINIO INTEROPERABILITY WITH HADOOP – PUTTING AND GETTING DATA 2) MINIO INTEROPERABILITY WITH HIVE 3) MINIO WITH UNIFIED DATA ARCHITECTURE – PRESTO 4) MINIO WITH SPARK – FILES 5) MINIO WITH SPARK – OBJECTS 6) MINIO WITH SEARCH
  • 29. SUMMARY: EARN BY THIS ARCHITECTURE