Modern data lakes are now built on cloud storage, helping organizations leverage the scale and economics of object storage while simplifying the overall data storage and analysis workflow.
2. FIRST-ORDER LOGIC
10/29/18
2
First-order logic, also known as first-order predicate calculus and predicate logic, is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science.
Married("Harry", "Sally", "12-Dec-1995").
IsMotherOf("Sally", "Peter").
IsFatherOf("Harry", "Peter").
The relational model says that this is how you think about and represent all the data in your database.
There exists at least one X such that the marriage happened in 1995
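The facts and the existential query above can be sketched in a few lines of Python (the fact names and date format come from the slide; the helper function is illustrative):

```python
# Tiny fact base in the spirit of the Married/IsMotherOf/IsFatherOf predicates.
facts = [
    ("Married", ("Harry", "Sally", "12-Dec-1995")),
    ("IsMotherOf", ("Sally", "Peter")),
    ("IsFatherOf", ("Harry", "Peter")),
]

def exists_marriage_in(year):
    """Existential query: is there at least one Married fact dated in `year`?"""
    return any(
        pred == "Married" and args[-1].endswith(str(year))
        for pred, args in facts
    )

print(exists_marriage_in(1995))  # True
```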
5. CTO GETS HADOOP IN.
1. Scale out architecture : 2. Shared Nothing : 3. Compute + Storage together 4. Google like!!
6. ALL WENT SMOOTH UNTIL...
A zip file was sent from a third-party vendor containing one million JPEG files. We wrote a MapReduce program to process it.
The file is 8 GB, split into 128 MB blocks: about 63 blocks. With 3x replication, the total size is about 24 GB.
We executed the application. What might have happened?
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
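The block arithmetic on this slide can be checked directly (assuming decimal gigabytes, which matches the slide's "about 63 blocks"):

```python
# Check the slide's HDFS block math for the 8 GB file.
file_size_mb = 8000            # 8 GB, decimal
block_mb = 128                 # HDFS block size
replication = 3                # default HDFS replication factor

blocks = -(-file_size_mb // block_mb)         # ceiling division: 62.5 -> 63
total_gb = file_size_mb * replication / 1000  # raw capacity consumed
print(blocks, total_gb)  # 63 24.0
```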
8. THE SUBSEQUENT MONTHS.
1. We copied data from Netezza to HIVE
2. We created reports from Tableau with HIVE ODBC
3. We created a copy of HIVE into HBASE
4. We have HDP, but Cloudera supports Impala
5. MapReduce is slow
6. All data is not in one place
7. Maybe more tools are needed
8. We need a unified data architecture solution
9. Rebalancing took an entire week
10. Important file types are not splittable
11. 3 copies take too much space
12. Cost of maintenance is high
13. We may need to go to the cloud
14. SLA not met
15. Too much operational work
9. WHAT MAKES AN ORGANIZATION FAMOUS?
CTO wants AI but AI is different from IA !!
AI: autonomous systems which REPLACE the human cognitive thought process
IA: autonomous systems which SUPPORT the human cognitive thought process
(Diagram: learning finds the Algorithm from Input and Output; inference applies the Algorithm to Input to produce the Output)
Both need machine learning and deep learning; these are means to do AI or IA.
OUTPUT MAY BE NEEDED INSTANTLY, BUT LEARNING IT MAY TAKE HOURS/DAYS/MONTHS
11. FILE SYSTEMS.
• The problem is the file system. Traditional block-based file systems use
lookup tables to store file locations. They break each file up into small
blocks, generally 4k in size, and store the byte offset of each block in a
large table.
• This is fine for small volumes, but when you attempt to scale to the
petabyte range, these lookup tables become extremely large. It’s like
a database. The more rows you insert, the slower your queries
run. Eventually your performance degrades to the point where your
file system becomes unusable.
• When this happens, users are forced to split their data sets up into
multiple LUNs to maintain an acceptable level of performance. This
adds complexity and makes these systems difficult to manage.
12. BLOCK BASED STORAGE SYSTEMS.
• To solve this problem, some organizations are deploying scale-out file
systems, like HDFS. This fixes the scalability problem, but keeping these
systems up and running is a labor-intensive process.
• Scale-out file systems are complex and require constant
maintenance. In addition, most of them rely on replication to protect
your data. The standard configuration is triple-replication, where you
store 3 copies of every file.
• This requires an extra 200% of raw disk capacity for
overhead! Everyone thinks that they’re saving money by using
commodity drives, but by the time you store three full copies of your
data set, the cost savings disappears. When we’re talking about
petabyte-scale applications, this is an expensive approach.
13. SOLUTION TO STORAGE.
• Object stores achieve their scalability by decoupling file management
from the low-level block management. Each disk is formatted with a
standard local file system, like ext4. Then a set of object storage
services is layered on top of it, combining everything into a single,
unified volume.
• Files are stored as “objects” in the object store rather than files on a
file system. By offloading the low-level block management onto the
local file systems, the object store only has to keep track of the high-
level details.
• This layer of separation keeps the file lookup tables at a manageable
size, allowing you to scale to hundreds of petabytes without
experiencing degraded performance.
14. SOLUTION TO STORAGE.
• To maximize usable space, object stores use a technique called
Erasure Coding to protect your data. You can think of it as the next
generation of RAID.
• In an erasure coded volume, files are divided into shards, with each
shard being placed on a different disk. Additional shards are added,
containing error correction information, which provide protection from
data corruption and disk failures. Only a subset of the shards is
required to retrieve each file, which means it can survive multiple disk
failures without the risk of data loss.
• Erasure-coded volumes can survive more disk failures than RAID and
typically provide more than double the usable capacity of triple
replication, making them the ideal choice for petabyte-scale storage.
15. MINIO - ERASURE CODING.
• EC is based on a technology called Forward Error Correction (FEC),
developed more than 50 years ago (1940s, Richard Hamming). It was
originally used for controlling errors in data transmission over noisy
or unreliable telecommunication channels. Reed-Solomon codes are a
kind of EC, used widely in CDs/DVDs, Blu-ray, satellite communication, etc.
• A message of k symbols can be transformed into a longer code word of
n symbols such that the original message can be recovered from a
subset of the n symbols. If n = k + 1, this is the special case
called a parity check.
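The n = k + 1 parity-check special case mentioned above can be demonstrated with XOR parity (a toy sketch; MinIO itself uses Reed-Solomon codes, which tolerate more than one lost shard):

```python
from functools import reduce

def xor(a, b):
    """Byte-wise XOR of two equal-length shards."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(shards):
    """n = k + 1 erasure code: append one XOR parity shard."""
    return shards + [reduce(xor, shards)]

def reconstruct(shards, lost):
    """Recover a single lost shard by XOR-ing the k surviving shards."""
    return reduce(xor, [s for i, s in enumerate(shards) if i != lost])

data = [b"ABCD", b"EFGH", b"IJKL"]   # k = 3 data shards
coded = encode(data)                 # n = 4 shards, one per "disk"
print(reconstruct(coded, 1))         # b'EFGH': survives one disk failure
```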
17. TOP 5 : COST.
• https://amzn.to/2Q7AWGo
• S3: $23 per TB per month ($12.50 per TB for cold access).
• HDFS: Using d2.8xl instance types ($5.52/hr with 71% discount, 48TB
HDD), it costs 5.52 x 0.29 x 24 x 30 / 48 x 3 / 0.7 = $103/month for 1TB of
data. (Note that with reserved instances, it is possible to achieve a
lower price on the d2 family.)
• S3 is 5X cheaper than HDFS.
• S3’s human cost is virtually zero, whereas it usually takes a team of
Hadoop engineers or vendor support to maintain HDFS. Once we
factor in human cost, S3 is 10X cheaper than HDFS clusters on EC2 with
comparable capacity.
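The slide's HDFS cost estimate can be reproduced step by step (the instance price, discount, replication, and utilization figures are the slide's own):

```python
# Reproduce the d2.8xlarge HDFS cost estimate from the slide.
hourly_rate = 5.52        # on-demand $/hr for d2.8xlarge
discount_factor = 0.29    # 71% discount -> pay 29% of on-demand
hours_per_month = 24 * 30
raw_tb = 48               # local HDD capacity per instance
replication = 3           # HDFS triple replication
utilization = 0.7         # assume 70% usable disk utilization

cost_per_tb = (hourly_rate * discount_factor * hours_per_month
               / raw_tb * replication / utilization)
print(round(cost_per_tb))  # 103 ($/TB/month, vs. $23 for S3)
```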
18. TOP 5 : ELASTICITY.
• From Databricks:
• 99.999999999% durability and 99.99% availability. Note that this is
higher than the vast majority of organizations’ in-house services.
• The majority of Hadoop clusters have availability lower than 99.9%, i.e. at
least 9 hours of downtime per year.
• With cross-AZ replication that automatically replicates across different
data centers, S3's availability and durability are far superior to HDFS's.
• Hortonworks – Data Plane Services in 2019!
19. TOP 5 : PERFORMANCE.
• When using HDFS and getting perfect data locality, it is possible to get
~3GB/node local read throughput on some of the instance types (e.g.
i2.8xl, roughly 90MB/s per core). Spark DBIO, cloud I/O optimization
module, provides optimized connectors to S3 and can sustain
~600MB/s read throughput on i2.8xl (roughly 20MB/s per core).
• That is to say, on a per-node basis, HDFS can yield 6X higher read
throughput than S3. Thus, given that S3 is 10x cheaper than HDFS,
we find that S3 is almost 2x better than HDFS on performance
per dollar.
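The performance-per-dollar claim follows directly from the two ratios already on the slide:

```python
# Combine the slide's throughput and cost ratios.
throughput_ratio = 6      # HDFS per-node read throughput advantage (slide)
cost_ratio = 10           # S3 cost advantage once human cost is included (slide)

perf_per_dollar = cost_ratio / throughput_ratio
print(round(perf_per_dollar, 2))  # 1.67, i.e. "almost 2x" in S3's favor
```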
20. TOP 5 : TRANSACTIONS.
• hadoop fs -mkdir -p sample/a/b/c/
• Now you put the file into a/b/c
• Buckets…not directories
• In a Minio server instance, a single RESTful PUT request will create an
object “a/b/c/data.txt” in “mybucket” without having to create
“a/b/c” in advance
• This happens because object stores support hierarchical naming and
operations without the need for directories.
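The "buckets, not directories" point can be modeled with a flat key namespace (a plain dict stands in for a bucket; the helper names are illustrative):

```python
# An object store is a flat key -> value namespace; "directories" are just
# key prefixes, so no mkdir is needed before a PUT.
bucket = {}

def put_object(key, data):
    bucket[key] = data            # one operation, no parent directories created

def list_prefix(prefix):
    return sorted(k for k in bucket if k.startswith(prefix))

put_object("a/b/c/data.txt", b"hello")
print(list_prefix("a/b/"))  # ['a/b/c/data.txt']
```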
21. TOP 5 : TRANSACTIONS.
• Data movement is very interesting…
• What happens if a Spark write (saveAsTextFile) fails for a partition?
• In HDFS, rename is atomic: the most critical part of the Hadoop write flow
• Minio (or any object store) does not provide an atomic rename. In
fact, rename should be avoided in object storage altogether, since it
consists of two separate operations: copy and delete.
• A COPY maps to a RESTful PUT or COPY request and triggers internal
data movement between storage nodes. The subsequent delete maps to a
RESTful DELETE request, but usually relies on a bucket listing
operation to identify which data must be deleted. This makes rename
highly inefficient in object stores, and the lack of atomicity may
leave data in a corrupted state.
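The copy-then-delete decomposition of rename can be sketched the same way (a dict stands in for the bucket; a crash between the two steps leaves the store inconsistent):

```python
# "Rename" on an object store is two separate, non-atomic operations.
bucket = {"staging/part-00000": b"rows"}

def rename(src, dst):
    bucket[dst] = bucket[src]   # step 1: COPY (full data movement)
    del bucket[src]             # step 2: DELETE (a crash here leaves two copies)

rename("staging/part-00000", "final/part-00000")
print(sorted(bucket))  # ['final/part-00000']
```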
22. TOP 5 : TRANSACTIONS : PERFORMANCE.
• FileOutputCommitter version 1 moves staged task output files to their
final locations at the end of the job; version 2 moves files as
individual tasks complete.
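The committer version is selected with a Hadoop configuration property (from Spark it is typically passed with the spark.hadoop. prefix, e.g. via --conf on spark-submit); shown here as a bare config fragment:

```
mapreduce.fileoutputcommitter.algorithm.version=2
```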
24. TOP 5 : DATA INTEGRITY - ELEGANT SOLUTION FROM SPARK.
• Version 2.1: https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
28. DEMO TIME
1) MINIO INTEROPERABILITY WITH HADOOP – PUTTING AND GETTING DATA
2) MINIO INTEROPERABILITY WITH HIVE
3) MINIO WITH UNIFIED DATA ARCHITECTURE – PRESTO
4) MINIO WITH SPARK - FILES
5) MINIO WITH SPARK – OBJECTS
6) MINIO WITH SEARCH