1. Hadoop Enhancements Using Next-Gen Intel® Platform Technologies
Anoop Sam John – PMC member for Apache HBase
Rakesh R - Committer for Apache ZooKeeper and PMC member for Apache BookKeeper
3. About Us
• Anoop Sam John
• PMC member for Apache HBase and Phoenix
• anoopsamjohn@apache.org
• https://www.linkedin.com/in/anoopsamjohn
• Rakesh R
• Committer for Apache ZooKeeper and BookKeeper
• Apache Hadoop contributor
• rakeshr@apache.org
• https://www.linkedin.com/in/rakeshadr
4. Agenda
• Intel enhancements on the Hadoop platform
• HDFS
  - Erasure coding using the ISA-L library
  - Encryption using AES-NI
• HBase
  - Go Big Cache
5. HDFS – Distributed FileSystem
“Between the birth of the world and 2003, there were 5 exabytes of information created. We now create 5 exabytes every two days.”
– Eric Schmidt, Executive Chairman of Alphabet, Inc.
6. HDFS – Current Replication Strategy
• Inherits 3-way replication from the Google File System to increase data availability
  - 3x storage overhead
• Expensive for:
  - Massive amounts of data
  - Geo-distributed data recovery
[Diagram: DFSClient writes replicas r1, r2, r3 to datanodes across Rack-1 and Rack-2 – 3x replication]
7. HDFS – Erasure Coding
• k data blocks + m parity blocks (k + m)
  Example: Reed-Solomon 6 + 3
• Saves disk space
  - 1.5x storage overhead
Sample codec (XOR-based):
  X  Y | X⊕Y
  0  0 |  0
  0  1 |  1
  1  0 |  1
  1  1 |  0
  (data bits → parity bit)
[Diagram: blocks b1–b9 striped across datanodes D1–D9 – 6 data blocks + 3 parity blocks]
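The XOR truth table above is the simplest possible codec: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt by XOR-ing everything that survives. A minimal Python sketch of that idea (illustrative only; HDFS actually uses Reed-Solomon codes, accelerated by ISA-L):

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three data blocks and one XOR parity block.
data = [b"\x01\x02", b"\x0f\x00", b"\xff\x10"]
parity = xor_blocks(data)

# Simulate losing data[1]: rebuild it from the surviving blocks + parity.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```

Plain XOR tolerates only one failure per group; Reed-Solomon generalizes this so that any m of the k + m blocks may be lost, at the cost of heavier finite-field arithmetic, which is exactly what ISA-L accelerates.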
8. Durability & Efficiency
                     3-way replication   RS-(6,3) erasure coding
Data durability              2                      3
Storage efficiency     1/3 (33.33%)            6/9 (67%)
Data durability = how many simultaneous failures can be tolerated?
Storage efficiency = what portion of the storage holds useful data?
[Diagram: 3-way replication – Replica1, Replica2, Replica3 on Datanode1–3 (useful data + redundant data) vs. RS-(6,3) erasure coding – blocks b1–b9 laid out as 6 data blocks + 3 parity blocks]
• Released version – Apache Hadoop 3.0.0-alpha1
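The numbers in the table fall out of two small formulas; a quick sketch, using the slide's own definitions of durability and efficiency:

```python
def replication_stats(replicas):
    # N-way replication survives N-1 simultaneous failures;
    # only one copy out of N is useful data.
    failures_tolerated = replicas - 1
    storage_efficiency = 1 / replicas
    return failures_tolerated, storage_efficiency

def erasure_coding_stats(k, m):
    # RS-(k,m) survives any m failures among the k+m blocks;
    # k useful blocks out of k+m stored.
    failures_tolerated = m
    storage_efficiency = k / (k + m)
    return failures_tolerated, storage_efficiency

print(replication_stats(3))        # 2 failures tolerated, ~33% efficiency
print(erasure_coding_stats(6, 3))  # 3 failures tolerated, ~67% efficiency
```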
9. Microbenchmark: Codec Calculation
[Chart: codec throughput in MB/second – image courtesy Cloudera]
• New Intel architecture solution for storage: the Intel® Intelligent Storage Acceleration Library (ISA-L) provides optimized primitives for deploying EC with better performance.
https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version
11. HDFS – Encryption
• Data sensitivity and privacy management are very important for big data analytics
• Encryption is a regulatory requirement for many business sectors
  - Finance
  - Government
  - Healthcare, etc.
[Diagram: DFSClient performs per-file key operations against the KMS, and reads/writes encrypted data to the HDFS cluster; data at rest on disk is protected via an encryption library]
• Released version – Apache Hadoop 2.6.0
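The per-file key operations against the KMS follow an envelope-encryption pattern: the KMS holds the zone key, each file gets its own data key, and only an encrypted form of that data key (the EDEK) is stored with the file's metadata. A toy sketch of the flow; the function names and the XOR "cipher" are illustrative stand-ins, not the real HDFS/KMS API:

```python
import hashlib
import os

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Placeholder keystream cipher (real HDFS uses AES); XOR with a
    # key-derived pad, so the same function also decrypts.
    pad = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(d ^ p for d, p in zip(data, pad))

toy_decrypt = toy_encrypt  # XOR is its own inverse

zone_key = os.urandom(16)          # lives only inside the KMS

# KMS side: generate a per-file data key, hand back only the wrapped form.
dek = os.urandom(16)
edek = toy_encrypt(zone_key, dek)  # stored in the file's metadata

# Client side: ask the KMS to unwrap the EDEK, then use the DEK locally
# to encrypt/decrypt the file's contents.
unwrapped = toy_decrypt(zone_key, edek)
assert unwrapped == dek
```

The point of the indirection is that the NameNode and datanodes only ever see encrypted keys and encrypted data; the plaintext DEK exists only at the client and the KMS.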
12. Encryption Algorithm
• Data encryption/decryption is computationally expensive
• Encryption ciphers:
  - AES-CTR (Advanced Encryption Standard – Counter Mode) is the most popular
  - Uses 128-, 192-, or 256-bit keys
13. Encryption – AES-CTR
• Two implementations of AES-CTR:
  1. JCE (Java Cryptography Extension) software implementation
  2. OpenSSL hardware-accelerated AES-NI (Intel® Advanced Encryption Standard New Instructions) implementation
• AES-NI is available in Westmere (2010) and newer Intel CPUs
• AES-NI was further optimized in Haswell (2013)
https://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni
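CTR mode turns a block cipher into a stream cipher: each block's keystream depends only on (key, nonce, counter), so blocks can be computed independently and in parallel, and decryption is the same XOR as encryption. A stdlib-only sketch of that structure, with SHA-256 as an explicit stand-in for the AES block function (real deployments use AES via JCE or OpenSSL/AES-NI):

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for AES-encrypting (nonce || counter); real CTR uses AES here.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()[:BLOCK]

def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Encryption and decryption are the same operation in CTR mode:
    # XOR each 16-byte chunk with its independently computed keystream block.
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, i // BLOCK)
        out.extend(b ^ k for b, k in zip(data[i:i + BLOCK], ks))
    return bytes(out)

key, nonce = b"0" * 16, b"unique-nonce"
ct = ctr_xcrypt(key, nonce, b"HDFS encrypts data at rest")
assert ctr_xcrypt(key, nonce, ct) == b"HDFS encrypts data at rest"
```

The counter-per-block structure is also why CTR suits HDFS: a reader can decrypt any byte range of a file by seeking straight to the right counter value, without decrypting everything before it.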
14. Microbenchmark: Encrypt/Decrypt a 1 GB Byte Array
Test environment:
• Run locally on a single Haswell machine
• Single-threaded; excludes HDFS overheads (checksumming, network, copies)
[Chart: image courtesy Cloudera]
15. Apache Commons Crypto
• The cryptographic layer was spun out as a new Apache Commons component
  http://commons.apache.org/proper/commons-crypto/
• Apache Commons Crypto has also been integrated with Apache Spark for shuffle encryption
17. HBase
• NoSQL database in the Hadoop ecosystem
• Accumulates writes in memory and flushes them to HDFS
• Caches data for:
  - Reduced read latency
  - Better read throughput
• Memory-hungry processes
18. Big Memory
• Hadoop platforms are no longer only for commodity hardware
• Systems are moving toward faster CPUs and bigger memories
  Big Data => Big Storage + Big Memory
• Non-volatile memory technology
  - 3D XPoint™ DIMMs from Intel®
  - Higher memory capacity
  - Lower cost vs. DDR
19. HBase – Go Big Cache
[Diagram: client reads served from a data cache in JVM off-heap memory, backed by HDFS]
• JVM GC tuning continues to be a challenge with larger heaps (even with newer GC algorithms)
• A much bigger cache in off-heap memory enables faster random reads
• More predictable latency
• Building block for supporting 3D XPoint™ products
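The core idea of an off-heap block cache can be sketched in a few lines: cached blocks live as raw bytes inside one big pre-allocated memory region, and only small (offset, length) index entries live on the managed heap, so the bulk of the cached data is invisible to the garbage collector. A simplified, append-only model of that idea (assumption: this mirrors the concept, not HBase's actual BucketCache implementation, which also handles eviction and free-space management):

```python
import mmap

class OffHeapBlockCache:
    """Toy off-heap-style cache: data in one anonymous mmap region,
    only tiny index entries on the managed heap."""

    def __init__(self, capacity: int):
        self.buf = mmap.mmap(-1, capacity)   # anonymous memory region
        self.index = {}                      # block_id -> (offset, length)
        self.write_pos = 0

    def put(self, block_id, data: bytes):
        if self.write_pos + len(data) > len(self.buf):
            raise MemoryError("cache full (a real cache would evict)")
        self.buf[self.write_pos:self.write_pos + len(data)] = data
        self.index[block_id] = (self.write_pos, len(data))
        self.write_pos += len(data)

    def get(self, block_id):
        off, length = self.index[block_id]
        return self.buf[off:off + length]

cache = OffHeapBlockCache(1 << 20)          # 1 MiB region
cache.put("hfile-block-1", b"row data...")
assert cache.get("hfile-block-1") == b"row data..."
```

Because the region is allocated once and never churned object-by-object, GC pauses stop scaling with cache size, which is what makes much larger caches (and eventually NVM-backed ones) practical.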
20. HBase – Go Big Cache
[Charts: performance before and after off-heaping – images courtesy Alibaba Inc.]
• Alibaba adopted this feature on their 1600-node cluster
• Used during the "Double 11" online sale