1. HopsFS: 10X your HDFS with NDB
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
Oracle, Stockholm, 6th September 2016
www.hops.io
@hopshadoop
2. Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Ermias Gebremeskel, Antonios Kouzoupis.
Alumni: Vasileios Giannokostas, Misganu Dessalegn,
Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
K “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
3. Marketing 101: Celebrity Endorsements
Hi! I'm Leslie Lamport* and even though you're not using Paxos, I approve this product.
*Turing Award Winner 2014, Father of Distributed Systems
9. Max Pause times for NameNode Heap Sizes*
[Chart: max pause time (ms), log scale 10–10,000, vs. JVM heap size (GB): 50, 75, 100, 150]
*OpenJDK or Oracle JVM
10. NameNode and Decreasing Memory Costs
[Chart: Size (GB), 0–1,000, by Year, 2016–2020; two trend lines: the top line (available memory) grows with Moore's Law, the bottom line (manageable JVM heap) with improvements in GC technology]
11. Externalizing the NameNode State
•Problem: the NameNode cannot scale up, even as RAM prices fall (GC pause times limit the JVM heap)
•Solution: move the metadata off the JVM heap
•Move it where? An in-memory storage system that can be efficiently queried and managed. Preferably open source.
•MySQL Cluster (NDB)
13. Pluggable DBs: Data Abstraction Layer (DAL)
[Diagram: NameNode (Apache v2) → DAL API (Apache v2) → NDB-DAL-Impl (GPL v2) or Other DB (Other License); shipped as hops-2.5.0.jar and dal-ndb-2.5.0-7.5.3.jar]
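To make the license split concrete, here is a minimal sketch of what such a pluggable storage layer can look like. All names are hypothetical illustrations, not the actual HopsFS DAL API; the NameNode would load the concrete StorageFactory by class name from the separate GPL jar at runtime.

import java.util.Collection;
import java.util.Properties;

// Hypothetical sketch of a pluggable Data Abstraction Layer (DAL).
// The Apache-licensed NameNode codes only against these interfaces;
// the GPL'd NDB implementation lives in a separate jar, keeping the
// license boundary clean.
class StorageException extends Exception {
  StorageException(String msg) { super(msg); }
}

interface StorageTransaction {
  void commit() throws StorageException;
  void rollback() throws StorageException;
}

// One data-access object per metadata table (inodes, blocks, leases, ...).
interface EntityDataAccess<T, K> {
  T findByPrimaryKey(K key) throws StorageException;
  void prepare(Collection<T> modified, Collection<T> removed) throws StorageException;
}

interface StorageFactory {
  void setConfiguration(Properties conf) throws StorageException; // e.g., NDB connect string
  <T, K> EntityDataAccess<T, K> getDataAccess(Class<T> entityType);
  StorageTransaction beginTransaction() throws StorageException;
}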
17. Concurrency Model: Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
18. Preventing Deadlock and Starvation
•Acquire FS locks in agreed order using FS Hierarchy.
•Block-level operations follow the same agreed order.
•No cycles => Freedom from deadlock
•Pessimistic Concurrency Control ensures progress
[Diagram: a Client (mv /user/jim/myFile), a NameNode (read), and a DataNode (block_report) all acquire locks in the same agreed order]
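A minimal sketch of ordered lock acquisition, with hypothetical names. In-process read/write locks stand in for the row-level locks that are actually taken inside an NDB transaction:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: acquire per-inode locks from the root down to the leaf.
// Because every operation takes locks in this same global order (the
// FS hierarchy), no waits-for cycle can form, so no deadlock.
class PathLocker {
  private final Map<Long, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReentrantReadWriteLock lockFor(long inodeId) {
    return locks.computeIfAbsent(inodeId, id -> new ReentrantReadWriteLock());
  }

  // inodeIds must be ordered root -> leaf; exclusive lock only on the target.
  void lockPath(List<Long> inodeIds, boolean writeTarget) {
    for (int i = 0; i < inodeIds.size(); i++) {
      ReentrantReadWriteLock lock = lockFor(inodeIds.get(i));
      if (writeTarget && i == inodeIds.size() - 1) {
        lock.writeLock().lock();   // exclusive lock on the operation's target
      } else {
        lock.readLock().lock();    // shared locks on all ancestors
      }
    }
  }
}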
19. Per Transaction Cache
•Reusing the HDFS codebase resulted in too many
roundtrips to the database per transaction.
•We cache intermediate transaction results at
NameNodes (i.e., snapshot).
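A minimal sketch of the per-transaction snapshot, with hypothetical names: the first read of each row costs one database round-trip; every later read of the same row within the same transaction is served from memory.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a per-transaction snapshot cache. Reads within one
// transaction hit the database at most once per key; commit logic
// (writing back dirty rows) is elided.
class TransactionContext<K, V> {
  private final Map<K, V> snapshot = new HashMap<>();
  private final Function<K, V> dbRead;  // one round-trip to NDB

  TransactionContext(Function<K, V> dbRead) { this.dbRead = dbRead; }

  V find(K key) {
    return snapshot.computeIfAbsent(key, dbRead);  // cached after first read
  }
}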
20. Sometimes, Transactions Just ain’t Enough
•Large Subtree Operations (delete, mv, set-quota)
can’t always be executed in a single Transaction.
•4-phase Protocol
• Isolation and Consistency
• Aggressive batching
• Transparent failure handling
• Failed ops retried on new NN.
• Lease timeout for failed clients.
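A sketch of the shape of such a subtree operation, with the phases paraphrased from the slide and all helper names hypothetical:

import java.util.List;

// Sketch of the 4-phase subtree protocol: execute a large operation
// (e.g., delete) as many small transactions instead of one unbounded one.
abstract class SubtreeOperation {
  abstract void markSubtreeLocked(long rootInodeId);        // 1. block new ops entering
  abstract void waitForActiveOps(long rootInodeId);         // 2. quiesce in-flight ops
  abstract List<List<Long>> bottomUpBatches(long rootInodeId);
  abstract void deleteBatchInTransaction(List<Long> batch); // one short tx per batch
  abstract void unlockSubtree(long rootInodeId);            // 4. release

  final void delete(long rootInodeId) {
    markSubtreeLocked(rootInodeId);
    waitForActiveOps(rootInodeId);
    for (List<Long> batch : bottomUpBatches(rootInodeId)) {
      deleteBatchInTransaction(batch);                      // 3. aggressive batching
    }
    unlockSubtree(rootInodeId);
    // If the NameNode fails mid-operation, the op is retried on a new NN;
    // the subtree-lock flag makes the partial state detectable.
  }
}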
21. Leader Election using NDB
•Leader to coordinate replication/lease management
•NDB as shared memory for Leader Election of NN.
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
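A minimal sketch of database-backed leader election, with a hypothetical table and wall-clock leases for readability (the published protocol uses monotonic counters rather than clocks, to avoid clock-synchronization assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: NDB as shared memory for leader election. Every NameNode
// heartbeats by updating its own row inside a transaction; the live
// NameNode with the smallest id is the leader.
class LeaderElection {
  boolean heartbeatAndCheckLeader(Connection conn, long myNnId) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement hb = conn.prepareStatement(
        "UPDATE nn_liveness SET last_hb = NOW() WHERE nn_id = ?")) {
      hb.setLong(1, myNnId);
      hb.executeUpdate();
    }
    long leaderId;
    try (PreparedStatement q = conn.prepareStatement(
        "SELECT MIN(nn_id) FROM nn_liveness " +
        "WHERE last_hb > NOW() - INTERVAL 10 SECOND");
         ResultSet rs = q.executeQuery()) {
      rs.next();
      leaderId = rs.getLong(1);
    }
    conn.commit();
    return leaderId == myNnId;  // leader = smallest id among live NameNodes
  }
}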
22. Path Component Caching
•The most common operation in HDFS is resolving
pathnames to inodes
- 67% of operations in Spotify’s Hadoop workload
•We cache recently resolved inodes at NameNodes so
that we can resolve them using a single batch
primary key lookup.
- We validate cache entries as part of transactions
- On a hit, the cache converts the O(N) round trips to the database for all inodes in a path into O(1).
23. Path Component Caching
•Resolving a path of length N gives O(N) round-trips
•With our cache, O(1) round-trip for a cache hit
[Diagram: without the cache, a NameNode resolves /user/jim/myFile with three sequential round-trips to NDB: getInode(0, "user"), getInode(1, "jim"), getInode(2, "myFile"); with the cache, getInodes("/user/jim/myFile") is answered locally and checked with a single batched validateInodes([(0, "user"), (1, "jim"), (2, "myFile")]) call]
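A minimal sketch of the cache, with hypothetical names: on a full hit it produces all (parentId, name) primary keys for the path, which can then be validated inside the transaction with one batched primary-key read.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the path-component cache. Inodes are keyed by
// (parentId, name); a full hit replaces N sequential getInode
// round-trips with ONE batched primary-key read.
class PathComponentCache {
  record InodeKey(long parentId, String name) {}

  private static final long ROOT_ID = 0;  // root inode id (assumption)
  private final Map<InodeKey, Long> resolved = new ConcurrentHashMap<>();

  void put(long parentId, String name, long inodeId) {
    resolved.put(new InodeKey(parentId, name), inodeId);
  }

  // Returns the N primary keys for e.g. "/user/jim/myFile", or null on
  // a miss (caller falls back to per-component database reads).
  List<InodeKey> keysFor(String path) {
    List<InodeKey> keys = new ArrayList<>();
    long parentId = ROOT_ID;
    for (String name : path.substring(1).split("/")) {
      InodeKey key = new InodeKey(parentId, name);
      Long inodeId = resolved.get(key);
      if (inodeId == null) return null;  // cache miss somewhere on the path
      keys.add(key);
      parentId = inodeId;                // child becomes next level's parent
    }
    return keys;  // full hit: validate all keys with one batched PK lookup
  }
}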
24. Hotspots
•Mikael saw 1-2 maxed-out LDM (Local Data Manager) threads
•Partitioning by parent inodeId meant
fantastic performance for ‘ls’
- Partition-pruned index scans
- At high load hotspots appeared at the
top of the directory hierarchy
•Current Solution:
- Cache the root inode at NameNodes
- Pseudo-random partition key for top-level
directories, but keep partition by parent
inodeId at lower levels
- At least 4x throughput increase!
[Diagram: directory tree with / at the root; /Users and /Projects below it; then /NSA and /MyProj; then /Dataset1 and /Dataset2]
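A minimal sketch of that partition-key rule, with hypothetical names and an arbitrary cutoff depth; the slide also notes that the root inode is simply cached at every NameNode:

// Sketch: top-level directories get a pseudo-random partition key so
// the hot top of the namespace spreads across all NDB data nodes, while
// deeper inodes keep partition-by-parent so 'ls' remains a
// partition-pruned index scan over a single partition.
class InodePartitioning {
  static final int TOP_LEVELS = 2;  // "/" and its immediate children (assumption)

  static long partitionKeyFor(int depth, long inodeId, long parentId) {
    if (depth <= TOP_LEVELS) {
      return scramble(inodeId);     // spread hot top-level directories
    }
    return parentId;                // co-locate siblings for fast 'ls'
  }

  // Cheap stateless 64-bit mixer (splitmix64's finalizer).
  static long scramble(long x) {
    x ^= (x >>> 33);
    x *= 0xff51afd7ed558ccdL;
    return x ^ (x >>> 33);
  }
}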
25. Scalable Block Reporting
•On 100PB+ clusters, internal maintenance protocol
traffic makes up much of the network traffic
•Block Reporting
- Leader Load Balances
- Work-steal when exiting
safe-mode
[Diagram: DataNodes send block reports to NameNodes; Blocks and SafeBlocks queues live in NDB; the Leader NameNode load-balances the work, and NameNodes work-steal from each other]
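A minimal sketch of leader-assigned queues with work stealing, with hypothetical names (the real protocol runs across NameNodes with the queues in NDB, not in one process):

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: the leader routes each incoming DataNode block report to the
// least-loaded NameNode queue; a NameNode that drains its own queue
// (e.g., while the cluster exits safe mode) steals from its peers.
class BlockReportBalancer {
  record BlockReport(long dataNodeId, long[] blockIds) {}

  private final List<Queue<BlockReport>> queues = new ArrayList<>();

  BlockReportBalancer(int numNameNodes) {
    for (int i = 0; i < numNameNodes; i++) {
      queues.add(new ConcurrentLinkedQueue<>());
    }
  }

  // Leader: route an incoming report to the shortest queue.
  void assign(BlockReport report) {
    Queue<BlockReport> best = queues.get(0);
    for (Queue<BlockReport> q : queues) {
      if (q.size() < best.size()) best = q;
    }
    best.add(report);
  }

  // NameNode nn: process local work first, otherwise work-steal.
  BlockReport next(int nn) {
    BlockReport r = queues.get(nn).poll();
    if (r != null) return r;
    for (Queue<BlockReport> q : queues) {
      r = q.poll();               // steal from a busy peer
      if (r != null) return r;
    }
    return null;                  // nothing left anywhere
  }
}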
33. NDB Performance Lessons
•NDB is quite stable!
•ClusterJ is (nearly) good enough
- sun.misc.Cleaner has trouble keeping up at high
throughput – OOM for ByteBuffers
- Transaction hint behavior not respected
- DTO creation time affected by Java Reflection
- Nice features would be:
• Projections
• Batched scan operations support
• Event API
•Event API and Asynchronous API needed for
performance in Hops-YARN
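For context, a minimal ClusterJ usage sketch showing the transaction (partition-key) hint the slide refers to; the table, columns, and connect string here are hypothetical:

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

// Sketch: a ClusterJ domain interface maps to an NDB table; sessions
// read rows by (composite) primary key. setPartitionKey is the hint
// whose behavior the slide says was not always respected.
public class ClusterJExample {

  @PersistenceCapable(table = "inodes")
  public interface InodeDTO {
    @PrimaryKey
    long getParentId();
    void setParentId(long id);

    @PrimaryKey
    String getName();
    void setName(String name);

    long getId();
    void setId(long id);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("com.mysql.clusterj.connectstring", "mgmhost:1186");
    props.setProperty("com.mysql.clusterj.database", "hops");
    SessionFactory factory = ClusterJHelper.getSessionFactory(props);
    Session session = factory.getSession();

    session.currentTransaction().begin();
    // Hint which NDB node should coordinate this transaction, so the
    // primary-key read is (ideally) a single network hop.
    Object[] pk = new Object[] { 2L, "myFile" };
    session.setPartitionKey(InodeDTO.class, pk);
    InodeDTO inode = session.find(InodeDTO.class, pk);
    session.currentTransaction().commit();

    System.out.println(inode == null ? "not found" : "inode id: " + inode.getId());
    session.close();
  }
}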
34. Heterogeneous Storage in HopsFS
•Storage Types in HopsFS: Default, EC-RAID5, SSD
- Default: 3X overhead - triple replication on spinning disks
- SSD: 3X overhead - triple replication on SSDs
- EC-RAID5: 1.4X overhead with low reconstruction overhead!
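The 1.4X figure is consistent with, for example, a 5-data-block + 2-parity-block stripe (the exact encoding is not stated on the slide): storing 7 blocks for every 5 blocks of data gives (5 + 2) / 5 = 1.4X storage overhead, versus 3X for triple replication, and any single lost block is reconstructed from the surviving blocks of its stripe.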
37. ePipe: Indexing HopsFS’ Namespace
[Diagram: MetaData Designer and MetaData Entry records in NDB are streamed via the NDB Event API into ElasticSearch for free-text search – polyglot persistence]
•The Distributed Database is the Single Source of Truth.
•Foreign keys ensure the integrity of Extended Metadata.
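A sketch of the ElasticSearch side of that pipeline, with a hypothetical index layout and document shape. Note the real ePipe consumes changes through the native C++ NDB Event API, which (per the lessons slide) ClusterJ does not expose:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: each metadata mutation streamed out of NDB is mirrored into
// ElasticSearch, so the search index never becomes a second source of
// truth; the database row remains authoritative.
class MetadataIndexer {
  private final HttpClient http = HttpClient.newHttpClient();

  void index(long inodeId, String jsonDoc) throws Exception {
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("http://elastic:9200/metadata/_doc/" + inodeId))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(jsonDoc))
        .build();
    http.send(req, HttpResponse.BodyHandlers.ofString());
  }
}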
40. ResourceManager – Monolithic but Modular
[Diagram: the ResourceManager's modules (ApplicationMaster Service, ResourceTracker Service, Scheduler, Client Service, Admin Service, Security) serve YARN Clients, App Masters, and NodeManagers; HopsResourceTracker and HopsScheduler persist Cluster State to NDB via ClusterJ and the Event API, labelled ~2k ops/s and ~10k ops/s]
43. Hopsworks – Project-Based Multi-Tenancy
•A project is a collection of
- Users with Roles
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•Per-Project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style Pricing
•Sharing across Projects
- Datasets/Topics
[Diagram: a project contains datasets (dataset 1 … dataset N) in HDFS and topics (Topic 1 … Topic N) in Kafka]
47. Summary
•HopsFS is the world’s fastest, most scalable HDFS
implementation
•Powered by NDB, the world’s fastest database
•Thanks to Mikael, Craig, Frazer, Bernt and others
•Still room for improvement….
www.hops.io
I am going to talk about realizing Bill Gates' vision for a filesystem in the Hadoop ecosystem.
“WinFS was an attempt to bring the benefits of schema and relational databases to the Windows file system. …The WinFS effort was started around 1999 as the successor to the planned storage layer of Cairo and died in 2006 after consuming many thousands of hours of efforts from really smart engineers.”
[Brian Welcker]**
**http://blogs.msdn.com/b/bwelcker/archive/2013/02/11/the-vision-thing.aspx
The kinds of challenges you have with the NN are managing large clusters and configuring the NN.
Slope of the bottom line is based on improvements in garbage collection technology – Azul JVM, Shenandoah, etc.
Slope of the top line is based on Moore’s Law.
Apache Spark already moving in this direction – Tachyon
The NameNode has multi-reader, single writer concurrency semantics.
Operations that would hold the write lock for too long, starving clients, are not executed atomically. For example, deleting a directory subtree with millions of files, involves deleting batches of files, yielding the global lock for a period, then re-acquiring it, to continue the operation.
With global lock, it’s easy.
If something is not atomic, you have to handle all possible failures.
The only new Protocol Buffer message we added to DNs.
Reconstruction reads are expensive.
The Resource Manager (RM) is a bottleneck.
Zookeeper throughput not high enough to persist all RM state
Standby resource manager can only recover partial state
All running jobs must be restarted.
RM state not queryable.
The RM is a State-Machine. Almost no session state to manage.
Privileges – upload/download data, run analysis jobs
Like an RBAC solution.
All access via HopsWorks.