Slide 4
Big Data: Distributed FileSystems
Volume, Variety, Velocity: you can't have big data without a scalable filesystem
[Image: http://www.lbisoftware.com/blog/wp-content/uploads/2013/06/data_mountain1.jpg]
Slide 6
HDFS Architectural Flaws
● Created for storing crawled web-page data
● Files cannot be modified once written/closed.
– Write-once, append-only (see the sketch below)
● Files cannot be read before they are closed.
– Must batch-load data
● NameNode stores (in memory)
– Directory/file tree, file->block mapping
– Block replica locations
● NameNode only scales to ~100 Million files
– Some users run jobs to concatenate small files
● Written in Java; long GC pauses stall the NameNode.
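A minimal sketch of the write-once constraint, using the standard Hadoop FileSystem API (the path and record contents are illustrative):

// Sketch: HDFS output streams are write-once/append-only; there is no
// way to seek back and overwrite bytes in an existing file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/crawl/part-00000"); // illustrative path

        FSDataOutputStream out = fs.create(p);
        out.write("record 1\n".getBytes("UTF-8"));
        out.close(); // readers only see the data after close

        // Appending to the end is the only allowed mutation
        // (and only where append support is enabled):
        FSDataOutputStream app = fs.append(p);
        app.write("record 2\n".getBytes("UTF-8"));
        app.close();

        // FSDataOutputStream offers no random-write seek, so an in-place
        // update means rewriting the whole file.
    }
}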
Slide 7
Solution: MapR FileSystem
● Visionary CTO/Co-Founder: M.C. Srivas
– Ran Google search infrastructure team
– Chief Storage Architect at Spinnaker Networks
● Take a step back: what kind of DFS does a Hadoop-style distributed computer need?
– Easy, Scalable, Reliable
● Want traditional apps to work with DFS
– Support random read/write (see the sketch below)
– Standard FS interface (NFS)
● HDFS compatible
– Drop-in replacement, no recompile
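Because the file system is exported over NFS, ordinary programs can do random reads and writes with plain java.io; a minimal sketch, assuming an NFS mount at the hypothetical path /mapr/my.cluster.com:

import java.io.RandomAccessFile;

public class RandomWriteViaNfs {
    public static void main(String[] args) throws Exception {
        // Any POSIX-style path under the NFS mount works; no Hadoop client needed.
        try (RandomAccessFile f =
                 new RandomAccessFile("/mapr/my.cluster.com/data/log.bin", "rw")) {
            f.seek(4096);                  // jump into the middle of the file
            f.write("patch".getBytes());   // overwrite in place: impossible on HDFS
            f.seek(0);
            byte[] head = new byte[4];
            f.readFully(head);             // read back without closing first
        }
    }
}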
Slide 9
Easy: MapR Volumes
A volume groups related files/directories into a single tree structure so they can be easily organized, managed, and secured.
● Replication factor
● Scheduled snapshots, mirroring
● Data placement control
– By device type, rack, or geographic location
● Quotas and usage tracking
● Administrative permissions
100K+ volumes are okay
Slide 10
Scalable: Containers
Files/directories are sharded into blocks, which are placed into mini-NameNodes (containers) on disks.
● Containers are 16-32 GB disk segments, placed on nodes
● Each container holds directories & files and data blocks
● Containers are replicated across servers
● No need to manage containers directly; use MapR Volumes
11. 11
CLDB
Scalable: Container Location DB
N1, N2
N3, N2
N1, N2
N1, N3
N3, N2
N1
N2
N3
Container location
database (CLDB) keeps
track of nodes hosting
each container and
replication chain order
Each container has a replication chain
Updates are transactional
Failures are handled by rearranging replication
Clients cache container locations
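A toy model of the lookup path (not MapR's actual code; names are illustrative): the CLDB maps each container ID to its replication chain, and clients cache the answer so most I/O never touches the CLDB:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CldbClient {
    // CLDB side: container ID -> ordered replication chain, e.g. [N1, N2].
    static final Map<Long, List<String>> CLDB = Map.of(
        1L, List.of("N1", "N2"),
        2L, List.of("N3", "N2"));

    // Client side: cached locations, consulted first; the CLDB is asked
    // only on a cache miss (or, in a real system, on a stale entry).
    private final Map<Long, List<String>> cache = new ConcurrentHashMap<>();

    List<String> locate(long containerId) {
        return cache.computeIfAbsent(containerId, CLDB::get);
    }

    String writeTarget(long containerId) {
        return locate(containerId).get(0); // writes go to the head of the chain
    }
}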
Slide 12
Scalability Statistics
● Containers represent 16-32 GB of data
● Each can hold up to 1 billion files and directories
● 100M containers ≈ 2 exabytes, a very large cluster (arithmetic checked below)
● 250 bytes of DRAM to cache a container
– 25 GB to cache all containers for a 2 EB cluster
– But caching everything isn't necessary; entries can page to disk
– A typical large 10 PB cluster needs 2 GB
● Container reports are 100x-1000x smaller than HDFS block reports
– So the CLDB can serve 100x more data nodes
● Increase container size to 64 GB to serve a 4 EB cluster
● MapReduce performance is not affected
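A quick back-of-envelope check of these numbers (container size and per-entry DRAM cost are taken from the slide itself):

public class CldbSizing {
    public static void main(String[] args) {
        long containerBytes = 20L << 30;     // ~20 GB, midpoint of 16-32 GB
        long containers = 100_000_000L;      // 100M containers
        long bytesPerEntry = 250;            // DRAM to cache one container

        double clusterEB = containers * (double) containerBytes / 1e18;
        double cacheGB   = containers * (double) bytesPerEntry / 1e9;
        System.out.printf("cluster ~= %.1f EB, CLDB cache ~= %.1f GB%n",
                          clusterEB, cacheGB);
        // prints: cluster ~= 2.1 EB, CLDB cache ~= 25.0 GB
    }
}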
Slide 14
Reliable: CLDB High Availability
● As easy as installing CLDB role on more nodes
– Writes go to CLDB master, replicated to slaves
– CLDB slaves can serve reads
● Distributed container metadata, so CLDB only
stores/recovers container locations
– Instant restart (<2 seconds); no single point of failure
● Shared-nothing architecture
● NFS is multinode-HA as well
Slide 15
vs. Federated NN, NN HA
● Federated NameNodes
– Statically partition namespaces (like Volumes)
– Need additional NN (plus a standby) for each namespace
– Federated NN only in Hadoop-2.x (beta)
● NameNode HA
– NameNode responsible for both fs-namespace (metadata) info and block
locations; more data to checkpoint/recover.
– Starting a standby NN from a cold state can take tens of minutes for metadata and an hour for block locations, so a hot standby is needed.
– Metadata state
● All name space edits logged to shared (NFS/NAS) R/W storage, which must
also be HA; Standby polls edit log for changes.
● Or use Quorum Journal Manager, separate service/nodes
– Block locations
● Data nodes send block reports, location updates, heartbeats to both NNs
Slide 16
Reliable: Consistent Snapshots
● Automatic de-duplication
● Saves space by sharing blocks (modeled below)
● Lightning fast
● Zero performance loss on writes to the original
● Scheduled or on-demand
● Easy recovery with drag and drop
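A toy model of why snapshots are fast and space-efficient: a snapshot copies only the block pointers, so unchanged data blocks are shared until a write replaces them (illustrative, not MapR's on-disk format):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotModel {
    private final Map<Long, byte[]> live = new HashMap<>();        // blockNo -> data
    private final List<Map<Long, byte[]>> snapshots = new ArrayList<>();

    void snapshot() {
        // O(number of pointers), not O(data): the blocks themselves are shared.
        snapshots.add(new HashMap<>(live));
    }

    void write(long blockNo, byte[] data) {
        live.put(blockNo, data); // installs a new block; snapshots keep the old one
    }
}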
Slide 20
Fast: Direct Shuffle
● Apache Shuffle
– Write map outputs/spills to the local file system
– Merge the partitions of a map output into one file, with an index into it
– Reducers request partitions from mappers' HTTP servlets
● MapR Direct Shuffle (sketched below)
– Write to a local volume in MapR FS (rebalancing)
– One map-output file per reducer (no index file)
– Send shuffleRootFid with MapTaskCompletion on heartbeat
– Direct RPC from reducer to mapper using the Fid
– The copy is just a file-system copy; no HTTP overhead
– More copy threads, wider merges
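A sketch of the direct-shuffle idea (not MapR's internals; the path layout is assumed): because map output lives in a shared distributed FS, a reducer can read its partition with a plain file read instead of an HTTP fetch, and one file per reducer removes the index lookup:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DirectShuffleSketch {
    // shuffleRoot plays the role of the shuffleRootFid the mapper advertises.
    static InputStream openMapOutput(String shuffleRoot, int mapId, int reduceId)
            throws IOException {
        Path p = Paths.get(shuffleRoot,
                String.format("map_%05d/reduce_%05d.out", mapId, reduceId));
        return Files.newInputStream(p); // a plain FS read replaces the HTTP servlet
    }
}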
Slide 21
Fast: Express Lane
● Long-running jobs shouldn't hog all the slots in the
cluster and starve small, fast jobs (e.g. Hive queries)
● One or more small slots reserved on each node for
running small jobs
● Small jobs: fewer than 10 maps/reduces, small input, and a time limit (a sketch of the test follows)
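A minimal sketch of the small-job test implied by this slide (the map/reduce thresholds are the slide's; the input-size cutoff and field names are assumptions):

public class ExpressLane {
    static final long SMALL_INPUT_BYTES = 1L << 30; // assumed 1 GB cutoff

    static boolean isSmallJob(int maps, int reduces, long inputBytes) {
        // Jobs that pass run in the reserved small slots; a time limit
        // evicts them if they turn out not to be small after all.
        return maps < 10 && reduces < 10 && inputBytes < SMALL_INPUT_BYTES;
    }
}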
Slide 23
Easy: Label-based Scheduling
● Assign labels to nodes or regex/glob expressions for nodes
– perfnode1* → “production”
– /.*ssd[0-9]*/ → “fast_ssd”
● Create label expressions for jobs/queues
– Queue “fast_prod” → “production && fast_ssd”
● Tasks from these jobs/queues are assigned only to nodes whose labels match the expression (matching sketched below).
● Combine with Data Placement policies for data and compute locality
● No static partitioning necessary
– Frequent labels file refresh
– New nodes automatically fall into appropriate regex/glob labels
– New jobs can specify label expression or use queue's or both
● http://www.mapr.com/doc/display/MapR/Placing+Jobs+on+Specified+Nodes
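A minimal sketch of the matching logic under assumed semantics (only “&&” conjunctions are handled; the patterns are the slide's examples, with the glob rewritten as a regex):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class LabelScheduler {
    // Node-name pattern -> label; "perfnode1*" glob becomes regex "perfnode1.*".
    static final Map<Pattern, String> LABELS = Map.of(
        Pattern.compile("perfnode1.*"), "production",
        Pattern.compile(".*ssd[0-9]*"), "fast_ssd");

    static Set<String> labelsFor(String node) {
        Set<String> out = new HashSet<>();
        LABELS.forEach((re, label) -> {
            if (re.matcher(node).matches()) out.add(label);
        });
        return out;
    }

    // Evaluate a conjunction like "production && fast_ssd" against a node.
    static boolean matches(String expr, Set<String> nodeLabels) {
        return Arrays.stream(expr.split("&&"))
                     .map(String::trim)
                     .allMatch(nodeLabels::contains);
    }

    public static void main(String[] args) {
        Set<String> labels = labelsFor("perfnode17-ssd3");
        System.out.println(matches("production && fast_ssd", labels)); // true
    }
}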
Slide 24
Other Improvements
● Parallel split computation in the JobClient
– Might as well multi-thread it! (sketched below)
● Runaway Job Protection
– One user's fork-bomb shouldn't degrade others' performance
– CPU/memory firewalls protect system processes
● Map-side join locality
– Files in same directory/container follow same replication chain
– Same key ranges likely to be co-located on same node.
● Zero-config XML
– Repeated XML config parsing takes too much time
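A sketch of the parallel-split idea (hypothetical types; the real JobClient builds an InputSplit per file/block): listing and stat-ing many inputs is I/O-bound, so fanning the work out over a thread pool speeds up job submission:

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelSplits {
    record Split(String path, long offset, long length) {} // stand-in for InputSplit

    static List<Split> computeSplits(List<String> inputs, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Split>> futures = inputs.stream()
                    .map(path -> pool.submit(() -> statAndSplit(path)))
                    .collect(Collectors.toList());
            List<Split> out = new java.util.ArrayList<>();
            for (Future<Split> f : futures) out.add(f.get());
            return out;
        } finally {
            pool.shutdown();
        }
    }

    static Split statAndSplit(String path) {
        // placeholder for the per-file metadata stat call
        return new Split(path, 0, 128L << 20);
    }
}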
Slide 25
MapR MapReduce Summary
● Fast
– Direct Shuffle
– Express Lane
– Parallel Split Computation
– Map-side Join Locality
– Zero-config XML
● Reliable
– JobTracker HA
– Runaway Job Protection
● Easy
– Label-based Scheduling
Slide 27
M7: Enterprise-Grade HBase
[Diagram: other distributions stack HBase in a JVM on a DFS in another JVM, on ext3, on disks; MapR runs one unified data platform directly on disks]
● Easy: no RegionServers; seamless splits; automatic merges; in-memory column families
● Dependable: no compactions; instant recovery from node failure; snapshots; mirroring
● Fast: consistent low latency; real-time in-memory configuration; disk and network compression; reduced I/O to disk
Unified data platform, increased performance, simplified administration.
Slide 28
Apache Drill
● Interactive analysis of Big Data using standard SQL
● Based on Google Dremel
[Diagram: interactive queries (data analysts, reporting) take 100 ms-20 min; data mining, modeling, and large ETL on MapReduce/Hive/Pig take 20 min-20 hr]
● Fast
– Low-latency queries
– Columnar execution
– Complements native interfaces and MapReduce/Hive/Pig
● Open
– Community-driven open source project
– Under the Apache Software Foundation
● Modern
– Standard ANSI SQL:2003 (select/into)
– Nested/hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop, and NoSQL
Slide 31
Contact Us!
I'm not in Sales, so go to mapr.com to learn more:
– Integrations with AWS, GCE, Ubuntu, Lucidworks
– Partnerships, Customers
– Support, Training, Pricing
– Ecosystem Components
We're hiring!
University of Wisconsin-Madison Career Fair tomorrow
Email me at: abordelon@maprtech.com