Llnl talk

MapR Architecture and Machine Learning

Outline
MapR system overview
Map-reduce review
MapR architecture
Performance Results
Map-reduce on MapR
Machine learning on MapR

Map-Reduce
[Diagram: Input, Shuffle, Output]

Bottlenecks and Issues
Read-only files
Many copies in I/O path
Shuffle based on HTTP
Can’t use new technologies
Eats file descriptors
Spills go to local file space
Bad for skewed distribution of sizes

MapR Improvements
Faster file system
Fewer copies
Multiple NICs
No file descriptor or page-buf competition
Faster map-reduce
Uses distributed file system
Direct RPC to receiver
Very wide merges

MapR Innovations
Volumes
Distributed management
Data placement
Read/write random access file system
Allows distributed meta-data
Improved scaling
Enables NFS access
Application-level NIC bonding
Transactionally correct snapshots and mirrors

MapR's Containers
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
No need to manage directly
Containers are 16-32 GB segments of disk, placed on nodes

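Container locations and replication
[Diagram: CLDB table listing the nodes that host each container, with entries (N1, N2), (N3, N2), (N1, N2), (N1, N3), (N3, N2), above nodes N1, N2, N3]
Container location database (CLDB) keeps track of nodes hosting each container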
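The lookup role described here can be pictured with a small in-memory table. This is only a toy sketch: the class name, method names, and container ids are hypothetical, and the node pairs are the ones legible in the diagram text. It does not reflect the actual CLDB interface.

```python
class ContainerLocationDB:
    """Toy stand-in for a container location database (hypothetical API)."""

    def __init__(self):
        # container id -> list of node names hosting a replica
        self._locations = {}

    def register(self, container_id, nodes):
        """Record which nodes host replicas of a container."""
        self._locations[container_id] = list(nodes)

    def lookup(self, container_id):
        """Return the nodes hosting a container (empty list if unknown)."""
        return list(self._locations.get(container_id, []))

    def containers_on(self, node):
        """Return the ids of all containers with a replica on a node."""
        return sorted(
            cid for cid, nodes in self._locations.items() if node in nodes
        )


cldb = ContainerLocationDB()
# Node pairs as read from the diagram; the integer container ids are made up.
placements = [("N1", "N2"), ("N3", "N2"), ("N1", "N2"), ("N1", "N3"), ("N3", "N2")]
for container_id, nodes in enumerate(placements):
    cldb.register(container_id, nodes)

print(cldb.lookup(1))            # nodes hosting container 1
print(cldb.containers_on("N3"))  # containers that would be affected if N3 failed
```

A client would ask the table where a container lives and then talk to those nodes directly; the reverse query is the kind of lookup needed when a node is lost.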

MapR Scaling
Containers represent 16-32 GB of data
100M containers = ~2 exabytes (a very large cluster)
250 bytes DRAM to cache a container
But not necessary, can page to disk
Typical large 10 PB cluster needs 2 GB
Container-reports are 100x-1000x smaller than HDFS block-reports
Increase container size to 64 GB to serve 4 EB cluster
Map/reduce not affected
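The container arithmetic on this slide can be checked with a few lines of Python. This is a back-of-envelope sketch assuming decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the function names are illustrative and not part of any MapR tool.

```python
GB = 10**9   # decimal gigabyte (assumption)
EB = 10**18  # decimal exabyte (assumption)


def addressable_exabytes(containers, container_gb):
    """Total data addressable by a number of containers of a given size."""
    return containers * container_gb * GB / EB


def cache_gigabytes(containers, bytes_per_entry):
    """DRAM needed to cache one location entry per container."""
    return containers * bytes_per_entry / GB


containers = 100_000_000  # 100M containers, as on the slide

for size_gb in (16, 32, 64):
    total = addressable_exabytes(containers, size_gb)
    print(f"{size_gb} GB containers: {total:.1f} EB")

print(f"cache for all entries: {cache_gigabytes(containers, 250):.0f} GB")
```

Under these assumptions 100M containers span 1.6 to 3.2 EB at 16-32 GB each, which brackets the slide's "~2 exabytes", and 6.4 EB at 64 GB each, which covers the 4 EB case. Caching all 100M entries at 250 bytes each takes 25 GB of DRAM.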

Terasort on MapR
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
[Chart: elapsed time (mins); lower is better]

MUCH faster for some operations
Same 10 nodes …
[Chart: create rate vs. # of files (millions); annotation "Test stopped here"]

MUCH faster for some operations

NFS mounting models
Export to the world
NFS gateway runs on selected gateway hosts
Local server
NFS gateway runs on local host
Enables local compression and checksumming
Export to self
NFS gateway runs on all data nodes, mounted from localhost

Export to the world
[Diagram: one NFS client, four NFS servers]

Local server
[Diagram labels: Client, Application, NFS Server, Cluster Nodes]

Universal export to self
[Diagram labels: Cluster Nodes, Cluster Node, Application, NFS Server]

[Diagram: three cluster nodes, each labeled with Application and NFS Server]
Nodes are identical

Sharded text indexing
Mapper assigns document to shard
Shard is usually hash of document id
Reducer indexes all documents for a shard
Indexes created on local disk
On success, copy index to DFS
On failure, delete local files
Must avoid directory collisions
can’t use shard id!
Must manage local disk space
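The bookkeeping listed above can be sketched in Python. Everything here is an assumption made for illustration: the function names, the use of MD5 as the hash, the attempt id as the collision-avoiding directory suffix, the tab-separated file standing in for a real index, and an ordinary directory standing in for the DFS.

```python
import hashlib
import os
import shutil
import tempfile


def shard_for(doc_id, num_shards):
    """Mapper side: pick a shard from a stable hash of the document id."""
    # Python's built-in hash() of a str varies between processes,
    # so a digest is used to keep the assignment stable across mappers.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def work_dir(base, shard, attempt_id):
    """Local scratch directory; the attempt id keeps retries from colliding."""
    return os.path.join(base, f"shard-{shard:05d}-{attempt_id}")


def build_index(docs, shard, attempt_id, local_base, dfs_base):
    """Reducer side: build locally, copy out on success, always clean up."""
    local = work_dir(local_base, shard, attempt_id)
    os.makedirs(local)
    try:
        # Placeholder "index": one line per document.
        with open(os.path.join(local, "index.txt"), "w", encoding="utf-8") as f:
            for doc_id, text in docs:
                f.write(f"{doc_id}\t{text}\n")
        final = os.path.join(dfs_base, f"shard-{shard:05d}")
        shutil.copytree(local, final)  # on success, copy index to the "DFS"
        return final
    finally:
        # Runs on failure and on success, so local disk space is reclaimed.
        shutil.rmtree(local, ignore_errors=True)


if __name__ == "__main__":
    docs = [("doc-1", "hello world"), ("doc-2", "map reduce")]
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as dfs:
        shard = shard_for("doc-1", 8)
        print(build_index(docs, shard, "attempt_0", local, dfs))
```

The scratch directory is keyed on the attempt rather than the shard alone because two attempts at the same shard could otherwise write into the same local path.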

Conventional data flows
[Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine]
Failure of a reducer causes garbage to accumulate in the local disk
Failure of search engine requires another download of the index from clustered storage.

Simplified NFS data flows
[Diagram labels: Input documents, Map, Reducer, Clustered index storage, Search Engine]
Failure of a reducer is cleaned up by map-reduce framework
Search engine reads mirrored index directly.

Application to machine learning
So now we have the hammer
Let’s see some nails!

K-means
Classic E-M based algorithm
Given cluster centroids,
Assign each data point to nearest centroid
Accumulate new centroids
Rinse, lather, repeat
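The loop above can be written out as a short single-machine sketch in plain Python. The function names and the fixed iteration count are illustrative choices, and nothing here is specific to MapR; in a map-reduce setting the assignment step would typically run in the mappers and the accumulation in the reducers.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to a point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans_step(points, centroids):
    """One pass: assign each point to its nearest centroid, then average."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        k = nearest(p, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    # A centroid that attracted no points is left where it was.
    return [
        tuple(s / counts[k] for s in sums[k]) if counts[k] else tuple(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, iterations=10):
    """Rinse, lather, repeat for a fixed number of passes."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, [(0, 0), (10, 10)]))
```

Only the per-centroid sums and counts need to be carried between the two steps, which is why the accumulation maps naturally onto a reduce.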

K-means, the movie
[Diagram labels: Input, Centroids, Assign to nearest centroid, Aggregate new centroids]
