Offline and near-real-time data processing. Not online.
Simple map-reduce is easy – but it can get complicated very quickly.
Assume users know about Hadoop Streaming
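For anyone who needs a refresher, the Streaming contract can be sketched in a few lines: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sees each key's records together. The word-count job and the names `mapper`, `reducer`, and `run_local` are assumptions made for illustration; they are not taken from the slides, and `run_local` only imitates the sort step in memory.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each key. Input must already be sorted by key."""
    keyed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Hypothetical local stand-in for the shuffle: sort mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["the cat", "the dog"]):
        print(record)
```

In a real Streaming job the two functions would live in separate scripts reading stdin and writing stdout, and the same scripts can be tried from the shell by piping the mapper's output through `sort` into the reducer.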
Nomenclature: Core switch and Top of Rack
Compare to a standard Unix file system.
Rack-local and node-local access rocks. Scalability is bound by switches.
Point out that, now that we know how HDFS works, we can run maps close to the data.
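The idea of running maps close to the data can be illustrated with a toy placement rule: prefer a free node that holds a replica of the block, then a free node in the same rack as a replica, then anything else. This is a simplified sketch and not Hadoop's scheduler code; `pick_node`, the node names, and the `rack_of` mapping are all hypothetical.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose where to run a map task for one block.

    replica_nodes: nodes holding a replica of the block (assumed known from HDFS)
    free_nodes:    nodes with a free map slot, in the order they are considered
    rack_of:       hypothetical mapping from node name to rack name
    Returns (node, locality level).
    """
    replicas = set(replica_nodes)

    # Best case: the task reads its block from the local disk.
    for node in free_nodes:
        if node in replicas:
            return node, "node-local"

    # Next best: the block crosses only the top-of-rack switch.
    replica_racks = {rack_of[node] for node in replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # Otherwise the read goes through the core switch.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(pick_node(["n1"], ["n3", "n2"], racks))
```

The ordering mirrors the nomenclature from the earlier slide: each step down the list pushes traffic through one more switch.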
Point out how data-local computing is useful in this example. It exposes some of the features we need in Hadoop: the output of a reducer can be sent directly to another reducer. As an exercise for the reader: the results from the shell do not equal those from Hadoop, and it is interesting to find out why.