Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.
2. The bad old days (i.e. now)
- Hadoop is a silo
  - HDFS isn't a normal file system
  - Hadoop doesn't really like C++
- R is limited
  - One machine, one memory space
- Isn't there any way we can just get along?
3. The white knight
- MapR changes things: lots of new stuff like snapshots and NFS
- All you need to know, you already know
  - NFS provides cluster-wide file access
  - Everything works the way you expect
  - Performance is high enough to use it as a message bus
4. Example: out-of-core SVD
- SVD provides a compressed matrix form
- Based on a sum of rank-1 matrices: A ≈ σ1 u1 v1' + σ2 u2 v2' + …
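The rank-1 decomposition above can be sketched numerically. This is a minimal illustration (in Python/numpy for brevity, though the talk prototypes in R) of how the SVD expresses a matrix as a sum of rank-1 terms and how truncating that sum yields the compressed form; the matrix sizes and rank are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))

# Full SVD: A is exactly the sum of the rank-1 terms s[i] * outer(u_i, v_i)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
full = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

# Keeping only the k largest singular values gives the compressed form
k = 10
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
rel_err = np.linalg.norm(A - Ak) / np.linalg.norm(A)
```

By the Eckart–Young theorem, the truncated sum is the best rank-k approximation in Frobenius norm, which is what makes it usable as compression.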
7. Also known as …
- Latent Semantic Indexing
- PCA
- Eigenvectors
8. An application: approximate translation
- Translation distributes over concatenation
- But counting turns concatenation into addition
- This means that translation is linear! (ish)
10. Traditional computation
- Products with A are dominated by the large singular values and their corresponding vectors
- Subtracting these dominant singular values allows the next ones to appear
- Lanczos method; more generally, Krylov subspace methods
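The two bullets above are the core of the traditional approach, and they can be sketched with plain power iteration (a simplification of Lanczos; not code from the talk): repeated products with A'A converge to the dominant right singular vector, and deflating, i.e. subtracting the dominant rank-1 term, exposes the next one. The iteration count and sizes here are arbitrary demo choices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 20))

def top_singular_triple(M, iters=1000):
    # Power iteration on M'M: products with M are dominated by the
    # largest singular value, so v converges to its right singular vector
    v = rng.standard_normal(M.shape[1])
    for _ in range(iters):
        v = M.T @ (M @ v)
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(M @ v)
    return (M @ v) / sigma, sigma, v

u1, s1, v1 = top_singular_triple(A)
# Deflate: subtract the dominant rank-1 term so the next value can appear
u2, s2, v2 = top_singular_triple(A - s1 * np.outer(u1, v1))
```

Krylov methods (Lanczos) refine this idea by keeping the whole subspace of iterates instead of one vector, which is why they converge much faster in practice.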
19. Hybrid architecture
- Map-reduce: feature extraction and down-sampling
- Map-reduce: block-wise parallel SVD
- R: visualization
- Connected via NFS
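The block-wise parallel SVD in this architecture is the stochastic projection SVD (the Halko–Martinsson–Tropp randomized algorithm, essentially what Mahout implements). A minimal single-machine sketch in Python/numpy follows; the names (Omega, Y, B) and sizes are illustrative, and in the hybrid setup the two passes over A would be the map-reduce steps while the small SVD of B could run in R:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, rank = 200, 100, 10
# An exactly rank-10 test matrix
A = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

k, p = 10, 5                       # target rank and oversampling
Omega = rng.standard_normal((n, k + p))
Y = A @ Omega                      # pass 1 over A (block-wise / map-reduce)
Q, _ = np.linalg.qr(Y)             # orthonormal basis for the range of A
B = Q.T @ A                        # pass 2 over A; B is small: (k+p) x n
Ub, s, Vt = np.linalg.svd(B, full_matrices=False)  # small, in-memory SVD
U = Q @ Ub                         # lift back to the original row space
```

Because only the two big products touch A, no map-reduce iteration is needed, which is the point made in the conclusions.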
20. Conclusions
- Inter-operability allows massive scalability
- Prototyping in R is not wasted
- Map-reduce iteration is not needed for SVD
- Feasible scale: ~10^9 non-zeros or more