37. Analyzing big data with open source R and Hadoop
Steven Sit
IBM Silicon Valley Laboratory
38. Outline
• R on Hadoop
• Rationale and requirements
• Various approaches
• Using R and “Big R” packages
• Data exploration, statistical analysis and Visualization
• Scale out R with data partitioning
• Distributed machine learning
• Demo
39. Rationale and requirements
Productivity
• Use natural R syntax to access massive data in Hadoop
• Leverage existing packages and libraries
Platform for analytics
• Common input and output data formats across analytic
algorithms
Customizability & Extensibility
• Easily customize existing algorithms
• Support for both conventional data mining and large
scale ML needs
Scalability & Performance
• Scale to massively parallel clusters
• Analyze terabytes and petabytes of data
40. Common Patterns for Hadoop and Analytics
[Diagram labels: Sources (Streams, Files, NoSQL DB); ETL Tools, SQOOP, Flume, NFS, REST API; Hadoop; Analytic Runtimes; Analytical Tools; SQL engines; DDL; Database Tools; PIG, Java, etc.; Exploration Tools; Search Indexes; Models; Business Intelligence Tools; ODBC, JDBC, Utilities; Warehouses, Marts, MDM]
41. Quick Tour of R
• R is an interpreted language
• Open-source implementation of the S language (1976)
• Best suited for statistical analysis and modeling
• Data exploration and manipulation
• Descriptive statistics
• Predictive analytics and machine learning
• Visualization
• +++
• Can produce “publication quality graphics”
• Emerging as a competitor to proprietary platforms
42. R is hot!... but
• Quite lean as far as software goes
• Free!
• State of the art algorithms
• Statistical researchers often provide their methods as R packages
• New techniques available without delay
• Commercial packages usually behind the curve
• 4700+ packages as of today
• Active and vibrant user community
• Universities are teaching R
• IT shops are adopting R
• Companies are integrating R into their products
• R jobs and demand for R skills on the rise
Unfortunately, R is not built for Big Data: working with large datasets is limited by RAM.
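A rough way to see the RAM limit from plain R: a numeric value costs 8 bytes, and a standard data frame must hold every column in memory at once. The sketch below is only an estimate; the row count is a hypothetical figure chosen for illustration, and the column count is borrowed from the 29-field airline records used in the examples.

```r
# Back-of-the-envelope memory estimate for an all-numeric table.
bytes_per_double <- 8
n_rows <- 120e6   # hypothetical row count, for illustration only
n_cols <- 29      # number of fields in one airline record
gib <- n_rows * n_cols * bytes_per_double / 1024^3

# Confirm the per-element cost by measuring a small vector.
measured    <- as.numeric(object.size(numeric(1e6)))
per_element <- measured / 1e6
```

Under these assumptions the table would need roughly 26 GiB before any copy made during processing, which is why the approaches on the following slides keep the data in Hadoop.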
43. R and Big Data: Various approaches
• R + Hadoop streaming
• Use R as a scripting language
• Write map/reduce jobs in R
• Or Python for mappers, and R for the reducers
• R + open-source packages for Hadoop
• RHadoop
• rmr2: Write low-level map/reduce with API in R
• rhbase and rhdfs: Simple R API over Hadoop capabilities
• RHIPE
• R + SparkR + Hadoop
• R frontend to Spark's in-memory structures, with data from Hadoop
• R + High Level Languages
• JaQL<->R “bridge”
• R interfaces over non-R engines
• ScaleR from Revolution Analytics
• Oracle’s ORE, Netezza’s NZR, Teradata’s teradataR, +++
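To make the Hadoop streaming option concrete, here is a minimal sketch of a mapper written in R. Streaming mappers read text lines and emit tab-separated key/value lines. The function name `map_lines` is hypothetical, and the column positions (carrier in field 9, departure delay in field 16) are taken from the airline example used later in these slides.

```r
# Sketch of a Hadoop-streaming style mapper in R (illustrative only).
# Input:  CSV text lines. Output: "carrier<TAB>delay" lines.
map_lines <- function(lines) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  out <- character(0)
  for (f in fields) {
    # Skip short records, the header line, and missing delays.
    if (length(f) < 16 || f[[1]] == "Year" || f[[16]] == "NA") next
    out <- c(out, paste(f[[9]], f[[16]], sep = "\t"))
  }
  out
}

# In a streaming job the script would wrap the function roughly like this:
#   writeLines(map_lines(readLines(file("stdin"))))
```

Keeping the logic in a function that takes and returns character vectors makes it easy to test on a laptop before submitting it to a cluster.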
44. RHadoop Example
Given a table describing flight information for every airline spanning
several decades:
“Carrier”, “Year”, “Month”, “Day”, “DepDelay”, …
What’s the mean departure delay (“DepDelay”)
for each airline for each month?
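For reference, this is what the question looks like in plain R when the data fits in memory. The tiny data frame is made up for illustration; only the column names come from the slide.

```r
# In-memory version of "mean DepDelay per carrier per month".
flights <- data.frame(
  Carrier  = c("UA", "UA", "DL", "DL", "UA"),
  Year     = c(1987, 1987, 1987, 1987, 1988),
  Month    = c(10, 10, 10, 11, 1),
  DepDelay = c(10, 20, 5, NA, 30)
)

# The formula interface drops rows with a missing DepDelay by default.
mean_delay <- aggregate(DepDelay ~ Carrier + Year + Month,
                        data = flights, FUN = mean)
```

The map/reduce versions that follow compute the same grouped mean, but over data that stays in HDFS.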
45. RHadoop Example (contd.)
Map input: key = NULL, value = vector (columns in a line)
Map output: key = c(Carrier, Year, Month), value = DepDelay
Shuffle output: key = c(Carrier, Year, Month), value = vector(DepDelay)
Reduce output: key = c(Carrier, Year, Month), value = mean(DepDelay)
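The three stages can be imitated in base R on a handful of records. This is only a single-process sketch of the data flow, not how Hadoop executes it; the records and the "|"-joined key format are assumptions made for the example.

```r
# Each record: Year, Month, Carrier, DepDelay (as text, like a parsed line).
records <- list(
  c("1987", "10", "UA", "10"),
  c("1987", "10", "UA", "20"),
  c("1987", "10", "DL", "5")
)

# Map: emit one (key, value) pair per record.
mapped <- lapply(records, function(r) {
  list(key = paste(r[3], r[1], r[2], sep = "|"), value = as.numeric(r[4]))
})

# Shuffle: gather all values that share a key.
keys     <- vapply(mapped, function(kv) kv$key, character(1))
values   <- vapply(mapped, function(kv) kv$value, numeric(1))
shuffled <- split(values, keys)

# Reduce: collapse each group of values to its mean.
reduced <- vapply(shuffled, mean, numeric(1))
```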
46. RHadoop Example (contd.)
csvtextinputformat = function(line) keyval(NULL, unlist(strsplit(line, ",")))
deptdelay = function (input, output) {
  mapreduce(input = input,
            output = output,
            textinputformat = csvtextinputformat,
            map = function(k, fields) {
              # Skip header lines and bad records:
              if (!(identical(fields[[1]], "Year")) && length(fields) == 29) {
                deptDelay <- fields[[16]]
                # Skip records where departure delay is "NA":
                if (!(identical(deptDelay, "NA"))) {
                  # fields[[9]] is carrier, fields[[1]] is year, fields[[2]] is month:
                  keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptDelay)
                }
              }
            },
            reduce = function(keySplit, vv) {
              keyval(keySplit[[2]], c(keySplit[[3]], length(vv), keySplit[[1]], mean(as.numeric(vv))))
            })
}
from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
Source: http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html
47. What is “Big R”
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
• Partitioning of large data
• Parallel cluster execution of R code
• Distributed Machine Learning
• A scalable statistics engine that provides canned algorithms, and an ability to author new ones, all via R
[Diagram: R clients use the IBM R packages against the data sources to (1) pull data (summaries) to the R client, or (2) push R functions right on the data through embedded R execution, alongside (3) scalable machine learning]
48. Let's mix it up a little ... with a demo interspersed with slides.
49. Big R Demo Data
• Publicly available “airline” data
• 22 years of actual arrival/departure information
• Every scheduled flight in the US
• 1987 onwards
• From U.S. Department of Transportation
• Cleansed version
• http://stat-computing.org/dataexpo/2009/the-data.html
50. Airline data description
Year 1987-2008
Month 1-12
DayofMonth 1-31
DayOfWeek 1 (Monday) - 7 (Sunday)
DepTime actual departure time (local, hhmm)
CRSDepTime scheduled departure time (local, hhmm)
ArrTime actual arrival time (local, hhmm)
CRSArrTime scheduled arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number
ActualElapsedTime in minutes
CRSElapsedTime in minutes
AirTime in minutes
ArrDelay arrival delay, in minutes
DepDelay departure delay, in minutes
51. Airline data description (contd.)
Origin origin IATA airport code
Dest destination IATA airport code
Distance in miles
TaxiIn taxi in time, in minutes
TaxiOut taxi out time, in minutes
Cancelled was the flight cancelled?
CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted 1 = yes, 0 = no
CarrierDelay in minutes
WeatherDelay in minutes
NASDelay in minutes
SecurityDelay in minutes
LateAircraftDelay in minutes
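The time columns are stored as hhmm integers, so they need converting before any arithmetic on them. A small sketch, where the helper name `hhmm_to_minutes` is made up for this example:

```r
# Convert local times encoded as hhmm (e.g. 1530 = 15:30)
# into minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

dep_times   <- c(5, 930, 1530, 2359)
dep_minutes <- hhmm_to_minutes(dep_times)
```

Subtracting two raw hhmm values directly would give wrong durations (1000 - 959 is 41, not 1 minute), which is why the conversion comes first.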
52. Explore, visualize, transform, model Hadoop data with R
• Represent Big Data objects as R datatypes
• R's programming syntax and paradigm
• Data stays in HDFS
• R classes (e.g. bigr.frame) as proxies
• No low-level map/reduce constructs
• No underlying scripting languages
# Construct a bigr.frame to access large data set
air <- bigr.frame(dataPath="airline_demo.csv", …)
# How many flights were flown by United or Delta?
length(air$UniqueCarrier[air$UniqueCarrier %in% c("UA", "DL")])
# Filter all flights that were delayed by 15+ minutes at departure or arrival.
airSubset <- air[air$Cancelled == 0 & (air$DepDelay >= 15 | air$ArrDelay >= 15),
                 c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay",
                   "Distance", "CRSElapsedTime")]
# For these filtered flights, compute key statistics (# of flights,
# average flying distance and flying time), grouped by airline
summary(count(UniqueCarrier) + mean(Distance) + mean(CRSElapsedTime) ~
        UniqueCarrier, dataset = airSubset)
bigr.boxplot(air$Distance ~ air$UniqueCarrier, …)
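The same filter-then-summarize pattern can be tried on an ordinary data.frame. This is a base-R analogue for comparison, not the Big R API; the sample rows are invented.

```r
air_local <- data.frame(
  UniqueCarrier  = c("UA", "UA", "DL", "DL", "HA"),
  Cancelled      = c(0, 0, 0, 1, 0),
  DepDelay       = c(20, 3, 16, 40, 0),
  ArrDelay       = c(5, 2, 30, 50, 18),
  Distance       = c(1000, 500, 800, 700, 2500),
  CRSElapsedTime = c(150, 90, 120, 110, 330)
)

# Keep flights that operated and were 15+ minutes late at either end.
delayed <- air_local[air_local$Cancelled == 0 &
                     (air_local$DepDelay >= 15 | air_local$ArrDelay >= 15), ]

# Mean distance and scheduled time per carrier, plus a flight count.
stats <- aggregate(cbind(Distance, CRSElapsedTime) ~ UniqueCarrier,
                   data = delayed, FUN = mean)
counts <- table(delayed$UniqueCarrier)
stats$Flights <- as.vector(counts[as.character(stats$UniqueCarrier)])
```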
53. Scale out R in Hadoop
• Support parallel / partitioned execution of R
• Work around R’s memory limitations
• Execute R snippets on chunks of data
• Partitioned by key, by #rows, via sampling, …
• Follows R’s “apply” model
• Parallelized seamlessly by the Map/Reduce engine
# Filter the airline data on United and Hawaiian
bf <- air[air$UniqueCarrier %in% c("HA", "UA"),]
# Build one decision-tree model per airline
models <- groupApply(data = bf, groupingColumns = list(bf$UniqueCarrier),
                     rfunction = function(df) {
                       library(rpart)
                       predcols <- c('ArrDelay', 'DepDelay', 'DepTime', 'Distance')
                       return (rpart(ArrDelay ~ ., df[,predcols]))
                     })
# Pull the model for HA to the client
modelHA <- bigr.pull(models$HA)
# Visualize the model
prettyTree(modelHA)
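The "apply" model behind this call is the familiar split-then-apply idiom. As a local analogy only (it uses base R's split() and lapply() with lm() rather than Big R and rpart, and the data is made up), one model per airline looks like this:

```r
flights <- data.frame(
  UniqueCarrier = rep(c("HA", "UA"), each = 4),
  DepDelay      = c(0, 10, 20, 30, 0, 10, 20, 30),
  ArrDelay      = c(5, 15, 25, 35, 0, 20, 40, 60)
)

# Partition by key, then run the same function on every partition.
partitions <- split(flights, flights$UniqueCarrier)
models <- lapply(partitions, function(df) lm(ArrDelay ~ DepDelay, data = df))

# Each list element is an ordinary model object, addressable by key.
slopes <- vapply(models, function(m) coef(m)[["DepDelay"]], numeric(1))
```

Each partition has to fit in the memory of the R process that handles it, which is the constraint the later "Use Cases" slide spells out.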
54. Big R API: Core Classes & Methods
Connection handling
bigr.connect() and bigr.disconnect()
is.bigr.connected()
bigr.frame
Modeled after R’s data.frame
Proxy for tabular data
bigr.vector
Modeled after R’s vector datatype
Proxy for a column
bigr.list
(Loosely) modeled after R’s list
Proxy for collections of serialized R objects
• Basic exploration
• head(), tail()
• dim(), length(), nrow(), ncol()
• str(), summary()
• Selection and Projections
• [
• $
• Arithmetic and Logical operators
• +, -, *, /
• &, |, !
• ifelse()
• String and Math functions
• Lots of these
• Other relational operators
• table()
• unique()
• merge()
• summary()
• sort()
• na.omit(), na.exclude()
• Data movement
• as.data.frame, as.bigr.frame
• bigr.persist
• Sampling
• bigr.sample()
• bigr.random()
• Visualizations
• built into R packages (e.g. ggplot2)
55. Big R API: Core Classes & Methods (contd.)
• Summarization and aggregation
• summary(), very powerful when used with “formula” notation
• max(), min(), range()
• Mean, variance and standard deviation
• mean()
• var()
• sd()
• Correlation
• cov() and cor()
• Hadoop options
• bigr.get.server.option()
• bigr.set.server.option()
• bigr.debug(T)
• Useful for servicing
• Will print out internal debugging output
• Catalog access
• bigr.listfs()
• bigr.listTables(), bigr.listColumns()
• groupApply()
• Primary function for embedded execution
• Can return tables or objects
• Run “help(groupApply)” inside R for extensive documentation
• Examining R execution logs
• bigr.logs()
• Other *Apply functions
• rowApply(), for running R on batches of rows
• tableApply(), for running R on entire dataset
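Running a function on batches of rows can be pictured with a small base-R helper. This is an analogy for the idea only; `apply_in_batches` is a hypothetical name and says nothing about how rowApply() is actually implemented or called.

```r
# Cut a data frame into consecutive batches of at most `batch_size` rows
# and apply `f` to each batch.
apply_in_batches <- function(df, batch_size, f) {
  batch_id <- ceiling(seq_len(nrow(df)) / batch_size)
  lapply(split(df, batch_id), f)
}

df <- data.frame(x = 1:10)
batch_sums <- apply_in_batches(df, 4, function(chunk) sum(chunk$x))
```

Here ten rows become batches of 4, 4 and 2 rows, and the function sees only one batch at a time.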
56. Example: What is the average scheduled flight time, actual gate-to-gate time, and actual airtime for each city pair per year?
RHadoop Implementation (>35 lines of code)
mapper.year.market.enroute_time = function(key, val) {
  # Skip header lines, cancellations, and diversions:
  if ( !identical(as.character(val['Year']), 'Year')
       && identical(as.numeric(val['Cancelled']), 0)
       && identical(as.numeric(val['Diverted']), 0) ) {
    # We don't care about direction of travel, so construct 'market'
    # with airports ordered alphabetically
    # (e.g., LAX to JFK becomes 'JFK-LAX')
    if (val['Origin'] < val['Dest'])
      market = paste(val['Origin'], val['Dest'], sep='-')
    else
      market = paste(val['Dest'], val['Origin'], sep='-')
    # key consists of year, market
    output.key = c(val['Year'], market)
    # output gate-to-gate elapsed times (CRS and actual) + time in air
    output.val =
      c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
    return( keyval(output.key, output.val) )
  }
}
reducer.year.market.enroute_time = function(key, val.list) {
  # val.list is a list of row vectors
  # a data.frame is a list of column vectors
  # plyr's ldply() is the easiest way to convert IMHO
  if ( require(plyr) )
    val.df = ldply(val.list, as.numeric)
  else { # this is as close as my deficient *apply skills can come w/o plyr
    val.list = lapply(val.list, as.numeric)
    val.df = data.frame( do.call(rbind, val.list) )
  }
  # columns arrive in the order emitted by the mapper
  colnames(val.df) = c('crs','actual','air')
  output.key = key
  output.val = c( nrow(val.df), mean(val.df$crs, na.rm=T),
                  mean(val.df$actual, na.rm=T),
                  mean(val.df$air, na.rm=T) )
  return( keyval(output.key, output.val) )
}
mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csvtextinputformat,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=10")
            ),
            verbose=T)
}
hdfs.output.path = file.path(hdfs.output.root, 'enroute-time')
results =
  mr.year.market.enroute_time(hdfs.input.path, hdfs.output.path)
results.df = from.dfs(results, to.data.frame=T)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled',
                         'actual', 'in.air')
save(results.df, file="out/enroute.time.RData")
Equivalent Big R Implementation (4 lines of code)
air <- bigr.frame(dataPath = "airline.csv", dataSource = "DEL", na.string="NA")
air$City1 <- ifelse(air$Origin < air$Dest, air$Origin, air$Dest)
air$City2 <- ifelse(air$Origin >= air$Dest, air$Origin, air$Dest)
summary(count(UniqueCarrier) + mean(ActualElapsedTime) +
        mean(CRSElapsedTime) + mean(AirTime) ~ Year + City1 + City2 ,
        dataset = air[air$Cancelled == 0 & air$Diverted == 0,])
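The direction-insensitive city pair is the key trick in both versions: the alphabetically smaller airport always goes first, so LAX to JFK and JFK to LAX land in the same group. A base-R check on invented rows (a local sketch, not Big R code):

```r
air_local <- data.frame(
  Year    = c(2008, 2008, 2008),
  Origin  = c("LAX", "JFK", "SFO"),
  Dest    = c("JFK", "LAX", "ORD"),
  AirTime = c(300, 320, 240),
  stringsAsFactors = FALSE
)

# Order each airport pair alphabetically, ignoring direction of travel.
air_local$City1 <- ifelse(air_local$Origin <  air_local$Dest,
                          air_local$Origin, air_local$Dest)
air_local$City2 <- ifelse(air_local$Origin >= air_local$Dest,
                          air_local$Origin, air_local$Dest)

by_market <- aggregate(AirTime ~ Year + City1 + City2,
                       data = air_local, FUN = mean)
```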
57. Use Cases
• Where Big R works well
• When data can be partitioned cleanly ...
• ... and each partition fits in the memory of a server node
• In other words:
• size of entire data can be bigger than cluster memory
• size of individual partitions limited to node memory (due to R)
• Real-world customer scenarios we’ve seen:
• Model data from individual tractors
• Build time-series anomaly detection on IP address pairs
• Build models on each customer’s behavior
• And where it doesn’t ...
• When building one monolithic model on the entire data
• without sampling
• without using some form of blending, such as ensembles
58. Large scale analytics in Hadoop
• Some workloads are not logically partitionable, or partitions are still large
• SystemML - a scalable engine running natively over Hadoop:
• Deliver pre-built ML algorithms
• Regression, Classification, Clustering, etc.
• Ability to author new algorithms
# Build a model on the entire data set
model <- bigr.lm(ArrDelay ~ ., air)
# Or, build several models, partitioned by airline
models <- groupApply(
  input = air,
  groupBy = air$UniqueCarrier,
  function(df) {
    # Linear regression
    model <- bigr.lm(ArrDelay ~ ., df)
    return (model)
  })
59. Large scale analytics (SystemML) modules
Data Mining
• Association Rules: Apriori and pattern growth for frequent itemsets; sequence miner
• Clustering: K-Means
• Dimension Reduction: Non-negative Matrix Factorizations; Principal Component Analysis (large n, small p); Singular Value Decompositions
• Time Series Analysis: Granger modeling
Predictive Analytics
• Regression: Linear Regression for large, sparse datasets; Generalized Linear Models
• Classification: Linear Logistic Regression (Trust Region Method); Linear SVMs (Modified Finite Newton Method); Random decision trees for classification & regression
Ranking
• PageRank of a directed graph
• HITS Hubs and Authorities
Optimization
• Conjugate Gradient for Sparse Linear Systems
• Parallel Optimization for sparse linear models
• Stochastic Gradient Descent
Outlier Detection
• Recursive binning and reprojection for distance-based outlier detection
Data Exploration
• Univariate: scale, nominal, ordinal variables
• Scale: min, max, range, mean, variance, moments (kurtosis, skewness), order statistics (median, quantile, iqm), outliers
• Categorical: mode, histograms
• Bivariate
• Scale/categorical: Eta, ftest, grouped mean/variance/weight
• Categorical/categorical: Cramér's V, Pearson ChiSquare, Spearman
Recommender Systems
• Matrix Completion algorithms
Meta Learning
• Ensemble Learning
• Cross Validation
60. Example - Recommendation Systems with Collaborative Filtering
[Figure: a sparse ratings matrix (people × movies), stored as (row, column, value) triples such as "1 1 0.10", is factored into W (people × K factors) and H (K factors × movies)]
Analyzing both similarity of people and products
61. Example: Topic Evolution in Social Media with Clustering
[Figure: a sparse tokens matrix (documents × words), stored as (row, column, value) triples such as "1 1 0.10", is factored into W (documents × K topics) and H (K topics × words)]
62. Example: Gaussian Non-negative Matrix Factorization
package gnmf;

import java.io.IOException;
import java.net.URISyntaxException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MatrixGNMF
{
    public static void main(String[] args) throws IOException, URISyntaxException
    {
        if(args.length < 10)
        {
            System.out.println("missing parameters");
            System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] " +
                "[k] [num mappers] [num reducers] [replication] [working directory] " +
                "[final directory of w] [final directory of h]");
            System.exit(1);
        }
        String vDir = args[0];
        String wDir = args[1];
        String hDir = args[2];
        int k = Integer.parseInt(args[3]);
        int numMappers = Integer.parseInt(args[4]);
        int numReducers = Integer.parseInt(args[5]);
        int replication = Integer.parseInt(args[6]);
        String outputDir = args[7];
        String wFinalDir = args[8];
        String hFinalDir = args[9];
        JobConf mainJob = new JobConf(MatrixGNMF.class);
        String vDirectory;
        String wDirectory;
        String hDirectory;
        FileSystem.get(mainJob).delete(new Path(outputDir));
        vDirectory = vDir;
        hDirectory = hDir;
        wDirectory = wDir;
        String workingDirectory;
        String resultDirectoryX;
        String resultDirectoryY;
        long start = System.currentTimeMillis();
        System.gc();
        System.out.println("starting calculation");
        System.out.print("calculating X = WT * V... ");
        workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
            UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
        resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
            workingDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating Y = WT * W * H... ");
        workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
            wDirectory, outputDir);
        resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
            UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating H = H .* X ./ Y... ");
        workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
            hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
        System.out.println("done");
        FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
        FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
        System.out.print("storing back H... ");
        FileSystem.get(mainJob).delete(new Path(hDirectory));
        hDirectory = workingDirectory;
        System.out.println("done");
        System.out.print("calculating X = V * HT... ");
        workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
            UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
        resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
            workingDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating Y = W * H * HT... ");
        workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
            hDirectory, outputDir);
        resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
            UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
        FileSystem.get(mainJob).delete(new Path(workingDirectory));
        System.out.println("done");
        System.out.print("calculating W = W .* X ./ Y... ");
        workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
            wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
        System.out.println("done");
        FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
        FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
        System.out.print("storing back W... ");
        FileSystem.get(mainJob).delete(new Path(wDirectory));
        wDirectory = workingDirectory;
        System.out.println("done");
        // ... (excerpt truncated)
package gnmf;

import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep2
{
    static class UpdateWHStep2Mapper extends MapReduceBase
        implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector>
    {
        @Override
        public void map(TaggedIndex key, MatrixVector value,
            OutputCollector<TaggedIndex, MatrixVector> out,
            Reporter reporter) throws IOException
        {
            out.collect(key, value);
        }
    }

    static class UpdateWHStep2Reducer extends MapReduceBase
        implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject>
    {
        @Override
        public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
            OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
            throws IOException
        {
            MatrixVector result = null;
            while(values.hasNext())
            {
                MatrixVector current = values.next();
                if(result == null)
                {
                    result = current.getCopy();
                } else
                {
                    result.addVector(current);
                }
            }
            if(result != null)
            {
                out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
                    new MatrixObject(result));
            }
        }
    }

    public static String runJob(int numMappers, int numReducers, int replication,
        String inputDir, String outputDir) throws IOException
    {
        String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";
        JobConf job = new JobConf(UpdateWHStep2.class);
        job.setJobName("MatrixGNMFUpdateWHStep2");
        job.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(inputDir));
        job.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
        job.setNumMapTasks(numMappers);
        job.setMapperClass(UpdateWHStep2Mapper.class);
        job.setMapOutputKeyClass(TaggedIndex.class);
        job.setMapOutputValueClass(MatrixVector.class);
        // ... (excerpt truncated)
package gnmf;

import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep1
{
    public static final int UPDATE_TYPE_H = 0;
    public static final int UPDATE_TYPE_W = 1;

    static class UpdateWHStep1Mapper extends MapReduceBase
        implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject>
    {
        private int updateType;

        @Override
        public void map(TaggedIndex key, MatrixObject value,
            OutputCollector<TaggedIndex, MatrixObject> out,
            Reporter reporter) throws IOException
        {
            if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL)
            {
                MatrixCell current = (MatrixCell) value.getObject();
                out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
                    new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
            } else
            {
                out.collect(key, value);
            }
        }

        @Override
        public void configure(JobConf job)
        {
            updateType = job.getInt("gnmf.updateType", 0);
        }
    }

    static class UpdateWHStep1Reducer extends MapReduceBase
        implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector>
    {
        private double[] baseVector = null;
        private int vectorSizeK;

        @Override
        public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
            OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
            throws IOException
        {
            if(key.getType() == TaggedIndex.TYPE_VECTOR)
            {
                if(!values.hasNext())
                    throw new RuntimeException("expected vector");
                MatrixFormats current = values.next().getObject();
                if(!(current instanceof MatrixVector))
                    throw new RuntimeException("expected vector");
                baseVector = ((MatrixVector) current).getValues();
            } else
            {
                while(values.hasNext())
                {
                    MatrixCell current = (MatrixCell) values.next().getObject();
                    if(baseVector == null)
                    {
                        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                            new MatrixVector(vectorSizeK));
                    } else
                    {
                        if(baseVector.length == 0)
                            throw new RuntimeException("base vector is corrupted");
                        MatrixVector resultingVector = new MatrixVector(baseVector);
                        resultingVector.multiplyWithScalar(current.getValue());
                        if(resultingVector.getValues().length == 0)
                            throw new RuntimeException("multiplying with scalar failed");
                        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                            resultingVector);
                    }
                }
                baseVector = null;
            }
        }

        @Override
        public void configure(JobConf job)
        {
            vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
            if(vectorSizeK == 0)
            // ... (excerpt truncated)
Java Implementation (>1500 lines of code)
Equivalent Big R - SystemML Implementation (12 lines of code)
# Perform matrix operations, say non-negative factorization
# V ~~ WH
V <- bigr.matrix("V.mtx"); # initial matrix on HDFS
W <- bigr.matrix(nrow=nrow(V), ncols=k); # initialize starting points
H <- bigr.matrix(nrow=k, ncols=ncols(V));
for (i in 1:numiterations) {
  H <- H * (t(W) %*% V / t(W) %*% W %*% H);
  W <- W * (V %*% t(H) / W %*% H %*% t(H));
}
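The same multiplicative updates run unchanged on small in-memory matrices in base R, which is a handy way to check the algebra before scaling out. This is a local sketch with ordinary matrices and random data, not Big R or SystemML code; the sizes, seed and iteration count are arbitrary choices.

```r
set.seed(1)
n <- 6; m <- 5; k <- 2
V <- matrix(runif(n * m), nrow = n)   # non-negative data to factor
W <- matrix(runif(n * k), nrow = n)   # random positive starting points
H <- matrix(runif(k * m), nrow = k)

frob <- function(A) sqrt(sum(A^2))    # Frobenius norm
err_start <- frob(V - W %*% H)

for (i in 1:50) {
  # %*% binds tighter than * and /, so the extra parentheses
  # are for readability only.
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H))
}
err_end <- frob(V - W %*% H)
```

Because every update multiplies by a ratio of non-negative terms, W and H stay non-negative, and the reconstruction error does not increase from one iteration to the next.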
63. Summary - Popular R Analytics on Hadoop
• R is the preferred language for Data Scientists
• Many approaches exist to enable R on Hadoop with pros and cons
• “Big R” approach:
• Data exploration, statistical analysis and visualization with natural R syntax
• Scale out R with data partitioning
• Support for standard R tools and existing packages and libraries
• SystemML engine for Distributed Machine Learning that provides canned algorithms, and an ability to author new ones, all via R-like syntax
[Diagram: R clients connect to Hadoop, which hosts an R runtime and a distributed ML runtime over the data sources (Hive tables, HBase tables, files)]