Recent developments in Hadoop version 2 are pushing the system from the traditional, batch-oriented computational model based on MapReduce towards becoming a multi-paradigm, general-purpose platform. In the first part of this talk we will review and contrast three popular processing frameworks. In the second part we will look at how the ecosystem (e.g. Hive, Mahout, Spark) is making use of these new advancements. Finally, we will illustrate use cases of batch, interactive and streaming architectures to power traditional and advanced analytics applications.
12. All in all
Great when records (jobs) are independent
Composability monsters
Computation vs. communication tradeoff
Low level API
Tuning required
15. Dremel (Impala, Drill)
Google paper (2010)
Access blocks directly from data nodes (partition the fs namespace)
Columnar store (optimized for OLAP)
Appeals to database / BI crowds
Ridiculously fast (as long as you have memory)
19. Tez (Dryad)
Microsoft paper (2007)
Generalization of MapReduce as dataflow
Express dependencies, I/O pipelining
Low level API for building DAGs
Mainly an execution engine (Hive-on-Tez, Pig-on-Tez)
21. DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
    .addVertex(summerVertex)
    .addEdge(
        new Edge(tokenizerVertex, summerVertex,
            edgeConf.createDefaultEdgeProperty()));
22. package org.apache.tez.mapreduce.examples;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.mapreduce.committer.MROutputCommitter;
import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
import org.apache.tez.mapreduce.hadoop.MRHelpers;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.Output;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

import com.google.common.base.Preconditions;

public class WordCount extends Configured implements Tool {

  // Tokenizer vertex: reads lines via MRInput and emits (word, 1) pairs
  public static class TokenProcessor extends SimpleMRProcessor {
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      Preconditions.checkArgument(getOutputs().size() == 1);
      MRInput input = (MRInput) getInputs().values().iterator().next();
      KeyValueReader kvReader = input.getReader();
      Output output = getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
      while (kvReader.next()) {
        StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          kvWriter.write(word, one);
        }
      }
    }
  }

  // Summer vertex: aggregates counts per word and writes them out via MROutput
  public static class SumProcessor extends SimpleMRProcessor {
    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      MROutput out = (MROutput) getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = out.getWriter();
      KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
          .getReader();
      while (kvReader.next()) {
        Text word = (Text) kvReader.getCurrentKey();
        int sum = 0;
        for (Object value : kvReader.getCurrentValues()) {
          sum += ((IntWritable) value).get();
        }
        kvWriter.write(word, new IntWritable(sum));
      }
    }
  }

  // Wire the two vertices into a DAG connected by a sorted, partitioned edge
  private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
      Map<String, LocalResource> localResources, Path stagingDir,
      String inputPath, String outputPath) throws IOException {
    Configuration inputConf = new Configuration(tezConf);
    inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
    InputDescriptor id = new InputDescriptor(MRInput.class.getName())
        .setUserPayload(MRInput.createUserPayload(inputConf,
            TextInputFormat.class.getName(), true, true));

    Configuration outputConf = new Configuration(tezConf);
    outputConf.set(FileOutputFormat.OUTDIR, outputPath);
    OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
        .setUserPayload(MROutput.createUserPayload(
            outputConf, TextOutputFormat.class.getName(), true));

    Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
        TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
    tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);

    Vertex summerVertex = new Vertex("summer", new ProcessorDescriptor(
        SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
    summerVertex.addOutput("MROutput", od, MROutputCommitter.class);

    OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName(), null).build();

    DAG dag = new DAG("WordCount");
    dag.addVertex(tokenizerVertex)
        .addVertex(summerVertex)
        .addEdge(
            new Edge(tokenizerVertex, summerVertex,
                edgeConf.createDefaultEdgeProperty()));
    return dag;
  }

  private static void printUsage() {
    System.err.println("Usage: " + " wordcount <in1> <out1>");
    ToolRunner.printGenericCommandUsage(System.err);
  }

  public boolean run(String inputPath, String outputPath, Configuration conf)
      throws Exception {
    System.out.println("Running WordCount");
    // conf and UGI
    TezConfiguration tezConf;
    if (conf != null) {
      tezConf = new TezConfiguration(conf);
    } else {
      tezConf = new TezConfiguration();
    }
    UserGroupInformation.setConfiguration(tezConf);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();

    // staging dir
    FileSystem fs = FileSystem.get(tezConf);
    String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR
        + user + Path.SEPARATOR + ".staging" + Path.SEPARATOR
        + Long.toString(System.currentTimeMillis());
    Path stagingDir = new Path(stagingDirStr);
    tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
    stagingDir = fs.makeQualified(stagingDir);

    // No need to add jar containing this class as assumed to be part of
    // the tez jars.
    // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming
    // staging dir is the same filesystem as the one used for Input/Output.
    TezClient tezSession = new TezClient("WordCountSession", tezConf);
    tezSession.start();

    DAGClient dagClient = null;
    try {
      if (fs.exists(new Path(outputPath))) {
        throw new FileAlreadyExistsException("Output directory "
            + outputPath + " already exists");
      }
      Map<String, LocalResource> localResources =
          new TreeMap<String, LocalResource>();
      DAG dag = createDAG(fs, tezConf, localResources,
          stagingDir, inputPath, outputPath);
      tezSession.waitTillReady();
      dagClient = tezSession.submitDAG(dag);
      // monitoring
      DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
      if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
        System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
        return false;
      }
      return true;
    } finally {
      fs.delete(stagingDir, true);
      tezSession.stop();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      printUsage();
      return 2;
    }
    WordCount job = new WordCount();
    job.run(otherArgs[0], otherArgs[1], conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
23. Spark
AMPLab paper (2010), builds on Dryad
Resilient Distributed Datasets (RDDs)
High level API (and a REPL)
Also an execution engine (Hive-on-Spark, Pig-on-Spark)
24. import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

// word count on the RDD API: read, tokenize, pair, reduce, write back to HDFS
JavaSparkContext spark = new JavaSparkContext(new SparkConf().setAppName("WordCount"));

JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://outfile.txt");
25. Rule of thumb
Avoid spill-to-disk
Spark and Tez don’t mix well
Join on 50+ TB = Hive+Tez, MapReduce
Direct access to API (in memory) = Spark
OLAP = Hive+Tez, Cloudera Impala
33. Apache Hive
HiveQL
Data stored on HDFS
Metadata kept in MySQL (metastore)
Metadata exposed to third parties (HCatalog)
Suitable both for interactive and batch queries
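
To give a flavour of HiveQL, here is a hypothetical query (not from the original deck) against the tweets table defined on a later slide:

SELECT user_id, COUNT(*) AS n
FROM tweets
GROUP BY user_id
ORDER BY n DESC
LIMIT 10;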
36. The nature of Hive tables
CREATE TABLE (and LOAD DATA) produce metadata
Schema based on the data “as it has already arrived”
Data files underlying a Hive table are no different from any other file on HDFS
Primitive types behave as in Java
37. Data formats
Record oriented (Avro, text)
Column oriented (Parquet, ORC)
38. Text (tab separated)
create external table tweets
(
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '$input';

$ hadoop fs -cat /data/tweets.tsv
2014-03-12T17:34:26.000Z!443802208698908672! Oh & I'm chuffed for @GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru! NULL! 223224878! NULL
2014-03-12T17:34:26.000Z!443802208706908160! Stalker48 Kembali Lagi Cek Disini http://t.co/4BMTFByFH5 236! NULL! 629845435! NULL
2014-03-12T17:34:26.000Z!443802208728268800! @Piconn ou melhor, eu era :c mudei! NULL! 255768055! NULL
2014-03-12T17:34:26.000Z!443802208698912768! I swear Ryan's always in his own world. He's always like 4 hours behind everyone else.! NULL! 2379282889! NULL
2014-03-12T17:34:26.000Z!443802208702713856! @maggersforever0 lmfao you gotta see this, its awesome http://t.co/1PvXEELlqi! NULL! 355858832! NULL
2014-03-12T17:34:26.000Z!443802208698896384! Crazy... http://t.co/G4QRMSKGkh! NULL! 106439395! NULL
41. Some thoughts on schemas
Only make additive changes
Think about schema distribution
Manage schema versions explicitly
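
In Hive, for example, an additive change can be as simple as appending a column; rows written before the change simply read back NULL for it (the column name here is illustrative, not from the deck):

ALTER TABLE tweets ADD COLUMNS (lang string);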
42. Parquet
Ad hoc use case
Cloudera Impala’s default file format
Execution engine agnostic
HIVE-5783
Let it handle block size

create table tweets (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) STORED AS PARQUET;

insert into table tweets
select * from tweets_ext;
44. Table Optimization
Create tables with workloads in mind
Partitions
Bucketing
Join strategies
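
A sketch of what partitioning and bucketing look like in the DDL, reusing the tweets schema (table name, partition column and bucket count are illustrative, not from the deck):

CREATE TABLE tweets_by_day (
  tweet_id string,
  text string,
  user_id string,
  place_id string
)
PARTITIONED BY (created_date string)
CLUSTERED BY (user_id) INTO 64 BUCKETS
STORED AS ORC;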
45. Plenty of tunables
# partitions
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.created.files=1000000;

# merge small files
SET hive.merge.size.per.task=256000000;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;

# compression
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
50. Spark & IPython Notebook
from pyspark import SparkContext

sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

Works with Avro, Parquet etc.
Move computation close to data
NumPy, scikit-learn, matplotlib
Setup can be tedious
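
Once the SparkContext is up, the notebook can drive jobs on the cluster interactively. A minimal sketch (the HDFS path is hypothetical):

lines = sc.textFile('hdfs:///data/tweets.tsv')
# tokenize, pair and count words, then pull a sample back into the notebook
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.take(10)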
51. Stream processing
Statistics in real time
Data feeds
Machine generated (sensor data, logs)
Predictive analytics
52. Several niches
Low latency (Storm, S4)
Persistence and resiliency (Samza)
Apply complex logic (Spark Streaming)
Type of message stream (Kafka)
53. Apache Samza
Kafka for streaming
YARN for resource management and execution
Samza API for processing
Sweet spot: seconds to minutes
(Layer diagram: Samza API on top of YARN on top of Kafka)
54. public void process(
IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator)
55. public void window(
MessageCollector collector,
TaskCoordinator coordinator)
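
The two callbacks come together in a task class. A minimal sketch, assuming a Kafka output stream; the class, system and stream names are hypothetical, not from the deck:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class TweetCountTask implements StreamTask, WindowableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "tweet-counts");
  private int count = 0;

  // called once per incoming message
  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    count++;
  }

  // called on a configured interval: emit and reset the running count
  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    collector.send(new OutgoingMessageEnvelope(OUTPUT, count));
    count = 0;
  }
}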
56. Bootstrap streams
Samza can consume messages from multiple streams
Rewind on historical data does not preserve ordering
If a task has any bootstrap streams defined, it will read these streams until they are fully processed
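
As a rough sketch of the job configuration (system and stream names are hypothetical, and the exact property keys vary across Samza versions, so treat this as an assumption to check against the docs):

task.inputs=kafka.tweets,kafka.user-profiles
# read user-profiles fully, from the oldest offset, before regular processing starts
systems.kafka.streams.user-profiles.samza.bootstrap=true
systems.kafka.streams.user-profiles.samza.offset.default=oldest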
58. Learning from data
Predictive model = statistical learning
Simple = parallelizable
Garbage in = garbage out
59. Couple of things we can do
1. Parameter tuning
2. Feature engineering
3. Learn on all data
60. Train against all data
Ensemble methods (cooperative and competitive)
Avoid multi-pass / iterative algorithms
Apply models to live data
Keep models up to date
61. Off the shelf
Apache Mahout (MapReduce, Spark)
MLlib (Spark)
Cascading-pattern (MapReduce, Tez, Spark)
62. Apache Mahout 0.9
Once the default solution for ML with MapReduce
Quality may vary
Good components are really good
Is it a library? A framework? A recommendation system?
63. The good
The go-to if you need a Recommendation System
SGD (optimization)
Random Forest (classification/regression)
SVD (feature engineering)
ALS (collaborative filtering)
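
For instance, a user-based recommender takes only a few lines with Mahout's Taste API. A sketch assuming preferences in Mahout's userID,itemID,rating CSV layout (file name, user id and neighbourhood size are hypothetical):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,rating
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 5 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}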
64. The puzzling
SVM?
Model updates are implementation specific
Feature encoding and input format are often model specific
65. Apache Mahout trunk
Moving away from MapReduce
Spark + Scala DSL = new classes of algorithms
Major code cleanup
70. With Hadoop 2
Cluster as an Operating System
YARN, mostly
Multi-paradigm, better interop
Same system, different tools, multiple use cases
Batch + interactive
71. This said
Ops is where a lot of time goes
Building clusters is hard
Distro fragmentation
Bleeding edge rush
Heavy lifting needed