Full stack analytics with Hadoop 2
Trento, 2014-09-11
GABRIELE MODENA LEARNING HADOOP 2
CS.ML
Data Scientist
ML & Data Mining
Academia & Industry

Learning Hadoop 2 for Packt Publishing (together with Garry Turkington). TBD.
This talk is about tools
Your mileage may vary
I will avoid benchmarks
Back in 2012 
HDFS
Google paper (2003)
Distributed storage
Block ops
[diagram: a Name Node coordinating block operations across multiple Data Nodes]
MapReduce 
Google paper (2004)
Divide and conquer functional model
Concepts from database research
Batch workloads
Aggregation operations (e.g. GROUP BY)
Two phases 
Map 
Reduce 
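Not in the original deck: a minimal sketch of the two phases on the stock Hadoop MapReduce Java API. The mapper emits (word, 1) pairs, the framework shuffles and groups by key, and the reducer sums per key. The WordCountMR wrapper class name is illustrative.

// Minimal sketch: canonical word count on the Hadoop MapReduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMR {
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // map phase: emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get(); // reduce phase: sum the 1s grouped under each word
      }
      context.write(key, new IntWritable(sum));
    }
  }
}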
Programs are chains of jobs
All in all 
Great when records (jobs) are independent
Composability monsters
Computation vs. communication tradeoff
Low level API
Tuning required
Computation with MapReduce
CRUNCH
Higher level abstractions, still geared towards batch workloads
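As an illustration (not from the deck), word count on Crunch's API: one logical pipeline, no hand-written shuffle; Crunch plans and chains the underlying MapReduce jobs. Reading paths from args is an assumption.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    // One logical pipeline; Crunch compiles it to MapReduce jobs for us.
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = words.count(); // group-and-count, no shuffle code
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}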
Dremel (Impala, Drill) 
Google paper (2010)
Access blocks directly from data nodes (partition the fs namespace)
Columnar store (optimized for OLAP)
Appeals to database / BI crowds
Ridiculously fast (as long as you have memory)
Computation beyond MapReduce
Iterative workloads
Low latency queries
Real-time computation
High level abstractions
Hadoop 2 
Applications (Hive, Pig, Crunch, Cascading, etc.)
Execution models: Batch (MapReduce), Interactive (Tez), In memory (Spark), Streaming (Storm, Spark, Samza), Graph (Giraph), HPC (MPI)
Resource Management (YARN)
HDFS
Tez (Dryad) 
Microsoft paper (2007)
Generalization of MapReduce as dataflow
Express dependencies, I/O pipelining
Low level API for building DAGs
Mainly an execution engine (Hive-on-Tez, Pig-on-Tez)
DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(
       new Edge(tokenizerVertex, summerVertex,
           edgeConf.createDefaultEdgeProperty()));
package org.apache.tez.mapreduce.examples;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.mapreduce.committer.MROutputCommitter;
import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
import org.apache.tez.mapreduce.hadoop.MRHelpers;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.Output;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

import com.google.common.base.Preconditions;

public class WordCount extends Configured implements Tool {

  public static class TokenProcessor extends SimpleMRProcessor {
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      Preconditions.checkArgument(getOutputs().size() == 1);
      MRInput input = (MRInput) getInputs().values().iterator().next();
      KeyValueReader kvReader = input.getReader();
      Output output = getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
      while (kvReader.next()) {
        StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          kvWriter.write(word, one);
        }
      }
    }
  }

  public static class SumProcessor extends SimpleMRProcessor {
    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      MROutput out = (MROutput) getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = out.getWriter();
      KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
          .getReader();
      while (kvReader.next()) {
        Text word = (Text) kvReader.getCurrentKey();
        int sum = 0;
        for (Object value : kvReader.getCurrentValues()) {
          sum += ((IntWritable) value).get();
        }
        kvWriter.write(word, new IntWritable(sum));
      }
    }
  }

  private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
      Map<String, LocalResource> localResources, Path stagingDir,
      String inputPath, String outputPath) throws IOException {
    Configuration inputConf = new Configuration(tezConf);
    inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
    InputDescriptor id = new InputDescriptor(MRInput.class.getName())
        .setUserPayload(MRInput.createUserPayload(inputConf,
            TextInputFormat.class.getName(), true, true));
    Configuration outputConf = new Configuration(tezConf);
    outputConf.set(FileOutputFormat.OUTDIR, outputPath);
    OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
        .setUserPayload(MROutput.createUserPayload(
            outputConf, TextOutputFormat.class.getName(), true));
    Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
        TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
    tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);
    Vertex summerVertex = new Vertex("summer",
        new ProcessorDescriptor(
            SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
    summerVertex.addOutput("MROutput", od, MROutputCommitter.class);
    OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName(), null).build();
    DAG dag = new DAG("WordCount");
    dag.addVertex(tokenizerVertex)
        .addVertex(summerVertex)
        .addEdge(
            new Edge(tokenizerVertex, summerVertex,
                edgeConf.createDefaultEdgeProperty()));
    return dag;
  }

  private static void printUsage() {
    System.err.println("Usage: " + " wordcount <in1> <out1>");
    ToolRunner.printGenericCommandUsage(System.err);
  }

  public boolean run(String inputPath, String outputPath, Configuration conf) throws Exception {
    System.out.println("Running WordCount");
    // conf and UGI
    TezConfiguration tezConf;
    if (conf != null) {
      tezConf = new TezConfiguration(conf);
    } else {
      tezConf = new TezConfiguration();
    }
    UserGroupInformation.setConfiguration(tezConf);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();
    // staging dir
    FileSystem fs = FileSystem.get(tezConf);
    String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR
        + user + Path.SEPARATOR + ".staging" + Path.SEPARATOR
        + Path.SEPARATOR + Long.toString(System.currentTimeMillis());
    Path stagingDir = new Path(stagingDirStr);
    tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
    stagingDir = fs.makeQualified(stagingDir);
    // No need to add jar containing this class as assumed to be part of
    // the tez jars.
    // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming staging dir
    // is the same filesystem as the one used for Input/Output.
    TezClient tezSession = new TezClient("WordCountSession", tezConf);
    tezSession.start();
    DAGClient dagClient = null;
    try {
      if (fs.exists(new Path(outputPath))) {
        throw new FileAlreadyExistsException("Output directory "
            + outputPath + " already exists");
      }
      Map<String, LocalResource> localResources =
          new TreeMap<String, LocalResource>();
      DAG dag = createDAG(fs, tezConf, localResources,
          stagingDir, inputPath, outputPath);
      tezSession.waitTillReady();
      dagClient = tezSession.submitDAG(dag);
      // monitoring
      DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
      if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
        System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
        return false;
      }
      return true;
    } finally {
      fs.delete(stagingDir, true);
      tezSession.stop();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      printUsage();
      return 2;
    }
    WordCount job = new WordCount();
    job.run(otherArgs[0], otherArgs[1], conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
Spark 
AMPLab paper (2010), builds on Dryad
Resilient Distributed Datasets (RDDs)
High level API (and a REPL)
Also an execution engine (Hive-on-Spark, Pig-on-Spark)
JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://outfile.txt");
Rule of thumb 
Avoid spill-to-disk
Spark and Tez don't mix well
Join on 50+ TB = Hive+Tez, MapReduce
Direct access to API (in memory) = Spark
OLAP = Hive+Tez, Cloudera Impala
Good stuff. So what?
The data <adjective> 
[diagram: data from sources (S3, MySQL, NFS, …) flows through ingestion into HDFS, with metadata, processing and workflow coordination around the store]
Analytics on Hadoop 2 
Batch & interactive
Datawarehousing & computing
Dataset size and velocity
Integrations with existing tools
Distributions will constrain your stack
Use cases 
Datawarehousing
Exploratory Data Analysis
Stream processing
Predictive Analytics
Datawarehousing 
Data ingestion
Pipelines
Transform and enrich (ETL) queries - batch
Low latency (presentation) queries - interactive
Interoperable data formats and metadata
Workflow orchestration
Collection and ingestion 
$ hadoop distcp 
Once data is in HDFS
Apache Hive 
HiveQL
Data stored on HDFS
Metadata kept in MySQL (metastore)
Metadata exposed to third parties (HCatalog)
Suitable both for interactive and batch queries
set hive.execution.engine=tez
set hive.execution.engine=mr
The nature of Hive tables 
CREATE TABLE (and LOAD DATA) produce metadata

Schema based on the data "as it has already arrived"

Data files underlying a Hive table are no different from any other file on HDFS

Primitive types behave as in Java
Data formats 
Record oriented (Avro, text)
Column oriented (Parquet, ORC)
Text (tab separated)
create external table tweets
(
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '$input';
$ hadoop fs -cat /data/tweets.tsv
2014-03-12T17:34:26.000Z  443802208698908672  Oh &amp; I'm chuffed for @GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru  NULL  223224878  NULL
2014-03-12T17:34:26.000Z  443802208706908160  Stalker48 Kembali Lagi Cek Disini http://t.co/4BMTFByFH5 236  NULL  629845435  NULL
2014-03-12T17:34:26.000Z  443802208728268800  @Piconn ou melhor, eu era :c mudei  NULL  255768055  NULL
2014-03-12T17:34:26.000Z  443802208698912768  I swear Ryan's always in his own world. He's always like 4 hours behind everyone else.  NULL  2379282889  NULL
2014-03-12T17:34:26.000Z  443802208702713856  @maggersforever0 lmfao you gotta see this, its awesome http://t.co/1PvXEELlqi  NULL  355858832  NULL
2014-03-12T17:34:26.000Z  443802208698896384  Crazy... http://t.co/G4QRMSKGkh  NULL  106439395  NULL
(tab separators rendered as double spaces)
SELECT COUNT(*) 
FROM tweets
Apache Avro 
Record oriented
Migrations (forward, backward)
Schema on write
Interoperability
{
  "namespace": "com.mycompany.avrotables",
  "name": "tweets",
  "type": "record",
  "fields": [
    {"name": "created_at", "type": "string", "doc": "date_time of tweet"},
    {"name": "tweet_id_str", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "in_reply_to", "type": ["string", "null"]},
    {"name": "is_retweeted", "type": ["string", "null"]},
    {"name": "user_id", "type": "string"},
    {"name": "place_id", "type": ["string", "null"]}
  ]
}
CREATE TABLE tweets
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs:///schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
insert into table tweets select * from tweets_ext;
Some thoughts on schemas 
Only make additive changes
Think about schema distribution
Manage schema versions explicitly
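A sketch of what "additive" means, assuming the tweets schema above: new fields ship with a default, so readers using the old schema and old data files keep working. The lang field here is hypothetical.

{"name": "lang", "type": ["null", "string"], "default": null,
 "doc": "added in v2; the default keeps old data readable"}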
Parquet 
Ad hoc use case
Cloudera Impala's default file format
Execution engine agnostic
HIVE-5783
Let it handle block size

create table tweets (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) STORED AS PARQUET;

insert into table tweets
select * from tweets_ext;
If possible, use both
Table Optimization 
Create tables with workloads in mind
Partitions
Bucketing
Join strategies (see the sketch below)
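For example, a partitioned and bucketed variant of the tweets table might look like this sketch (not from the deck; daily dt partitions and 64 user_id buckets are illustrative choices):

create table tweets_part (
  created_at string,
  tweet_id string,
  text string,
  user_id string
)
PARTITIONED BY (dt string)             -- one directory per day; queries prune on dt
CLUSTERED BY (user_id) INTO 64 BUCKETS -- enables bucketed map joins and sampling
STORED AS PARQUET;

-- dynamic partition insert (needs the tunables on the next slide)
insert into table tweets_part partition (dt)
select created_at, tweet_id, text, user_id,
       substr(created_at, 1, 10) as dt
from tweets;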
Plenty of tunables!

# partitions
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.created.files=1000000;

# merge small files
SET hive.merge.size.per.task=256000000;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;

# Compression
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
Apache Oozie 
Data pipelines
Workflow execution and coordination
Time and availability based execution
Configuration over code
MapReduce centric
Actions: Hive, Pig, fs, shell, sqoop
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
  ...
  <action name="[NODE-NAME]">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>[JOB-TRACKER]</job-tracker>
      <name-node>[NAME-NODE]</name-node>
      <prepare>
        <delete path="[PATH]"/>
        ...
        <mkdir path="[PATH]"/>
        ...
      </prepare>
      <job-xml>[HIVE SETTINGS FILE]</job-xml>
      <configuration>
        <property>
          <name>[PROPERTY-NAME]</name>
          <value>[PROPERTY-VALUE]</value>
        </property>
        ...
      </configuration>
      <script>[HIVE-SCRIPT]</script>
      <param>[PARAM-VALUE]</param>
      ...
      <param>[PARAM-VALUE]</param>
      <file>[FILE-PATH]</file>
      ...
      <archive>[FILE-PATH]</archive>
      ...
    </hive>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>
  ...
</workflow-app>
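Time based execution wraps a workflow like the one above in a coordinator; a minimal sketch (the app name, frequency and workflow path are illustrative):

<coordinator-app name="daily-etl" frequency="${coord:days(1)}"
                 start="2014-09-01T00:00Z" end="2015-09-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <app-path>hdfs:///apps/etl/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>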
EDA 
Luminosity in xkcd comics (courtesy of R-bloggers)
Sample the dataset
Use Hive-on-Tez, Impala
Spark & IPython Notebook

from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

Works with Avro, Parquet, etc.
Move computation close to data
NumPy, scikit-learn, matplotlib
Setup can be tedious
Stream processing 
Statistics in real time
Data feeds
Machine generated (sensor data, logs)
Predictive analytics
Several niches 
Low latency (Storm, S4)
Persistence and resiliency (Samza)
Apply complex logic (Spark Streaming)
Type of message stream (Kafka)
Apache Samza
Kafka for streaming
YARN for resource management and execution
Samza API for processing
Sweet spot: seconds, minutes
[stack: Samza API on top of YARN on top of Kafka]
public void process( 
IncomingMessageEnvelope envelope, 
MessageCollector collector, 
TaskCoordinator coordinator)
public void window( 
MessageCollector collector, 
TaskCoordinator coordinator)
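Those two callbacks live on a task class; a minimal counting task might look like this sketch (the output system and stream names are assumptions):

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class TweetCounterTask implements StreamTask, WindowableTask {
  private int count = 0;

  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    // called once per incoming message
    count++;
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    // called periodically (task.window.ms); emit and reset the counter
    collector.send(new OutgoingMessageEnvelope(
        new SystemStream("kafka", "tweet-counts"), count));
    count = 0;
  }
}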
Bootstrap streams 
Samza can consume messages from multiple streams
Rewinding on historical data does not preserve ordering
If a task has any bootstrap streams defined, it will read these streams until they are fully processed
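In job configuration a bootstrap stream is declared roughly like this (stream names are illustrative; samza.offset.default=oldest makes the rewind start from the beginning):

task.inputs=kafka.tweets,kafka.user-profiles
# read user-profiles fully, from the oldest offset, before processing tweets
systems.kafka.streams.user-profiles.samza.bootstrap=true
systems.kafka.streams.user-profiles.samza.offset.default=oldest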
Predictive modelling 
Learning from data 
Predictive model = statistical learning
Simple = parallelizable
Garbage in = garbage out
Couple of things we can do 
1. Parameter tuning 
2. Feature engineering 
3. Learn on all data 
Train against all data 
Ensemble methods (cooperative and competitive)
Avoid multi-pass / iterations
Apply models to live data
Keep models up to date
Off the shelf 
Apache Mahout (MapReduce, Spark)
MLlib (Spark)
Cascading-pattern (MapReduce, Tez, Spark)
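As a taste of MLlib from Java (Spark 1.x era), a sketch of training a logistic regression with SGD; the input path, iteration count and master URL are illustrative:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class TrainModel {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "train-model");
    // LIBSVM-formatted training data on HDFS
    JavaRDD<LabeledPoint> training =
        MLUtils.loadLibSVMFile(sc.sc(), "hdfs:///data/train.libsvm").toJavaRDD();
    // 100 iterations of SGD; the dataset stays in memory as an RDD
    LogisticRegressionModel model =
        LogisticRegressionWithSGD.train(training.rdd(), 100);
    System.out.println("weights: " + model.weights());
  }
}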
Apache Mahout 0.9 
Once the default solution for ML with MapReduce
Quality may vary
Good components are really good
Is it a library? A framework? A recommendation system?
The good 
The go-to if you need a Recommendation System
SGD (optimization)
Random Forest (classification/regression)
SVD (feature engineering)
ALS (collaborative filtering)
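For instance, the item-based recommender ships as a ready-made Hadoop driver; an illustrative invocation (the paths are assumptions):

$ mahout recommenditembased \
    --input /data/ratings.csv \
    --output /data/recommendations \
    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
    --numRecommendations 10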
The puzzling 
SVM?
Model updates are implementation specific!
Feature encoding and input format are often model specific
Apache Mahout trunk 
Moving away from MapReduce
Spark + Scala DSL = new classes of algorithms
Major code cleanup
It needs major infrastructure work around it
batch + streaming
There’s a buzzword for that 
http://lambda-architecture.net/ 
Wrap up
With Hadoop 2
Cluster as an Operating System
YARN, mostly
Multiparadigm, better interop
Same system, different tools, multiple use cases
Batch + interactive
This said 
Ops is where a lot of time goes
Building clusters is hard
Distro fragmentation
Bleeding edge rush
Heavy lifting needed
That’s all, folks
Thanks for having me
Let’s discuss
Contenu connexe

Tendances

Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterSri Ambati
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyShital Kat
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech ProjectsJody Garnett
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Cassandra advanced data modeling
Cassandra advanced data modelingCassandra advanced data modeling
Cassandra advanced data modelingRomain Hardouin
 

Tendances (20)

Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling Water
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
MATLAB, netCDF, and OPeNDAP
MATLAB, netCDF, and OPeNDAPMATLAB, netCDF, and OPeNDAP
MATLAB, netCDF, and OPeNDAP
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Cassandra advanced data modeling
Cassandra advanced data modelingCassandra advanced data modeling
Cassandra advanced data modeling
 

En vedette

Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemrhatr
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Ashutosh Sonaliya
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsJohn Nestor
 
臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式Xuan-Chao Huang
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsSatya Narayan
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like SparkAlpine Data
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraPiotr Kolaczkowski
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016StampedeCon
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiSpark Summit
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 

En vedette (20)

Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset Transforms
 
臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Spark in 15 min
Spark in 15 minSpark in 15 min
Spark in 15 min
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 

Similaire à Full stack analytics with Hadoop 2

Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in CassandraJairam Chandar
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiUnmesh Baile
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGMatthew McCullough
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa
 

Similaire à Full stack analytics with Hadoop 2 (20)

Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Hadoop
HadoopHadoop
Hadoop
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Hadoop
HadoopHadoop
Hadoop
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbai
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Cloud jpl
Cloud jplCloud jpl
Cloud jpl
 
hadoop.ppt
hadoop.ppthadoop.ppt
hadoop.ppt
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 

Dernier

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Dernier (20)

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

Full stack analytics with Hadoop 2

  • 1. Full stack analytics with Hadoop 2 Trento, 2014-09-11 GABRIELE MODENA LEARNING HADOOP 2
  • 2. CS.ML! Data Scientist ML & Data Mining Academia & Industry ! Learning Hadoop 2 for Packt_Publishing (together with Garry Turkington). TBD.
  • 3. This talk is about tools
  • 5. I will avoid benchmarks
  • 6. Back in 2012 GABRIELE MODENA LEARNING HADOOP 2
  • 7. HDFS Name Node Data Node ! ! Google paper (2003)! Distributed storage! Block ops Name Node Data Node Data Node GABRIELE MODENA LEARNING HADOOP 2
  • 8. MapReduce Google paper (2006)! Divide and conquer functional model! Concepts from database research! Batch worloads! Aggregation operations (eg. GROUP BY) GABRIELE MODENA LEARNING HADOOP 2
  • 9. Two phases Map Reduce GABRIELE MODENA LEARNING HADOOP 2
  • 12. All in all Great when records (jobs) are independent! Composability monsters! Computation vs. Communication tradeoff! Low level API! Tuning required GABRIELE MODENA LEARNING HADOOP 2
  • 13. Computation with MapReduce CRUNCH GABRIELE MODENA LEARNING HADOOP 2
  • 14. Higher level abstractions, still geared towards batch loads
  • 15. Dremel (Impala, Drill) Google paper (2010) ! Access blocks directly from data nodes (partition the fs namespace)! Columnar store (optimize for OLAP)! Appeals to database / BI crowds! Ridiculously fast (as long as you have memory) GABRIELE MODENA LEARNING HADOOP 2
  • 16. Computation beyond MapReduce Iterative workloads! Low latency queries! Real-time computation! High level abstractions GABRIELE MODENA LEARNING HADOOP 2
  • 17. Hadoop 2 Applications (Hive, Pig, Crunch, Cascading, etc…) Streaming (storm, spark, samza) In memory (spark) Interactive (Tez) HPC (MPI) Resource Management (YARN) HDFS Batch (MapReduce) Graph (giraph) GABRIELE MODENA LEARNING HADOOP 2
  • 18.
  • 19. Tez (Dryad) Microsoft paper (2007)! Generalization of MapReduce as dataflow! Express dependencies, I/O pipelining! Low level API for building DAGs! Mainly an execution engine (Hive-on-Tez, Pig-on-Tez) GABRIELE MODENA LEARNING HADOOP 2
  • 21. DAG dag = new DAG("WordCount"); dag.addVertex(tokenizerVertex) .addVertex(summerVertex) .addEdge( new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty())); GABRIELE MODENA LEARNING HADOOP 2
• 22. package org.apache.tez.mapreduce.examples;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.mapreduce.committer.MROutputCommitter;
import org.apache.tez.mapreduce.common.MRInputAMSplitGenerator;
import org.apache.tez.mapreduce.hadoop.MRHelpers;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.Output;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfigurer;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

import com.google.common.base.Preconditions;

public class WordCount extends Configured implements Tool {

  // Tokenizer vertex: emits a (word, 1) pair for every token in the input.
  public static class TokenProcessor extends SimpleMRProcessor {
    IntWritable one = new IntWritable(1);
    Text word = new Text();

    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      Preconditions.checkArgument(getOutputs().size() == 1);
      MRInput input = (MRInput) getInputs().values().iterator().next();
      KeyValueReader kvReader = input.getReader();
      Output output = getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = (KeyValueWriter) output.getWriter();
      while (kvReader.next()) {
        StringTokenizer itr = new StringTokenizer(kvReader.getCurrentValue().toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          kvWriter.write(word, one);
        }
      }
    }
  }

  // Summer vertex: sums the counts received for each word.
  public static class SumProcessor extends SimpleMRProcessor {
    @Override
    public void run() throws Exception {
      Preconditions.checkArgument(getInputs().size() == 1);
      MROutput out = (MROutput) getOutputs().values().iterator().next();
      KeyValueWriter kvWriter = out.getWriter();
      KeyValuesReader kvReader = (KeyValuesReader) getInputs().values().iterator().next()
          .getReader();
      while (kvReader.next()) {
        Text word = (Text) kvReader.getCurrentKey();
        int sum = 0;
        for (Object value : kvReader.getCurrentValues()) {
          sum += ((IntWritable) value).get();
        }
        kvWriter.write(word, new IntWritable(sum));
      }
    }
  }

  private DAG createDAG(FileSystem fs, TezConfiguration tezConf,
      Map<String, LocalResource> localResources, Path stagingDir,
      String inputPath, String outputPath) throws IOException {
    Configuration inputConf = new Configuration(tezConf);
    inputConf.set(FileInputFormat.INPUT_DIR, inputPath);
    InputDescriptor id = new InputDescriptor(MRInput.class.getName())
        .setUserPayload(MRInput.createUserPayload(inputConf,
            TextInputFormat.class.getName(), true, true));

    Configuration outputConf = new Configuration(tezConf);
    outputConf.set(FileOutputFormat.OUTDIR, outputPath);
    OutputDescriptor od = new OutputDescriptor(MROutput.class.getName())
        .setUserPayload(MROutput.createUserPayload(
            outputConf, TextOutputFormat.class.getName(), true));

    Vertex tokenizerVertex = new Vertex("tokenizer", new ProcessorDescriptor(
        TokenProcessor.class.getName()), -1, MRHelpers.getMapResource(tezConf));
    tokenizerVertex.addInput("MRInput", id, MRInputAMSplitGenerator.class);

    Vertex summerVertex = new Vertex("summer", new ProcessorDescriptor(
        SumProcessor.class.getName()), 1, MRHelpers.getReduceResource(tezConf));
    summerVertex.addOutput("MROutput", od, MROutputCommitter.class);

    OrderedPartitionedKVEdgeConfigurer edgeConf = OrderedPartitionedKVEdgeConfigurer
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName(), null).build();

    DAG dag = new DAG("WordCount");
    dag.addVertex(tokenizerVertex)
       .addVertex(summerVertex)
       .addEdge(
           new Edge(tokenizerVertex, summerVertex,
               edgeConf.createDefaultEdgeProperty()));
    return dag;
  }

  private static void printUsage() {
    System.err.println("Usage: " + " wordcount <in1> <out1>");
    ToolRunner.printGenericCommandUsage(System.err);
  }

  public boolean run(String inputPath, String outputPath, Configuration conf) throws Exception {
    System.out.println("Running WordCount");
    // conf and UGI
    TezConfiguration tezConf;
    if (conf != null) {
      tezConf = new TezConfiguration(conf);
    } else {
      tezConf = new TezConfiguration();
    }
    UserGroupInformation.setConfiguration(tezConf);
    String user = UserGroupInformation.getCurrentUser().getShortUserName();

    // staging dir
    FileSystem fs = FileSystem.get(tezConf);
    String stagingDirStr = Path.SEPARATOR + "user" + Path.SEPARATOR + user
        + Path.SEPARATOR + ".staging" + Path.SEPARATOR
        + Long.toString(System.currentTimeMillis());
    Path stagingDir = new Path(stagingDirStr);
    tezConf.set(TezConfiguration.TEZ_AM_STAGING_DIR, stagingDirStr);
    stagingDir = fs.makeQualified(stagingDir);

    // No need to add jar containing this class as assumed to be part of the tez jars.
    // TEZ-674 Obtain tokens based on the Input / Output paths. For now assuming staging dir
    // is the same filesystem as the one used for Input/Output.
    TezClient tezSession = new TezClient("WordCountSession", tezConf);
    tezSession.start();

    DAGClient dagClient = null;
    try {
      if (fs.exists(new Path(outputPath))) {
        throw new FileAlreadyExistsException("Output directory " + outputPath + " already exists");
      }
      Map<String, LocalResource> localResources = new TreeMap<String, LocalResource>();
      DAG dag = createDAG(fs, tezConf, localResources, stagingDir, inputPath, outputPath);
      tezSession.waitTillReady();
      dagClient = tezSession.submitDAG(dag);

      // monitoring
      DAGStatus dagStatus = dagClient.waitForCompletionWithAllStatusUpdates(null);
      if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {
        System.out.println("DAG diagnostics: " + dagStatus.getDiagnostics());
        return false;
      }
      return true;
    } finally {
      fs.delete(stagingDir, true);
      tezSession.stop();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      printUsage();
      return 2;
    }
    WordCount job = new WordCount();
    job.run(otherArgs[0], otherArgs[1], conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}
GABRIELE MODENA LEARNING HADOOP 2
  • 23. Spark AMPLab paper (2010), builds on Dryad! Resilient Distributed Datasets (RDDs)! High level API (and a repl)! Also an execution engine (Hive-on-Spark, Pig-on- Spark) GABRIELE MODENA LEARNING HADOOP 2
• 24. JavaRDD<String> file = spark.textFile("hdfs://infile.txt");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://outfile.txt");
GABRIELE MODENA LEARNING HADOOP 2
  • 25. Rule of thumb Avoid spill-to-disk! Spark and Tez don’t mix well! Join on 50+ TB = Hive+Tez, MapReduce! Direct access to API (in memory) = Spark! OLAP = Hive+Tez, Cloudera Impala! GABRIELE MODENA LEARNING HADOOP 2
  • 26. Good stuff. So what?
  • 27. The data <adjective> [architecture diagram: sources (S3, mysql, nfs, …) flow through an ingestion layer into HDFS, surrounded by workflow coordination, metadata and processing components] GABRIELE MODENA LEARNING HADOOP 2
  • 28. Analytics on Hadoop 2 Batch & interactive! Datawarehousing & computing! Dataset size and velocity! Integrations with existing tools! Distributions will constrain your stack GABRIELE MODENA LEARNING HADOOP 2
  • 29. Use cases Datawarehousing! Explorative Data Analysis! Stream processing! Predictive Analytics GABRIELE MODENA LEARNING HADOOP 2
  • 30. Datawarehousing Data ingestion! Pipelines! Transform and enrich (ETL) queries - batch! Low latency (presentation) queries - interactive! Interoperable data formats and metadata! Workflow Orchestration GABRIELE MODENA LEARNING HADOOP 2
  • 31. Collection and ingestion $ hadoop distcp <src-uri> <dst-uri> (e.g. between clusters, or from s3n:// storage into HDFS) GABRIELE MODENA LEARNING HADOOP 2
  • 32. Once data is in HDFS
  • 33. Apache Hive HiveQL ! Data stored on HDFS! Metadata kept in mysql (metastore)! Metadata exposed to third parties (HCatalog)! Suitable both for interactive and batch queries GABRIELE MODENA LEARNING HADOOP 2
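Running such queries from application code typically goes through the HiveServer2 JDBC driver. A minimal sketch, assuming a HiveServer2 instance on localhost:10000, default credentials, and the tweets table used later in this deck:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Host, port and credentials are assumptions; adjust for your cluster
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();
    // HiveQL is compiled down to MapReduce (or Tez) jobs behind the scenes
    ResultSet rs = stmt.executeQuery(
        "SELECT user_id, COUNT(*) AS n FROM tweets GROUP BY user_id");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}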
  • 36. The nature of Hive tables CREATE TABLE and (LOAD DATA) produce metadata! ! Schema based on the data “as it has already arrived”! ! Data files underlying a Hive table are no different from any other file on HDFS! ! Primitive types behave as in Java GABRIELE MODENA LEARNING HADOOP 2
  • 37. Data formats Record oriented (Avro, text)! Column oriented (Parquet, ORC) GABRIELE MODENA LEARNING HADOOP 2
• 38. Text (tab separated)
create external table tweets (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '$input';

$ hadoop fs -cat /data/tweets.tsv
2014-03-12T17:34:26.000Z  443802208698908672  Oh &amp; I'm chuffed for @GeraintThomas86, doing Wales proud in yellow!! #ParisNice #Cymru  NULL  223224878  NULL
2014-03-12T17:34:26.000Z  443802208706908160  Stalker48 Kembali Lagi Cek Disini http://t.co/4BMTFByFH5 236  NULL  629845435  NULL
2014-03-12T17:34:26.000Z  443802208728268800  @Piconn ou melhor, eu era :c mudei  NULL  255768055  NULL
2014-03-12T17:34:26.000Z  443802208698912768  I swear Ryan's always in his own world. He's always like 4 hours behind everyone else.  NULL  2379282889  NULL
2014-03-12T17:34:26.000Z  443802208702713856  @maggersforever0 lmfao you gotta see this, its awesome http://t.co/1PvXEELlqi  NULL  355858832  NULL
2014-03-12T17:34:26.000Z  443802208698896384  Crazy... http://t.co/G4QRMSKGkh  NULL  106439395  NULL
GABRIELE MODENA LEARNING HADOOP 2
• 40. Apache Avro
Record oriented! Migrations (forward, backward)! Schema on write! Interoperability
{
  "namespace": "com.mycompany.avrotables",
  "name": "tweets",
  "type": "record",
  "fields": [
    {"name": "created_at", "type": "string", "doc": "date_time of tweet"},
    {"name": "tweet_id_str", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "in_reply_to", "type": ["string", "null"]},
    {"name": "is_retweeted", "type": ["string", "null"]},
    {"name": "user_id", "type": "string"},
    {"name": "place_id", "type": ["string", "null"]}
  ]
}
CREATE TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs:///schema/avro/tweets_avro.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

insert into table tweets select * from tweets_ext;
GABRIELE MODENA LEARNING HADOOP 2
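Interoperability in practice: any Avro client can produce files this table reads, not just Hive. A minimal Java sketch against the schema above; the local file names and field values here are illustrative:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteTweets {
  public static void main(String[] args) throws Exception {
    // Parse the same schema the Hive table points at
    Schema schema = new Schema.Parser().parse(new File("tweets_avro.avsc"));

    GenericRecord tweet = new GenericData.Record(schema);
    tweet.put("created_at", "2014-03-12T17:34:26.000Z");
    tweet.put("tweet_id_str", "443802208698908672");
    tweet.put("text", "hello world");
    tweet.put("user_id", "223224878");
    // nullable union fields (in_reply_to, is_retweeted, place_id) stay null

    // Write an Avro container file that the AvroSerDe can read back
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("tweets.avro"));
    writer.append(tweet);
    writer.close();
  }
}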
  • 41. Some thoughts on schemas Only make additive changes! Think about schema distribution! Manage schema versions explicitly GABRIELE MODENA LEARNING HADOOP 2
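One way to keep the "additive changes only" rule honest is to verify each new schema version against the previous one before rollout. A sketch using Avro's SchemaCompatibility helper; the schema file names are assumptions:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class CheckSchemaEvolution {
  public static void main(String[] args) throws Exception {
    Schema oldSchema = new Schema.Parser().parse(new File("tweets_v1.avsc"));
    Schema newSchema = new Schema.Parser().parse(new File("tweets_v2.avsc"));

    // Can a reader using the new schema decode data written with the old one?
    SchemaCompatibilityType result = SchemaCompatibility
        .checkReaderWriterCompatibility(newSchema, oldSchema)
        .getType();

    if (result != SchemaCompatibilityType.COMPATIBLE) {
      throw new IllegalStateException("Schema change is not backward compatible");
    }
  }
}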
• 42. Parquet
Ad hoc use case! Cloudera Impala's default file format! Execution engine agnostic! HIVE-5783! Let it handle block size!

create table tweets (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
) STORED AS PARQUET;

insert into table tweets select * from tweets_ext;
GABRIELE MODENA LEARNING HADOOP 2
  • 44. Table Optimization Create tables with workloads in mind! Partitions! Bucketing! Join strategies GABRIELE MODENA LEARNING HADOOP 2
• 45. Plenty of tunables
# partitions
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.created.files=1000000;

# merge small files
SET hive.merge.size.per.task=256000000;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;

# Compression
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
GABRIELE MODENA LEARNING HADOOP 2
• 46. Apache Oozie
Data pipelines! Workflow execution and coordination! Time and availability based execution! Configuration over code! MapReduce centric! Actions: Hive, Pig, fs, shell, sqoop
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
  ...
  <action name="[NODE-NAME]">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>[JOB-TRACKER]</job-tracker>
      <name-node>[NAME-NODE]</name-node>
      <prepare>
        <delete path="[PATH]"/>
        ...
        <mkdir path="[PATH]"/>
        ...
      </prepare>
      <job-xml>[HIVE SETTINGS FILE]</job-xml>
      <configuration>
        <property>
          <name>[PROPERTY-NAME]</name>
          <value>[PROPERTY-VALUE]</value>
        </property>
        ...
      </configuration>
      <script>[HIVE-SCRIPT]</script>
      <param>[PARAM-VALUE]</param>
      ...
      <param>[PARAM-VALUE]</param>
      <file>[FILE-PATH]</file>
      ...
      <archive>[FILE-PATH]</archive>
      ...
    </hive>
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
  </action>
  ...
</workflow-app>
GABRIELE MODENA LEARNING HADOOP 2
  • 47. EDA Luminosity in xkcd comics (courtesy of rbloggers) GABRIELE MODENA LEARNING HADOOP 2
• 50. Spark & IPython Notebook
from pyspark import SparkContext

sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

Works with Avro, Parquet etc! Move computation close to data! Numpy, scikit-learn, matplotlib! Setup can be tedious
GABRIELE MODENA LEARNING HADOOP 2
  • 51. Stream processing Statistics in real time! Data feeds! Machine generated (sensor data, logs)! Predictive analytics GABRIELE MODENA LEARNING HADOOP 2
  • 52. Several niches Low latency (storm, s4)! Persistency and resiliency (samza)! Apply complex logic (spark-streaming)! Type of message stream (kafka) GABRIELE MODENA LEARNING HADOOP 2
  • 53. Apache Samza Kafka for streaming! YARN for resource management and execution! Samza API for processing! Sweet spot: seconds, minutes [stack diagram: Samza API on YARN on Kafka] GABRIELE MODENA LEARNING HADOOP 2
  • 54. public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator)
  • 55. public void window( MessageCollector collector, TaskCoordinator coordinator)
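Putting the two callbacks together: a minimal sketch of a task that keeps rolling word counts in process() and flushes them on each window(). The Kafka output stream name is an assumption, wired up in the job's properties file rather than shown here:

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class WordCountTask implements StreamTask, WindowableTask {
  // Hypothetical output stream on the "kafka" system
  private static final SystemStream OUTPUT = new SystemStream("kafka", "word-counts");

  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    // Called once per incoming message
    for (String word : ((String) envelope.getMessage()).split(" ")) {
      Integer current = counts.get(word);
      counts.put(word, current == null ? 1 : current + 1);
    }
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    // Called periodically, at the interval configured via task.window.ms
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, e.getKey(), e.getValue()));
    }
    counts.clear();
  }
}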
  • 56. Bootstrap streams Samza can consume messages from multiple streams! Rewind on historical data does not preserve ordering! If a task has any bootstrap streams defined then it will read these streams until they are fully processed GABRIELE MODENA LEARNING HADOOP 2
  • 57. Predictive modelling GABRIELE MODENA LEARNING HADOOP 2
  • 58. Learning from data Predictive model = statistical learning! Simple = parallelizable! Garbage in = garbage out GABRIELE MODENA LEARNING HADOOP 2
  • 59. Couple of things we can do 1. Parameter tuning 2. Feature engineering 3. Learn on all data GABRIELE MODENA LEARNING HADOOP 2
  • 60. Train against all data Ensemble methods (cooperative and competitive)! Avoid multi pass / iterations! Apply models to live data! Keep models up to date GABRIELE MODENA LEARNING HADOOP 2
  • 61. Off the shelf Apache Mahout (MapReduce, Spark) ! MLlib (Spark)! Cascading-pattern (MapReduce, Tez, Spark) GABRIELE MODENA LEARNING HADOOP 2
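As a taste of the off-the-shelf route, a minimal MLlib sketch that trains a logistic regression classifier on an RDD of labeled points. The input path and the label,features line layout are assumptions:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class TrainClassifier {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "train-classifier");

    // Assume lines of the form: label,f1 f2 f3 ...
    JavaRDD<LabeledPoint> training = sc.textFile("hdfs:///data/training.csv").map(
        new Function<String, LabeledPoint>() {
          public LabeledPoint call(String line) {
            String[] parts = line.split(",");
            String[] raw = parts[1].split(" ");
            double[] features = new double[raw.length];
            for (int i = 0; i < raw.length; i++) {
              features[i] = Double.parseDouble(raw[i]);
            }
            return new LabeledPoint(Double.parseDouble(parts[0]), Vectors.dense(features));
          }
        });

    // Train on the full dataset; SGD makes one pass over the data per iteration
    LogisticRegressionModel model = LogisticRegressionWithSGD.train(training.rdd(), 100);
    System.out.println("weights: " + model.weights());
  }
}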
  • 62. Apache Mahout 0.9 Once the default solution for ML with MapReduce! Quality may vary! Good components are really good! Is it a library? A framework? A recommendation system? GABRIELE MODENA LEARNING HADOOP 2
  • 63. The good The go-to if you need a Recommendation System! SGD (optimization)! Random Forest (classification/regression)! SVD (feature engineering)! ALS (collaborative filtering) GABRIELE MODENA LEARNING HADOOP 2
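Mahout's SGD family, for instance, is a plain Java API that trains online on a single node, no MapReduce involved. A minimal sketch of online logistic regression; the two-feature encoding is a made-up example:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdExample {
  public static void main(String[] args) {
    // 2 target categories, 2 features, L1 prior for sparse weights
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 2, new L1());

    // Online training: one (label, feature vector) pair at a time
    Vector positive = new DenseVector(new double[] {1.0, 0.2});
    Vector negative = new DenseVector(new double[] {0.1, 0.9});
    for (int i = 0; i < 100; i++) {
      learner.train(1, positive);
      learner.train(0, negative);
    }

    // classifyFull returns a vector of per-category scores
    System.out.println(learner.classifyFull(positive));
  }
}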
  • 64. The puzzling SVM?! Model updates are implementation specific! Feature encoding and input format are often model specific GABRIELE MODENA LEARNING HADOOP 2
  • 65. Apache Mahout trunk Moving away from MapReduce! Spark + Scala DSL = new classes of algorithms! Major code cleanup GABRIELE MODENA LEARNING HADOOP 2
  • 66. It needs major infrastructure work around it
  • 68. There’s a buzzword for that http://lambda-architecture.net/ GABRIELE MODENA LEARNING HADOOP 2
  • 70. With Hadoop 2 Cluster as an Operating System! YARN, mostly! Multiparadigm, better interop! Same system, different tools, multiple use cases! Batch + interactive GABRIELE MODENA LEARNING HADOOP 2
  • 71. This said Ops is where a lot of time goes! Building clusters is hard! Distro fragmentation! Bleeding edge rush! Heavy lifting needed GABRIELE MODENA LEARNING HADOOP 2