1
Hadoop Puzzlers
Aaron Myers & Daniel Templeton
Cloudera, Inc.
2
Your Hosts
Aaron “ATM” Myers
• AKA “Cash Money”
• Software Engineer
• Apache Hadoop Committer
Daniel Templeton
• Certification Developer
• Crusty, old HPC guy
• Likes Perl
©2014 Cloudera, Inc. All rights reserved.
3
What is a Hadoop Puzzler?
• Shameless knockoff of Josh Bloch’s Java Puzzlers talks
• We’ll walk through a puzzle
• You vote on the answer
• We all learn a valuable lesson
4
Let’s try it, OK?
5
An Easy One
public class MaxMap
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  Text k = new Text();
  IntWritable v = new IntWritable();

  protected void map(LongWritable key, Text val, Context c) … {
    String[] parts = val.toString().split(",");
    k.set(parts[0]);
    v.set(Integer.parseInt(parts[1]));
    c.write(k, v);
  }
}

public class MaxReduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
    IntWritable max = new IntWritable(0);
    for (IntWritable v : values)
      if (v.get() > max.get())
        max = v;
    c.write(key, max);
  }
}
6
An Easy One
The data:
A,1
A,5
A,3
The results:
a) A 5
b) A 1
c) A 3
d) The job fails
9
An Easy One (Answer)
The result:
c) A 3
10
An Easy One (Problem)
for (IntWritable v : values)
  if (v.get() > max.get())
    max = v;  // stores the reused Writable itself, not its value
c.write(key, max);
11
An Easy One (Moral)
• MapReduce reuses Writables whenever it can
• That includes while iterating through the values
• Always be careful to store only the value, not the Writable!
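The reuse effect can be reproduced without a cluster. The following is a minimal sketch, not Hadoop code: IntBox and reusing() are hypothetical stand-ins for IntWritable and for the framework's values iterable, assumed here to refill one shared object on every step.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java sketch of Writable reuse; no Hadoop classes are involved. */
final class WritableReuseDemo {

    /** Hypothetical mutable holder playing the role of IntWritable. */
    static final class IntBox {
        private int value;
        IntBox(int value) { this.value = value; }
        int get() { return value; }
        void set(int value) { this.value = value; }
    }

    /** Hypothetical iterable that refills ONE shared IntBox on every step. */
    static Iterable<IntBox> reusing(List<Integer> data) {
        IntBox shared = new IntBox(0);
        return () -> new Iterator<IntBox>() {
            private final Iterator<Integer> it = data.iterator();
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public IntBox next() { shared.set(it.next()); return shared; }
        };
    }

    /** The puzzle's logic: keeps a reference to the shared holder. */
    static int maxByReference(Iterable<IntBox> values) {
        IntBox max = new IntBox(0);
        for (IntBox v : values) {
            if (v.get() > max.get()) {
                max = v; // from here on, max and v are the same object
            }
        }
        return max.get();
    }

    /** The fix: copy the primitive out of the holder. */
    static int maxByValue(Iterable<IntBox> values) {
        int max = 0;
        for (IntBox v : values) {
            max = Math.max(max, v.get());
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 5, 3);
        System.out.println(maxByReference(reusing(data))); // 3
        System.out.println(maxByValue(reusing(data)));     // 5
    }
}
```

Once max aliases the shared holder, the comparison can never be true again, so the "maximum" is simply whatever value was read last.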
12
A Sinking Feeling
public class AsyncSubmit extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new Configuration(), new AsyncSubmit(), args);
    System.exit(ret);
  }

  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf());
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(false);
    return job.isComplete() ? 1 : 0;
  }
}
13
A Sinking Feeling
The data:
The complete works of
William Shakespeare
The results:
a) Fails to compile
b) The job fails
c) Exits with 0
d) Exits with 1
16
A Sinking Feeling (Answer)
The result:
d) Exits with 1
17
A Sinking Feeling (Problem)
job.waitForCompletion(false);     // false only turns off progress output; the call still blocks
return job.isComplete() ? 1 : 0;  // the job has finished by now, so this returns 1
18
A Sinking Feeling (Moral)
• Read the API docs!
• Sometimes the obvious meanings of methods and parameters aren’t correct
• The parameter for waitForCompletion() controls whether status output is printed
• The driver does wait for the job to exit but does not print all the job status information
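A minimal sketch of the exit-code logic, assuming a hypothetical FakeJob stand-in that mirrors only the two calls used in the puzzle (a blocking waitForCompletion(boolean verbose) that returns whether the job succeeded, and isComplete()). It is not the Hadoop Job class.

```java
/** Plain-Java sketch; FakeJob is hypothetical and only mimics the two calls discussed. */
final class ExitCodeDemo {

    /** Hypothetical stand-in for a job that either succeeds or fails. */
    static final class FakeJob {
        private final boolean succeeds;
        private boolean complete;

        FakeJob(boolean succeeds) { this.succeeds = succeeds; }

        /** Always waits for the job; verbose only toggles progress printing. */
        boolean waitForCompletion(boolean verbose) {
            if (verbose) {
                System.out.println("map 100% reduce 0%");
            }
            complete = true;
            return succeeds;
        }

        /** True once the job has finished, whether it succeeded or not. */
        boolean isComplete() { return complete; }
    }

    /** The puzzle's logic: any finished job yields exit code 1. */
    static int puzzleExitCode(FakeJob job) {
        job.waitForCompletion(false);
        return job.isComplete() ? 1 : 0;
    }

    /** Conventional mapping: 0 on success, 1 on failure. */
    static int conventionalExitCode(FakeJob job) {
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(puzzleExitCode(new FakeJob(true)));        // 1
        System.out.println(conventionalExitCode(new FakeJob(true)));  // 0
    }
}
```

Using the boolean returned by waitForCompletion() keeps the exit code tied to success rather than to mere completion.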
19
Do-over
public class MaxMap
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  Text k = new Text();
  IntWritable v = new IntWritable();

  protected void map(LongWritable key, Text val, Context c) … {
    String[] parts = val.toString().split(",");
    k.set(parts[0]);
    v.set(Integer.parseInt(parts[1]));
    c.write(k, v);
  }
}

public class MaxReduceRedux
    extends Reducer<Text, Text, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
    int max = 0;
    for (IntWritable v : values)
      if (v.get() > max)
        max = v.get();
    c.write(key, new IntWritable(max));
  }
}
20
Do-over
The data:
A,1
A,5
The results:
a) A 5
b) A 1
c) A 1
A 5
d) The job fails
23
Do-over (Answer)
The result:
c) A 1
A 5
24
Do-over (Problem)
extends Reducer<Text, Text, Text, IntWritable> {  // input value type declared as Text
protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {  // takes IntWritable values, so it overloads reduce() instead of overriding it
25
Do-over (Moral)
• Mismatched signatures can lead to unexpected behaviors because of exposed base class method implementations
• ALWAYS use @Override!
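The overload-versus-override trap can be shown in plain Java. In this sketch, BaseReducer and its run() method are hypothetical stand-ins for a framework base class whose default reduce() passes every value through unchanged; the real Reducer API is not used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of an accidental overload; all class names are hypothetical. */
final class OverrideDemo {

    /** Stand-in base class; its default reduce() is the identity. */
    static class BaseReducer<K, V> {
        final List<String> out = new ArrayList<>();

        void reduce(K key, Iterable<V> values) {
            for (V v : values) {
                out.add(key + " " + v);
            }
        }

        /** Plays the framework's role: always calls reduce(K, Iterable<V>). */
        List<String> run(K key, Iterable<V> values) {
            reduce(key, values);
            return out;
        }
    }

    /** Values declared as String, method written for Integer: an overload. */
    static class MismatchedMax extends BaseReducer<String, String> {
        // Adding @Override to this method turns the mistake into a compile error.
        void reduce(String key, Iterable<Integer> values) {
            int max = 0;
            for (int v : values) {
                max = Math.max(max, v);
            }
            out.add(key + " " + max);
        }
    }

    /** Type parameters and signature agree, so @Override compiles. */
    static class CheckedMax extends BaseReducer<String, Integer> {
        @Override
        void reduce(String key, Iterable<Integer> values) {
            int max = 0;
            for (int v : values) {
                max = Math.max(max, v);
            }
            out.add(key + " " + max);
        }
    }

    public static void main(String[] args) {
        System.out.println(new MismatchedMax().run("A", Arrays.asList("1", "5"))); // [A 1, A 5]
        System.out.println(new CheckedMax().run("A", Arrays.asList(1, 5)));        // [A 5]
    }
}
```

MismatchedMax compiles cleanly, yet its reduce() is never called through run(); the base-class identity implementation runs instead.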
26
Joining Forces
hive> DESCRIBE table1;
OK
id int
phone string
state string
Time taken: 0.236 seconds
hive> DESCRIBE table2;
OK
id int
city string
state string
Time taken: 0.116 seconds
hive> CREATE TABLE table3 AS SELECT
    table2.*, table1.phone, table1.state AS s
    FROM table1 JOIN table2 ON (table1.id == table2.id);
…
hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv';
…
hive> exit
$ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l
27
Joining Forces
The data:
hive> SELECT * FROM table1;
OK
1 6506506500 CA
2 2282282280 MS
Time taken: 1.006 seconds
hive> SELECT * FROM table2;
OK
1 Palo Alto CA
2 Gautier MS
Time taken: 1.202 seconds
The results:
a) 5
b) 4
c) 1
d) The join fails
30
Joining Forces (Answer)
The result:
c) 1
31
Joining Forces (Problem)
$ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l
# The exported row contains no commas, so tr has nothing to split
32
Joining Forces (Moral)
• Hive’s default delimiter is 0x01 (CTRL-A)
• Easy to assume export will use a sane delimiter, but it doesn’t
• Incidentally, Hive’s join rules are pretty sane and work as you’d expect
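A small sketch of why the comma count comes out as 1. The sample row is an assumption: it is rebuilt from the first rows of table1 and table2 shown earlier, joined with the 0x01 delimiter; sampleRow() and fieldCount() are illustrative helper names.

```java
import java.util.regex.Pattern;

/** Plain-Java sketch: splitting a CTRL-A delimited row on the wrong and the right delimiter. */
final class DelimiterDemo {

    /** Hive's default field delimiter, 0x01 (CTRL-A). */
    static final char HIVE_DEFAULT_DELIMITER = '\u0001';

    /** Assumed shape of the first joined row: id, city, state, phone, s. */
    static String sampleRow() {
        return String.join(String.valueOf(HIVE_DEFAULT_DELIMITER),
                "1", "Palo Alto", "CA", "6506506500", "CA");
    }

    /** Counts fields by splitting on a literal delimiter, keeping empty fields. */
    static int fieldCount(String line, char delimiter) {
        String literal = Pattern.quote(String.valueOf(delimiter));
        return line.split(literal, -1).length;
    }

    public static void main(String[] args) {
        String row = sampleRow();
        System.out.println(fieldCount(row, ','));                    // 1
        System.out.println(fieldCount(row, HIVE_DEFAULT_DELIMITER)); // 5
    }
}
```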
33
Close Enough
public class MaxMap
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  Text k = new Text();
  IntWritable v = new IntWritable();

  protected void map(LongWritable key, Text val, Context c) … {
    String[] parts = val.toString().split(",");
    k.set(parts[0]);
    v.set(Integer.parseInt(parts[1]));
    c.write(k, v);
  }
}

public class Top20Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
    float max = 0.0f;
    for (IntWritable v : values)
      if (v.get() > max) max = v.get();
    max *= 0.8f;
    for (IntWritable v : values)
      if (v.get() >= max)
        c.write(key, v);
  }
}
34
Close Enough
The data:
A,1
A,5
A,4
The results:
a) (no output)
b) A 5
c) A 5
A 4
d) The job fails
37
Close Enough (Answer)
The result:
a) (no output)
38
Close Enough (Problem)
for (IntWritable v : values)  // second pass over the same iterable: nothing is left to read
  if (v.get() >= max)
    c.write(key, v);
39
Close Enough (Moral)
• For scalability reasons, the values iterable is single-shot
• Subsequent iterators iterate over an empty collection
• Store values (not Writables!) in the first pass
• Better yet, restructure the logic to avoid storing all values in memory
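A sketch of the single-shot behavior and of the "store values in the first pass" workaround. The singleShot() helper is a hypothetical stand-in that hands out the same underlying iterator on every call, which is assumed here to be the relevant property of the reducer's values iterable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java sketch of a single-shot iterable; no Hadoop classes are involved. */
final class SingleShotDemo {

    /** Hypothetical iterable whose iterator() always returns the same iterator. */
    static <T> Iterable<T> singleShot(List<T> data) {
        Iterator<T> shared = data.iterator();
        return () -> shared;
    }

    /** The puzzle's logic: the second loop finds the iterator already exhausted. */
    static List<Integer> twoPass(Iterable<Integer> values) {
        float max = 0.0f;
        for (int v : values) {
            if (v > max) max = v;
        }
        max *= 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            if (v >= max) out.add(v);
        }
        return out;
    }

    /** Workaround: copy the primitive values during the only pass available. */
    static List<Integer> buffered(Iterable<Integer> values) {
        List<Integer> seen = new ArrayList<>();
        int max = 0;
        for (int v : values) {
            seen.add(v);
            max = Math.max(max, v);
        }
        float threshold = max * 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : seen) {
            if (v >= threshold) out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(twoPass(singleShot(Arrays.asList(1, 5, 4))));  // []
        System.out.println(buffered(singleShot(Arrays.asList(1, 5, 4)))); // [5, 4]
    }
}
```

The buffered version holds every value for a key in memory, which is exactly the cost the last bullet warns about.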
40
Overbyte
public class MinLineMap
    extends Mapper<LongWritable, Text, Text, Text> {
  Text k = new Text();

  protected void map(LongWritable key, Text value, Context c) … {
    String val = value.toString();
    k.set(val.substring(0, 1));
    c.write(k, value);
  }
}

public class MinLineReduce
    extends Reducer<Text, Text, Text, IntWritable> {
  protected void reduce(Text key, Iterable<Text> values, Context c) … {
    int min = Integer.MAX_VALUE;
    for (Text v : values)
      if (v.getBytes().length < min)
        min = v.getBytes().length;
    c.write(key, new IntWritable(min));
  }
}
41
Overbyte
The data:
Hadoop
Spark
Hive
Sqoop2
The results:
a) H 4
S 5
b) H 6
S 5
c) H 6
S 6
d) The job fails
44
Overbyte (Answer)
The result:
c) H 6
S 6
45
Overbyte (Problem)
if (v.getBytes().length < min)  // length of the reused backing array, not of the current value
  min = v.getBytes().length;
46
Overbyte (Moral)
• Writables get reused in loops
• In addition, Text.getBytes() returns the backing byte array, which is reused across calls and may be longer than the current value
• The net result is wrongness
• Text.getLength() is the correct way to get the length of a Text
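The difference between the backing array's length and the valid length can be sketched with a simplified buffer. ReusableText below is a hypothetical stand-in, assumed to grow its array when needed and never shrink it; it only borrows the method names getBytes() and getLength() and is not the Hadoop Text class.

```java
import java.nio.charset.StandardCharsets;

/** Plain-Java sketch of a reusable buffer whose array can outlive its contents. */
final class BackingArrayDemo {

    /** Hypothetical, simplified stand-in for a reusable text holder. */
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length;

        void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (bytes.length < src.length) {
                bytes = new byte[src.length]; // grow when needed, never shrink
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        /** The whole backing array, including stale bytes past the valid length. */
        byte[] getBytes() { return bytes; }

        /** Number of valid bytes for the current value. */
        int getLength() { return length; }
    }

    /** Minimum "length" over the lines, measured one of the two ways. */
    static int minLength(boolean useBackingArray, String... lines) {
        ReusableText text = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            text.set(line);
            int len = useBackingArray ? text.getBytes().length : text.getLength();
            min = Math.min(min, len);
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(minLength(true, "Hadoop", "Hive"));  // 6
        System.out.println(minLength(false, "Hadoop", "Hive")); // 4
    }
}
```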
47
What We Learned
• Beware of reuse of Writables
• Always use @Override so your compiler can help you
• Don’t assume you know what a method does because of the name or parameters; read the docs!
• Sometimes scalability is inconvenient
48
One Closing Note
• Hadoop is still not easy
• Being good takes effort and experience
• Recognizing Hadoop talent can be hard
• Cloudera is working to make Hadoop talent easier to recognize through certification
http://cloudera.com/content/cloudera/en/training/certification.html
49
Aaron Myers &
Daniel Templeton

Store and Process Big Data with Hadoop and CassandraStore and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and Cassandra
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
 
State of the .Net Performance
State of the .Net PerformanceState of the .Net Performance
State of the .Net Performance
 
Celery - A Distributed Task Queue
Celery - A Distributed Task QueueCelery - A Distributed Task Queue
Celery - A Distributed Task Queue
 
Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup Edinburgh
 
Anti patterns
Anti patternsAnti patterns
Anti patterns
 
TechTalk - Dotnet
TechTalk - DotnetTechTalk - Dotnet
TechTalk - Dotnet
 
C# 7.x What's new and what's coming with C# 8
C# 7.x What's new and what's coming with C# 8C# 7.x What's new and what's coming with C# 8
C# 7.x What's new and what's coming with C# 8
 
Using xUnit as a Swiss-Aarmy Testing Toolkit
Using xUnit as a Swiss-Aarmy Testing ToolkitUsing xUnit as a Swiss-Aarmy Testing Toolkit
Using xUnit as a Swiss-Aarmy Testing Toolkit
 
RxJava on Android
RxJava on AndroidRxJava on Android
RxJava on Android
 
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
 
What is new in Java 8
What is new in Java 8What is new in Java 8
What is new in Java 8
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Plus de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Dernier

Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 

Dernier (20)

Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 

Hadoop Puzzlers

  • 1. Hadoop Puzzlers
    Aaron Myers & Daniel Templeton
    Cloudera, Inc.
  • 2. Your Hosts
    Aaron “ATM” Myers
      • AKA “Cash Money”
      • Software Engineer
      • Apache Hadoop Committer
    Daniel Templeton
      • Certification Developer
      • Crusty, old HPC guy
      • Likes Perl
    ©2014 Cloudera, Inc. All rights reserved.
  • 3. What is a Hadoop Puzzler
      • Shameless knockoff of Josh Bloch’s Java Puzzlers talks
      • We’ll walk through a puzzle
      • You vote on the answer
      • We all learn a valuable lesson
  • 4. Let’s try it, OK?
  • 5. An Easy One
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class MaxReduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context c) … {
        IntWritable max = new IntWritable(0);
        for (IntWritable v : values)
          if (v.get() > max.get())
            max = v;
        c.write(key, max);
      }
    }
  • 6. An Easy One
    The data:
      A,1
      A,5
      A,3
    The results:
      a) A 5
      b) A 1
      c) A 3
      d) The job fails
  • 7. An Easy One
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class MaxReduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context c) … {
        IntWritable max = new IntWritable(0);
        for (IntWritable v : values)
          if (v.get() > max.get())
            max = v;
        c.write(key, max);
      }
    }
    A 1
    A 5
    A 3
  • 8. An Easy One
    The data:
      A,1
      A,5
      A,3
    The results:
      a) A 5
      b) A 1
      c) A 3
      d) The job fails
  • 9. An Easy One (Answer)
    The data:
      A,1
      A,5
      A,3
    The results:
      a) A 5
      b) A 1
      c) A 3
      d) The job fails
  • 10. An Easy One (Problem)
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class MaxReduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context c) … {
        IntWritable max = new IntWritable(0);
        for (IntWritable v : values)
          if (v.get() > max.get())
            max = v;
        c.write(key, max);
      }
    }
  • 11. An Easy One (Moral)
      • MapReduce reuses Writables whenever it can
      • That includes while iterating through the values
      • Always be careful to only store the value instead of the Writable!
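The object reuse described in this moral can be reproduced without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: `IntBox` and `reusing` are hypothetical stand-ins for `IntWritable` and the reducer's values iterator, assumed here to hand back one shared mutable object on every step.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for a mutable Writable such as IntWritable.
    public static final class IntBox {
        private int value;
        public IntBox(int value) { this.value = value; }
        public int get() { return value; }
        public void set(int value) { this.value = value; }
    }

    // Returns an Iterable that yields the SAME IntBox each time,
    // overwriting its contents, to mimic framework object reuse.
    public static Iterable<IntBox> reusing(List<Integer> raw) {
        IntBox shared = new IntBox(0);
        return () -> new Iterator<IntBox>() {
            private final Iterator<Integer> it = raw.iterator();
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public IntBox next() { shared.set(it.next()); return shared; }
        };
    }

    // Mirrors the slide's reducer: keeps a reference to the reused object.
    public static int maxByReference(Iterable<IntBox> values) {
        IntBox max = new IntBox(0);
        for (IntBox v : values) {
            if (v.get() > max.get()) max = v;
        }
        return max.get();
    }

    // Stores the primitive value, so later overwrites cannot affect it.
    public static int maxByValue(Iterable<IntBox> values) {
        int max = 0;
        for (IntBox v : values) {
            if (v.get() > max) max = v.get();
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 5, 3);
        System.out.println(maxByReference(reusing(data))); // prints 3
        System.out.println(maxByValue(reusing(data)));     // prints 5
    }
}
```

In this simulation, once `max` points at the shared object, every later comparison compares the object with itself, so the "maximum" ends up being whatever value arrived last.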
  • 12. A Sinking Feeling
    public class AsyncSubmit extends Configured implements Tool {
      public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(
            new Configuration(), new AsyncSubmit(), args);
        System.exit(ret);
      }
      public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(false);
        return job.isComplete() ? 1 : 0;
      }
    }
  • 13. A Sinking Feeling
    The data:
      The complete works of William Shakespeare
    The results:
      a) Fails to compile
      b) The job fails
      c) Exits with 0
      d) Exits with 1
  • 14. A Sinking Feeling
    public class AsyncSubmit extends Configured implements Tool {
      public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(
            new Configuration(), new AsyncSubmit(), args);
        System.exit(ret);
      }
      public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(false);
        return job.isComplete() ? 1 : 0;
      }
    }
    The complete works of William Shakespeare
  • 15. A Sinking Feeling
    The data:
      The complete works of William Shakespeare
    The results:
      a) Fails to compile
      b) The job fails
      c) Exits with 0
      d) Exits with 1
  • 16. A Sinking Feeling (Answer)
    The data:
      The complete works of William Shakespeare
    The results:
      a) Fails to compile
      b) The job fails
      c) Exits with 0
      d) Exits with 1
  • 17. A Sinking Feeling (Problem)
    public class AsyncSubmit extends Configured implements Tool {
      public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(
            new Configuration(), new AsyncSubmit(), args);
        System.exit(ret);
      }
      public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(false);
        return job.isComplete() ? 1 : 0;
      }
    }
  • 18. A Sinking Feeling (Moral)
      • Read the API docs!
      • Sometimes the obvious meanings of methods and parameters aren’t correct
      • Parameter for waitForCompletion() controls whether status output is printed
      • Driver does wait for job to exit but does not print all the job status information
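The exit-code logic from the driver can be traced with a stub. This is a rough sketch under stated assumptions: `StubJob` is a hypothetical class that mirrors only the two method names used on the slide, and it assumes (as the Hadoop `Job` API documents) that `waitForCompletion` blocks regardless of its flag and returns whether the job succeeded. `exitCodeConventional` shows one common way to derive the exit code; it is not taken from the slides.

```java
public class ExitCodeDemo {
    // Hypothetical stub; not the Hadoop Job class.
    public static final class StubJob {
        private final boolean succeeds;
        private boolean complete;

        public StubJob(boolean succeeds) { this.succeeds = succeeds; }

        // The flag only controls whether progress is printed;
        // the call "blocks" until the job is done either way.
        public boolean waitForCompletion(boolean verbose) {
            if (verbose) {
                System.out.println("job finished, success=" + succeeds);
            }
            complete = true;
            return succeeds;
        }

        public boolean isComplete() { return complete; }
    }

    // The logic from the slide: ignores the return value and
    // maps "complete" to exit code 1.
    public static int exitCodeFromSlide(StubJob job) {
        job.waitForCompletion(false);
        return job.isComplete() ? 1 : 0;
    }

    // A common alternative: 0 on success, non-zero on failure.
    public static int exitCodeConventional(StubJob job) {
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(exitCodeFromSlide(new StubJob(true)));     // prints 1
        System.out.println(exitCodeFromSlide(new StubJob(false)));    // prints 1
        System.out.println(exitCodeConventional(new StubJob(false))); // prints 1 after the status line
    }
}
```

Because the driver always waits, `isComplete()` is always true by the time it is checked, so the slide's version reports 1 whether the job succeeded or failed.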
  • 19. Do-over
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class MaxReduceRedux
        extends Reducer<Text, Text, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context c) … {
        int max = 0;
        for (IntWritable v : values)
          if (v.get() > max)
            max = v.get();
        c.write(key, new IntWritable(max));
      }
    }
  • 20. Do-over
    The data:
      A,1
      A,5
    The results:
      a) A 5
      b) A 1
      c) A 1
         A 5
      d) The job fails
  • 21. Do-over
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class MaxReduceRedux
        extends Reducer<Text, Text, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context c) … {
        int max = 0;
        for (IntWritable v : values)
          if (v.get() > max)
            max = v.get();
        c.write(key, new IntWritable(max));
      }
    }
    A 1
    A 5
  • 22. Do-over
    The data:
      A,1
      A,5
    The results:
      a) A 5
      b) A 1
      c) A 1
         A 5
      d) The job fails
  • 23. Do-over (Answer)
    The data:
      A,1
      A,5
    The results:
      a) A 5
      b) A 1
      c) A 1
         A 5
      d) The job fails
  • 24. Do-over (Problem)
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class MaxReduceRedux
        extends Reducer<Text, Text, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context c) … {
        int max = 0;
        for (IntWritable v : values)
          if (v.get() > max)
            max = v.get();
        c.write(key, new IntWritable(max));
      }
    }
  • 25. Do-over (Moral)
      • Mismatched signatures can lead to unexpected behaviors because of exposed base class method implementations
      • ALWAYS use @Override!
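The effect of a mismatched signature can be shown in plain Java. This is a minimal sketch, not Hadoop code: `BaseReducer`, `BrokenReducer`, `FixedReducer`, and `runFramework` are hypothetical names, and the base class is assumed to ship a default pass-through implementation that the framework calls through the base type.

```java
public class OverrideDemo {
    // Hypothetical framework base class with a default (identity) implementation.
    public static class BaseReducer {
        public String reduce(Object value) { return "identity:" + value; }
    }

    // Intended to customize reduce(), but the parameter type differs,
    // so this is an overload and the base method stays exposed.
    public static class BrokenReducer extends BaseReducer {
        public String reduce(String value) { return "max:" + value; }
    }

    // Same intent with a matching signature. Adding @Override to
    // BrokenReducer.reduce(String) would be rejected by the compiler.
    public static class FixedReducer extends BaseReducer {
        @Override
        public String reduce(Object value) { return "max:" + value; }
    }

    // The "framework" only knows the base type, so it always calls reduce(Object).
    public static String runFramework(BaseReducer reducer, Object value) {
        return reducer.reduce(value);
    }

    public static void main(String[] args) {
        System.out.println(runFramework(new BrokenReducer(), "5")); // prints identity:5
        System.out.println(runFramework(new FixedReducer(), "5"));  // prints max:5
    }
}
```

The broken variant compiles and runs without any warning, which is why the annotation is worth the extra line: it turns a silent fallback to the base implementation into a compile-time error.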
  • 26. Joining Forces
    hive> DESCRIBE table1;
    OK
    id      int
    phone   string
    state   string
    Time taken: 0.236 seconds
    hive> DESCRIBE table2;
    OK
    id      int
    city    string
    state   string
    Time taken: 0.116 seconds
    hive> CREATE TABLE table3 AS
          SELECT table2.*,table1.phone,table1.state AS s
          FROM table1 JOIN table2 ON (table1.id == table2.id);
    …
    hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv';
    …
    hive> exit
    $ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l
  • 27. Joining Forces
    The data:
      hive> SELECT * FROM table1;
      OK
      1   6506506500   CA
      2   2282282280   MS
      Time taken: 1.006 seconds
      hive> SELECT * FROM table2;
      OK
      1   Palo Alto   CA
      2   Gautier     MS
      Time taken: 1.202 seconds
    The results:
      a) 5
      b) 4
      c) 1
      d) The join fails
  • 30. Joining Forces (Answer)
    The data:
      hive> SELECT * FROM table1;
      OK
      1   6506506500   CA
      2   2282282280   MS
      Time taken: 1.006 seconds
      hive> SELECT * FROM table2;
      OK
      1   Palo Alto   CA
      2   Gautier     MS
      Time taken: 1.202 seconds
    The results:
      a) 5
      b) 4
      c) 1
      d) The join fails
  • 31. Joining Forces (Problem)
    hive> DESCRIBE table1;
    OK
    id      int
    phone   string
    state   string
    Time taken: 0.236 seconds
    hive> DESCRIBE table2;
    OK
    id      int
    city    string
    state   string
    Time taken: 0.116 seconds
    hive> CREATE TABLE table3 AS
          SELECT table2.*, table1.phone, table1.state AS s
          FROM table1 JOIN table2 ON (table1.id == table2.id);
    …
    hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv';
    …
    hive> exit
    $ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l
  • 32. Joining Forces (Moral)
    • Hive’s default field delimiter is 0x01 (CTRL-A), not a comma
    • It is easy to assume that export will use a sane delimiter – it doesn’t
    • Incidentally, Hive’s join rules are pretty sane and work as you’d expect
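A small sketch of why counting comma-separated fields goes wrong here. It assumes the exported row is plain text whose five columns (`table2.*`, `phone`, `s`) are joined by the 0x01 delimiter named in the moral; `DelimiterDemo`, `countFields`, and `sampleRow` are illustrative names only.

```java
import java.util.regex.Pattern;

class DelimiterDemo {
    // The default field delimiter named in the moral: 0x01 (CTRL-A).
    static final char CTRL_A = '\u0001';

    // Counts the fields in a line for a given single-character delimiter.
    static int countFields(String line, char delimiter) {
        String regex = Pattern.quote(String.valueOf(delimiter));
        return line.split(regex, -1).length;
    }

    // Assumed layout of the first table3 row: id, city, state, phone, s.
    static String sampleRow() {
        return String.join(String.valueOf(CTRL_A),
                "1", "Palo Alto", "CA", "6506506500", "CA");
    }

    public static void main(String[] args) {
        String row = sampleRow();
        System.out.println(countFields(row, ','));    // 1: the row has no commas
        System.out.println(countFields(row, CTRL_A)); // 5: one per column
    }
}
```

Splitting on a comma sees the whole row as a single field, which mirrors what the `tr , '\n' | wc -l` pipeline would report under the same assumption.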
  • 33. Close Enough
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class Top20Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
        float max = 0.0f;
        for (IntWritable v : values)
          if (v.get() > max)
            max = v.get();
        max *= 0.8f;
        for (IntWritable v : values)
          if (v.get() >= max)
            c.write(key, v);
      }
    }
  • 34. Close Enough
    The data:
      A,1
      A,5
      A,4
    The results:
      a) (no output)
      b) A 5
      c) A 5
         A 4
      d) The job fails
  • 37. Close Enough (Answer)
    The data:
      A,1
      A,5
      A,4
    The results:
      a) (no output)
      b) A 5
      c) A 5
         A 4
      d) The job fails
  • 38. Close Enough (Problem)
    public class MaxMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      Text k = new Text();
      IntWritable v = new IntWritable();
      protected void map(LongWritable key, Text val, Context c) … {
        String[] parts = val.toString().split(",");
        k.set(parts[0]);
        v.set(Integer.parseInt(parts[1]));
        c.write(k, v);
      }
    }
    public class Top20Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
        float max = 0.0f;
        for (IntWritable v : values)
          if (v.get() > max)
            max = v.get();
        max *= 0.8f;
        for (IntWritable v : values)
          if (v.get() >= max)
            c.write(key, v);
      }
    }
  • 39. Close Enough (Moral)
    • For scalability reasons, the values iterable is single-shot
    • Subsequent iterators iterate over an empty collection
    • Store values (not Writables!) in the first pass
    • Better yet, restructure the logic to avoid storing all values in memory
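A rough sketch of the single-shot behavior and the "store values in the first pass" fix, without Hadoop. `singleShot` is a hypothetical stand-in for the reducer's values iterable: it hands back the same underlying iterator on every call, so a second loop finds it already exhausted. Plain `Integer` values stand in for what `v.get()` would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SingleShotDemo {
    // Hypothetical single-shot iterable: every iterator() call returns the
    // same iterator, so only the first pass sees any elements.
    static <T> Iterable<T> singleShot(List<T> source) {
        Iterator<T> it = source.iterator();
        return () -> it;
    }

    // Mirrors the puzzle: the second loop runs over an exhausted iterator.
    static List<Integer> twoPasses(Iterable<Integer> values) {
        float max = 0.0f;
        for (int v : values) {
            if (v > max) max = v;
        }
        max *= 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            if (v >= max) out.add(v);
        }
        return out;
    }

    // Copies the primitive values during the only pass, then filters the copy.
    // This holds every value in memory, which the moral warns against at scale.
    static List<Integer> storeThenFilter(Iterable<Integer> values) {
        List<Integer> seen = new ArrayList<>();
        int max = 0;
        for (int v : values) {
            seen.add(v);
            if (v > max) max = v;
        }
        float threshold = max * 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : seen) {
            if (v >= threshold) out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(twoPasses(singleShot(Arrays.asList(1, 5, 4))));       // []
        System.out.println(storeThenFilter(singleShot(Arrays.asList(1, 5, 4)))); // [5, 4]
    }
}
```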
  • 40. Overbyte
    public class MinLineMap
        extends Mapper<LongWritable, Text, Text, Text> {
      Text k = new Text();
      protected void map(LongWritable key, Text value, Context c) … {
        String val = value.toString();
        k.set(val.substring(0, 1));
        c.write(k, value);
      }
    }
    public class MinLineReduce
        extends Reducer<Text, Text, Text, IntWritable> {
      protected void reduce(Text key, Iterable<Text> values, Context c) … {
        int min = Integer.MAX_VALUE;
        for (Text v : values)
          if (v.getBytes().length < min)
            min = v.getBytes().length;
        c.write(key, new IntWritable(min));
      }
    }
  • 41. Overbyte
    The data:
      Hadoop
      Spark
      Hive
      Sqoop2
    The results:
      a) H 4
         S 5
      b) H 6
         S 5
      c) H 6
         S 6
      d) The job fails
  • 44. Overbyte (Answer)
    The data:
      Hadoop
      Spark
      Hive
      Sqoop2
    The results:
      a) H 4
         S 5
      b) H 6
         S 5
      c) H 6
         S 6
      d) The job fails
  • 45. Overbyte (Problem)
    public class MinLineMap
        extends Mapper<LongWritable, Text, Text, Text> {
      Text k = new Text();
      protected void map(LongWritable key, Text value, Context c) … {
        String val = value.toString();
        k.set(val.substring(0, 1));
        c.write(k, value);
      }
    }
    public class MinLineReduce
        extends Reducer<Text, Text, Text, IntWritable> {
      protected void reduce(Text key, Iterable<Text> values, Context c) … {
        int min = Integer.MAX_VALUE;
        for (Text v : values)
          if (v.getBytes().length < min)
            min = v.getBytes().length;
        c.write(key, new IntWritable(min));
      }
    }
  • 46. Overbyte (Moral)
    • Writables get reused in loops
    • In addition, Text.getBytes() returns a backing byte array that is reused across calls, so the array can be longer than the current value
    • Net result is wrongness
    • Text.getLength() is the correct way to get the length of a Text
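A minimal sketch of the reuse effect, using a hypothetical `ReusableText` class rather than Hadoop's `Text`. The grow-only buffer policy shown here is an assumption made for illustration; the real class may size its buffer differently, but the point that the array length and the valid length can diverge is the same.

```java
import java.nio.charset.StandardCharsets;

class OverbyteDemo {
    // Wrong: measures the backing array, which never shrinks on reuse.
    static int minByBackingArray(String... lines) {
        ReusableText t = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            t.set(line);
            min = Math.min(min, t.getBytes().length);
        }
        return min;
    }

    // Right: measures only the bytes that belong to the current value.
    static int minByLength(String... lines) {
        ReusableText t = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            t.set(line);
            min = Math.min(min, t.getLength());
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(minByBackingArray("Hadoop", "Hive")); // 6
        System.out.println(minByLength("Hadoop", "Hive"));       // 4
    }
}

// Hypothetical stand-in for a reusable text container (not Hadoop's Text).
class ReusableText {
    private byte[] bytes = new byte[0];
    private int length = 0;

    void set(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        // Assumed policy: grow when needed, never shrink.
        if (bytes.length < encoded.length) {
            bytes = new byte[encoded.length];
        }
        System.arraycopy(encoded, 0, bytes, 0, encoded.length);
        length = encoded.length;
    }

    // Returns the whole backing array; only the first getLength() bytes are valid.
    byte[] getBytes() {
        return bytes;
    }

    int getLength() {
        return length;
    }
}
```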
  • 47. What We Learned
    • Beware of reuse of Writables
    • Always use @Override so your compiler can help you
    • Don’t assume you know what a method does because of the name or parameters – read the docs!
    • Sometimes scalability is inconvenient
  • 48. One Closing Note
    • Hadoop is still not easy
    • Being good takes effort and experience
    • Recognizing Hadoop talent can be hard
    • Cloudera is working to make Hadoop talent easier to recognize through certification
      http://cloudera.com/content/cloudera/en/training/certification.html
  • 49. Aaron Myers & Daniel Templeton