Michael Häusler
ResearchGate
EVERYDAY FLINK
TOPICS
• Big data use cases at ResearchGate
• Frameworks
• Maintainability
• Performance
• Having fun
BIG DATA PROCESSING AT RESEARCHGATE
Early 2011
First Hadoop Use Case
Author Analysis
First version – processes enriching live database
Runtime – few weeks
Author Analysis
Second version – MapReduce on Hadoop cluster
Runtime – few hours (incl. import and export)
Author Analysis
• Major effort – many things to consider
• Smarter algorithms – better clustering
• Efficiency – better runtime complexity
• Distributed computing
• Integration
Author Analysis
• One decision was easy due to limited choice: Frameworks
• Mainly MapReduce
• Hadoop for special use cases
HIVE
Everyone knows SQL
Hadoop for everyone
Hive
• "Real-time" ResearchGate is mostly NoSQL
• Hive brought back analytical SQL
Immense growth on the Hadoop cluster:
• Users
• Data
• Jobs
Thinking about Frameworks
MapReduce
• General purpose language
• Very low level
• Not for everyone
Hive
• Domain specific language
• Very high level
• Not for every use case
Having fun with Hadoop?
Frameworkitis is the disease that a framework
wants to do too much for you or it does it in a
way that you don't want but you can't change it.
Erich Gamma
Having fun with Hadoop?
Simple things should be simple,
complex things should be possible.
Alan Kay
Evaluating Frameworks
Mid 2013
How to Evaluate a Framework?
Obvious criteria
• Features
• Performance & Scalability
• Robustness & Stability
• Maturity & Community
Not so obvious
• Is it fun to solve simple, everyday problems?
Comparing today
Hive (0.14.0, Tez)
MapReduce (Hadoop 2.6, Yarn)
Flink (0.9.1, Yarn)
All inputs and outputs are Avro
Simple Use Case: Top 5 Coauthors
Simple Use Case: Top 5 Coauthors
publication = {
"publicationUid": 7,
"title": "Foo",
"authorships": [
{
"authorUid": 23,
"authorName": "Jane"
},
{
"authorUid": 25,
"authorName": "John"
}
]
}
authorAccountMapping = {
"authorUid": 23,
"accountId": 42
}
(7, 23)
(7, 25)
(7, "AC:42")
(7, "AU:25")
topCoauthorStats = {
"authorKey": "AC:42",
"topCoauthors": [
{
"coauthorKey": "AU:23",
"coauthorCount": 1
}
]
}
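A minimal helper sketch (not from the slides) of the author-key rule the examples imply – an author with an account mapping is keyed by account, everyone else by author uid:

// Hypothetical helper, for illustration only
static String authorKey(Long authorUid, Long accountId) {
    return accountId != null ? "AC:" + accountId : "AU:" + authorUid;
}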
Hive
ADD JAR hdfs:///user/haeusler/hive-udfs-0.19.0.jar;
CREATE TEMPORARY FUNCTION TOP_K as 'net.researchgate.hive.udf.UDAFMaxBy';
DROP TABLE IF EXISTS haeusler.top_coauthors_2015_10_01_avro;
CREATE TABLE haeusler.top_coauthors_2015_10_01_avro STORED AS AVRO AS
SELECT
coauthors.authorKey1 AS authorKey,
TOP_K(
NAMED_STRUCT(
'coauthorKey', coauthors.authorKey2,
'coauthorCount', coauthors.count
),
coauthors.count,
5
) AS topCoAuthors
FROM
(
SELECT
publication_author_keys_1.authorKey AS authorKey1,
publication_author_keys_2.authorKey AS authorKey2,
COUNT(*) AS count
FROM
(
SELECT
pae.publicationUid,
COALESCE(
CONCAT('AC:', aam.accountId),
CONCAT('AU:', pae.authorUid)
) AS authorKey
FROM
(
SELECT
p.publicationUid,
pa.authorship.authorUid
FROM
platform_mongo_refind.publications_2015_10_01_avro p
LATERAL VIEW
EXPLODE(p.authorships) pa AS authorship
WHERE
pa.authorship.authorUid IS NOT NULL
) pae
LEFT OUTER JOIN
platform_mongo_refind.author_account_mappings_2015_10_01_avro aam
ON
pae.authorUid = aam.authorUid
) publication_author_keys_1
JOIN
(
SELECT
pae.publicationUid,
COALESCE(
CONCAT('AC:', aam.accountId),
CONCAT('AU:', pae.authorUid)
) AS authorKey
FROM
(
SELECT
p.publicationUid,
pa.authorship.authorUid
FROM
platform_mongo_refind.publications_2015_10_01_avro p
LATERAL VIEW
EXPLODE(p.authorships) pa AS authorship
WHERE
pa.authorship.authorUid IS NOT NULL
) pae
LEFT OUTER JOIN
platform_mongo_refind.author_account_mappings_2015_10_01_avro aam
ON
pae.authorUid = aam.authorUid
) publication_author_keys_2
ON
publication_author_keys_1.publicationUid = publication_author_keys_2.publicationUid
WHERE
publication_author_keys_1.authorKey <> publication_author_keys_2.authorKey
GROUP BY
publication_author_keys_1.authorKey,
publication_author_keys_2.authorKey
) coauthors
GROUP BY
coauthors.authorKey1;
SELECT
pae.publicationUid,
COALESCE(
CONCAT('AC:', aam.accountId),
CONCAT('AU:', pae.authorUid)
) AS authorKey
FROM
(
SELECT
p.publicationUid,
pa.authorship.authorUid
FROM
publications_2015_10_01_avro p
LATERAL VIEW
EXPLODE(p.authorships) pa AS authorship
WHERE
pa.authorship.authorUid IS NOT NULL
) pae
LEFT OUTER JOIN
author_account_mappings_2015_10_01_avro aam
ON
pae.authorUid = aam.authorUid
CREATE TEMPORARY FUNCTION TOP_K as
'net.researchgate.hive.udf.UDAFMaxBy';
SELECT
coauthors.authorKey1 AS authorKey,
TOP_K(
NAMED_STRUCT(
'coauthorKey', coauthors.authorKey2,
'coauthorCount', coauthors.count
),
coauthors.count,
5
) AS topCoAuthors
Hive
package net.researchgate.authorstats.hive;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantIntObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.IntWritable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
/**
* Returns top n values sorted by keys.
* <p/>
* The output is an array of values
*/
@Description(name = "top_k",
value = "_FUNC_(value, key, n) - Returns the top n values with the maximum keys",
extended = "Example:\n"
+ "> SELECT top_k(value, key, 3) FROM src;\n"
+ "[3, 2, 1]\n"
+ "The return value is an array of values which correspond to the maximum keys"
)
public class UDAFTopK extends AbstractGenericUDAFResolver {
// class static variables
static final Log LOG = LogFactory.getLog(UDAFTopK.class.getName());
private static void ensurePrimitive(int paramIndex, TypeInfo parameter) throws UDFArgumentTypeException {
if (parameter.getCategory() != ObjectInspector.Category.PRIMITIVE) {
throw new UDFArgumentTypeException(paramIndex, "Only primitive type arguments are accepted but "
+ parameter.getTypeName() + " was passed as parameter " + Integer.toString(paramIndex + 1) + ".");
}
}
private static void ensureInt(int paramIndex, TypeInfo parameter) throws UDFArgumentTypeException {
ensurePrimitive(paramIndex, parameter);
PrimitiveTypeInfo pti = (PrimitiveTypeInfo) parameter;
switch (pti.getPrimitiveCategory()) {
case INT:
return;
default:
throw new IllegalStateException("Unhandled primitive");
}
}
private static void ensureNumberOfArguments(int n, TypeInfo[] parameters) throws SemanticException {
if (parameters.length != n) {
throw new UDFArgumentTypeException(parameters.length - 1, "Please specify exactly " + Integer.toString(n) + " arguments.");
}
}
@Override
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
ensureNumberOfArguments(3, parameters);
//argument 0 can be any
ensurePrimitive(1, parameters[1]);
ensureInt(2, parameters[2]);
return new TopKUDAFEvaluator();
}
public static class TopKUDAFEvaluator extends GenericUDAFEvaluator {
static final Log LOG = LogFactory.getLog(TopKUDAFEvaluator.class.getName());
public static class IntermObjectInspector {
public StandardStructObjectInspector topSoi;
public PrimitiveObjectInspector noi;
public StandardListObjectInspector loi;
public StandardStructObjectInspector soi;
public ObjectInspector oiValue;
public PrimitiveObjectInspector oiKey;
public IntermObjectInspector(StandardStructObjectInspector topSoi) throws HiveException {
this.topSoi = topSoi;
this.noi = (PrimitiveObjectInspector) topSoi.getStructFieldRef("n").getFieldObjectInspector();
this.loi = (StandardListObjectInspector) topSoi.getStructFieldRef("data").getFieldObjectInspector();
soi = (StandardStructObjectInspector) loi.getListElementObjectInspector();
oiValue = soi.getStructFieldRef("value").getFieldObjectInspector();
oiKey = (PrimitiveObjectInspector) soi.getStructFieldRef("key").getFieldObjectInspector();
}
}
private transient ObjectInspector oiValue;
private transient PrimitiveObjectInspector oiKey;
private transient IntermObjectInspector ioi;
private transient int topN;
/**
* PARTIAL1: from original data to partial aggregation data:
* iterate() and terminatePartial() will be called.
* <p/>
* <p/>
* PARTIAL2: from partial aggregation data to partial aggregation data:
* merge() and terminatePartial() will be called.
* <p/>
* FINAL: from partial aggregation to full aggregation:
* merge() and terminate() will be called.
* <p/>
* <p/>
* COMPLETE: from original data directly to full aggregation:
* iterate() and terminate() will be called.
*/
private static StandardStructObjectInspector getTerminatePartialOutputType(ObjectInspector oiValueMaybeLazy, PrimitiveObjectInspector oiKeyMaybeLazy) throws HiveException {
StandardListObjectInspector loi = ObjectInspectorFactory.getStandardListObjectInspector(getTerminatePartialOutputElementType(oiValueMaybeLazy, oiKeyMaybeLazy));
PrimitiveObjectInspector oiN = PrimitiveObjectInspectorFactory.writableIntObjectInspector;
ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
foi.add(oiN);
foi.add(loi);
ArrayList<String> fnames = new ArrayList<String>();
fnames.add("n");
fnames.add("data");
return ObjectInspectorFactory.getStandardStructObjectInspector(fnames, foi);
}
private static StandardStructObjectInspector getTerminatePartialOutputElementType(ObjectInspector oiValueMaybeLazy, PrimitiveObjectInspector oiKeyMaybeLazy) throws HiveException {
ObjectInspector oiValue = TypeUtils.makeStrict(oiValueMaybeLazy);
PrimitiveObjectInspector oiKey = TypeUtils.primitiveMakeStrict(oiKeyMaybeLazy);
ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
foi.add(oiValue);
foi.add(oiKey);
ArrayList<String> fnames = new ArrayList<String>();
fnames.add("value");
fnames.add("key");
return ObjectInspectorFactory.getStandardStructObjectInspector(fnames, foi);
}
private static StandardListObjectInspector getCompleteOutputType(IntermObjectInspector ioi) {
return ObjectInspectorFactory.getStandardListObjectInspector(ioi.oiValue);
}
private static int getTopNValue(PrimitiveObjectInspector parameter) throws HiveException {
if (parameter instanceof WritableConstantIntObjectInspector) {
WritableConstantIntObjectInspector nvOI = (WritableConstantIntObjectInspector) parameter;
int numTop = nvOI.getWritableConstantValue().get();
return numTop;
} else {
throw new HiveException("The third parameter: number of max values returned must be a constant int but the parameter was of type " + parameter.getClass().getName());
}
}
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
super.init(m, parameters);
if (m == Mode.PARTIAL1) {
//for iterate
assert (parameters.length == 3);
oiValue = parameters[0];
oiKey = (PrimitiveObjectInspector) parameters[1];
topN = getTopNValue((PrimitiveObjectInspector) parameters[2]);
//create type R = list(struct(keyType,valueType))
ioi = new IntermObjectInspector(getTerminatePartialOutputType(oiValue, oiKey));
//for terminate partial
return ioi.topSoi;//call this type R
} else if (m == Mode.PARTIAL2) {
ioi = new IntermObjectInspector((StandardStructObjectInspector) parameters[0]); //type R (see above)
//for merge and terminate partial
return ioi.topSoi;//type R
} else if (m == Mode.COMPLETE) {
assert (parameters.length == 3);
//for iterate
oiValue = parameters[0];
oiKey = (PrimitiveObjectInspector) parameters[1];
topN = getTopNValue((PrimitiveObjectInspector) parameters[2]);
ioi = new IntermObjectInspector(getTerminatePartialOutputType(oiValue, oiKey));//type R (see above)
//for terminate
return getCompleteOutputType(ioi);
} else if (m == Mode.FINAL) {
//for merge
ioi = new IntermObjectInspector((StandardStructObjectInspector) parameters[0]); //type R (see above)
//for terminate
//type O = list(valueType)
return getCompleteOutputType(ioi);
}
throw new IllegalStateException("Unknown mode");
}
@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
StdAgg stdAgg = (StdAgg) agg;
return stdAgg.serialize(ioi);
}
@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
StdAgg stdAgg = (StdAgg) agg;
if (stdAgg == null) {
return null;
}
return stdAgg.terminate(ioi.oiKey);
}
@Override
public void merge(AggregationBuffer agg, Object partial) throws HiveException {
if (partial == null) {
return;
}
StdAgg stdAgg = (StdAgg) agg;
stdAgg.merge(ioi, partial);
}
@Override
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
assert (parameters.length == 3);
if (parameters[0] == null || parameters[1] == null || parameters[2] == null) {
return;
}
StdAgg stdAgg = (StdAgg) agg;
stdAgg.setTopN(topN);
stdAgg.add(parameters, oiValue, oiKey, ioi);
}
// Aggregation buffer definition and manipulation methods
@AggregationType(estimable = false)
static class StdAgg extends AbstractAggregationBuffer {
public static class KeyValuePair {
public Object key;
public Object value;
public KeyValuePair(Object key, Object value) {
this.key = key;
this.value = value;
}
}
public static class KeyValueComparator implements Comparator<KeyValuePair> {
public PrimitiveObjectInspector getKeyObjectInspector() {
return keyObjectInspector;
}
public void setKeyObjectInspector(PrimitiveObjectInspector keyObjectInspector) {
this.keyObjectInspector = keyObjectInspector;
}
PrimitiveObjectInspector keyObjectInspector;
@Override
public int compare(KeyValuePair o1, KeyValuePair o2) {
if (keyObjectInspector == null) {
throw new IllegalStateException("Key object inspector has to be initialized.");
}
//the heap will store the min element on top
return ObjectInspectorUtils.compare(o1.key, keyObjectInspector, o2.key, keyObjectInspector);
}
}
public PriorityQueue<KeyValuePair> queue;
int topN;
public void setTopN(int topN) {
this.topN = topN;
}
public int getTopN() {
return topN;
}
public void reset() {
queue = new PriorityQueue<KeyValuePair>(10, new KeyValueComparator());
}
public void add(Object[] parameters, ObjectInspector oiValue, PrimitiveObjectInspector oiKey, IntermObjectInspector ioi) {
assert (parameters.length == 3);
Object paramValue = parameters[0];
Object paramKey = parameters[1];
if (paramValue == null || paramKey == null) {
return;
}
Object stdValue = ObjectInspectorUtils.copyToStandardObject(paramValue, oiValue, ObjectInspectorUtils.ObjectInspectorCopyOption.WRITABLE);
Object stdKey = ObjectInspectorUtils.copyToStandardObject(paramKey, oiKey, ObjectInspectorUtils.ObjectInspectorCopyOption.WRITABLE);
addToQueue(stdKey, stdValue, ioi.oiKey);
}
public void addToQueue(Object key, Object value, PrimitiveObjectInspector oiKey) {
final PrimitiveObjectInspector keyObjectInspector = oiKey;
KeyValueComparator comparator = ((KeyValueComparator) queue.comparator());
comparator.setKeyObjectInspector(keyObjectInspector);
queue.add(new KeyValuePair(key, value));
if (queue.size() > topN) {
queue.remove();
}
comparator.setKeyObjectInspector(null);
}
private KeyValuePair[] copyQueueToArray() {
int n = queue.size();
KeyValuePair[] buffer = new KeyValuePair[n];
int i = 0;
for (KeyValuePair pair : queue) {
buffer[i] = pair;
i++;
}
return buffer;
}
public List<Object> terminate(final PrimitiveObjectInspector keyObjectInspector) {
KeyValuePair[] buffer = copyQueueToArray();
Arrays.sort(buffer, new Comparator<KeyValuePair>() {
public int compare(KeyValuePair o1, KeyValuePair o2) {
return ObjectInspectorUtils.compare(o2.key, keyObjectInspector, o1.key, keyObjectInspector);
}
});
//copy the values to ArrayList
ArrayList<Object> result = new ArrayList<Object>();
for (int j = 0; j < buffer.length; j++) {
result.add(buffer[j].value);
}
return result;
}
public Object serialize(IntermObjectInspector ioi) {
StandardStructObjectInspector topLevelSoi = ioi.topSoi;
Object topLevelObj = topLevelSoi.create();
StandardListObjectInspector loi = ioi.loi;
StandardStructObjectInspector soi = ioi.soi;
int n = queue.size();
Object loiObj = loi.create(n);
int i = 0;
for (KeyValuePair pair : queue) {
Object soiObj = soi.create();
soi.setStructFieldData(soiObj, soi.getStructFieldRef("value"), pair.value);
soi.setStructFieldData(soiObj, soi.getStructFieldRef("key"), pair.key);
loi.set(loiObj, i, soiObj);
i += 1;
}
topLevelSoi.setStructFieldData(topLevelObj, topLevelSoi.getStructFieldRef("n"), new IntWritable(topN));
topLevelSoi.setStructFieldData(topLevelObj, topLevelSoi.getStructFieldRef("data"), loiObj);
return topLevelObj;
}
public void merge(IntermObjectInspector ioi, Object partial) {
List<Object> nestedValues = ioi.topSoi.getStructFieldsDataAsList(partial);
topN = (Integer) (ioi.noi.getPrimitiveJavaObject(nestedValues.get(0)));
StandardListObjectInspector loi = ioi.loi;
StandardStructObjectInspector soi = ioi.soi;
PrimitiveObjectInspector oiKey = ioi.oiKey;
Object data = nestedValues.get(1);
int n = loi.getListLength(data);
int i = 0;
while (i < n) {
Object sValue = loi.getListElement(data, i);
List<Object> innerValues = soi.getStructFieldsDataAsList(sValue);
Object primValue = innerValues.get(0);
Object primKey = innerValues.get(1);
addToQueue(primKey, primValue, oiKey);
i += 1;
}
}
}
@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
StdAgg result = new StdAgg();
reset(result);
return result;
}
@Override
public void reset(AggregationBuffer agg) throws HiveException {
StdAgg stdAgg = (StdAgg) agg;
stdAgg.reset();
}
}
}
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
super.init(m, parameters);
if (m == Mode.PARTIAL1) {
//for iterate
assert (parameters.length == 3);
oiValue = parameters[0];
oiKey = (PrimitiveObjectInspector) parameters[1];
topN = getTopNValue((PrimitiveObjectInspector) parameters[2]);
//create type R = list(struct(keyType,valueType))
ioi = new IntermObjectInspector(getTerminatePartialOutputType(oiValue, oiKey));
//for terminate partial
return ioi.topSoi;//call this type R
} else if (m == Mode.PARTIAL2) {
ioi = new IntermObjectInspector((StandardStructObjectInspector) parameters[0]);
//type R (see above)
//for merge and terminate partial
return ioi.topSoi;//type R
} else if (m == Mode.COMPLETE) {
assert (parameters.length == 3);
//for iterate
oiValue = parameters[0];
oiKey = (PrimitiveObjectInspector) parameters[1];
topN = getTopNValue((PrimitiveObjectInspector) parameters[2]);
ioi = new IntermObjectInspector(getTerminatePartialOutputType(oiValue,
oiKey));//type R (see above)
//for terminate
return getCompleteOutputType(ioi);
} else if (m == Mode.FINAL) {
//for merge
ioi = new IntermObjectInspector((StandardStructObjectInspector) parameters[0]);
//type R (see above)
//for terminate
//type O = list(valueType)
return getCompleteOutputType(ioi);
}
throw new IllegalStateException("Unknown mode");
}
Hive
• Join and group by are easy
• Common subexpressions are not optimized
• Dealing with denormalized data can be tricky
• UDFs are implemented at a low level and need to be deployed
• UDAFs (aggregation functions) require expert knowledge
• Dealing with generic UDFs is no fun
MapReduce
// Custom key comparators needed for secondary sorting with Avro keys:
// one compare() works directly on the serialized bytes, the other on deserialized datums.
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return BinaryData.compare(b1, s1, l1, b2, s2, l2, pair);
}

public int compare(AvroKey<Pair<Long, Long>> x, AvroKey<Pair<Long, Long>> y) {
    return ReflectData.get().compare(x.datum(), y.datum(), pair);
}
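For context, a minimal sketch (assumptions, not from the talk) of the driver wiring such a comparator belongs to – secondary sort in plain Hadoop MapReduce touches four Job settings; all class names here are hypothetical:

import org.apache.hadoop.mapreduce.Job;

// Hypothetical classes: CoauthorKey (composite key), AuthorKeyPartitioner,
// FullKeyComparator, AuthorKeyGroupingComparator
static void configureSecondarySort(Job job) {
    job.setMapOutputKeyClass(CoauthorKey.class);                        // composite key (authorKey, count)
    job.setPartitionerClass(AuthorKeyPartitioner.class);                // partition by authorKey only
    job.setSortComparatorClass(FullKeyComparator.class);                // sort by the full composite key
    job.setGroupingComparatorClass(AuthorKeyGroupingComparator.class);  // group reducer input by authorKey
}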
MapReduce
• Pure map and reduce is very restrictive
• Map-side joins require knowledge of the distributed cache (see the sketch below)
• Both map- and reduce-side joins require assumptions about data sizes
• Constant type juggling
• Hard to glue together
• Hard to test
• Implementing secondary sorting in an AvroMapper is no fun
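To illustrate the distributed-cache point above, a rough sketch (not from the talk; path and class names are made up) of a map-side join in the Hadoop 2 API – the small author-account mapping is shipped to every mapper and loaded in setup():

import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinSketch {

    // Driver side: ship the small join side to every mapper (hypothetical path)
    static void configure(Job job) throws Exception {
        job.addCacheFile(new URI("hdfs:///data/author_account_mappings/part-00000"));
    }

    // Mapper side: load the cached mapping once per task, join against it in map()
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<Long, Long> authorToAccount = new HashMap<Long, Long>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            for (URI cached : context.getCacheFiles()) {
                // parse each cached file into authorToAccount (parsing omitted)
            }
        }
    }
}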
Flink
public static void buildJob(ExecutionEnvironment env,
DataSet<Publication> publications,
DataSet<AuthorAccountMapping> authorAccountMappings,
OutputFormat<TopCoauthorStats> topCoauthorStats) {
publications.flatMap(new FlatMapFunction<Publication, Tuple2<Long, Long>>() {
@Override
public void flatMap(Publication publication, Collector<Tuple2<Long, Long>> publicationAuthors) throws Exception {
if (publication.getAuthorships() == null) {
return;
}
for (Authorship authorship : publication.getAuthorships()) {
if (authorship.getAuthorUid() == null) {
continue;
}
publicationAuthors.collect(new Tuple2<>(publication.getPublicationUid(), authorship.getAuthorUid()));
}
}
}).coGroup(authorAccountMappings).where(1)
// ...
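The excerpt ends mid-pipeline. As a rough sketch – not the speaker's actual code – the remaining steps could look like this in the Flink 0.9 DataSet API (the DataSet name publicationAuthorUids and all field positions are assumptions for illustration):

// assume the flatMap result above is bound to
// DataSet<Tuple2<Long, Long>> publicationAuthorUids   // (publicationUid, authorUid)
DataSet<Tuple2<Long, String>> publicationAuthorKeys = publicationAuthorUids
    .coGroup(authorAccountMappings)
    .where(1).equalTo("authorUid")
    .with(new CoGroupFunction<Tuple2<Long, Long>, AuthorAccountMapping, Tuple2<Long, String>>() {
        @Override
        public void coGroup(Iterable<Tuple2<Long, Long>> pubAuthors,
                            Iterable<AuthorAccountMapping> mappings,
                            Collector<Tuple2<Long, String>> out) {
            // emit (publicationUid, "AC:<accountId>") when a mapping exists,
            // otherwise (publicationUid, "AU:<authorUid>") – body omitted
        }
    });

DataSet<Tuple3<String, String, Long>> coauthorCounts = publicationAuthorKeys
    .join(publicationAuthorKeys)
    .where(0).equalTo(0)  // same publication
    .with(new FlatJoinFunction<Tuple2<Long, String>, Tuple2<Long, String>, Tuple3<String, String, Long>>() {
        @Override
        public void join(Tuple2<Long, String> a, Tuple2<Long, String> b,
                         Collector<Tuple3<String, String, Long>> out) {
            if (!a.f1.equals(b.f1)) {  // exclude self-pairs
                out.collect(new Tuple3<String, String, Long>(a.f1, b.f1, 1L));
            }
        }
    })
    .groupBy(0, 1).sum(2);  // count coauthorships per (authorKey, coauthorKey)

coauthorCounts
    .groupBy(0)
    .sortGroup(2, Order.DESCENDING)
    .reduceGroup(new GroupReduceFunction<Tuple3<String, String, Long>, TopCoauthorStats>() {
        @Override
        public void reduce(Iterable<Tuple3<String, String, Long>> values, Collector<TopCoauthorStats> out) {
            // take the first 5 entries of the descending-sorted group and
            // assemble a TopCoauthorStats record – body omitted
        }
    })
    .output(topCoauthorStats);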
Flink for Simple Use Cases
• Fluent API
• Rich set of transformations
• Support for Tuples and POJOs
• With some discipline, separation of business logic is possible
• Java 7 API still requires some boilerplate
• No elasticity yet
• Fastest and most fun to implement
Performance Comparison
[Chart: execution time (0:00:00 to 0:36:00) of the top-5 coauthors job for the two scales labeled 50 and 100, comparing Hive (Tez), MapReduce, and Flink]
Performance and fun – every day
QUESTIONS

Contenu connexe

Tendances

Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
Dan Lynn
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
Holden Karau
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
P. Taylor Goetz
 

Tendances (20)

Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
Predictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonPredictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with Strymon
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Processing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelProcessing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) Pregel
 
Stream Processing Frameworks
Stream Processing FrameworksStream Processing Frameworks
Stream Processing Frameworks
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at...
 Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at... Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at...
Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at...
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
Storm
StormStorm
Storm
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
 
Introduction to Apache Storm
Introduction to Apache StormIntroduction to Apache Storm
Introduction to Apache Storm
 
Distributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache StormDistributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache Storm
 
Flux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul DixFlux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul Dix
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 

En vedette

Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Flink Forward
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Flink Forward
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Flink Forward
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
Flink Forward
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Flink Forward
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Flink Forward
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Flink Forward
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?
Flink Forward
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Flink Forward
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 

En vedette (20)

Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and Storms
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and Bytes
 

Similaire à Michael Häusler – Everyday flink

CouchDB on Android
CouchDB on AndroidCouchDB on Android
CouchDB on Android
Sven Haiges
 
Refactoring In Tdd The Missing Part
Refactoring In Tdd The Missing PartRefactoring In Tdd The Missing Part
Refactoring In Tdd The Missing Part
Gabriele Lana
 
Developing Useful APIs
Developing Useful APIsDeveloping Useful APIs
Developing Useful APIs
Dmitry Buzdin
 
Open XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand DechouxOpen XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand Dechoux
Publicis Sapient Engineering
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
Koichi Fujikawa
 

Similaire à Michael Häusler – Everyday flink (20)

CouchDB on Android
CouchDB on AndroidCouchDB on Android
CouchDB on Android
 
Refactoring In Tdd The Missing Part
Refactoring In Tdd The Missing PartRefactoring In Tdd The Missing Part
Refactoring In Tdd The Missing Part
 
Apache Cassandra and Go
Apache Cassandra and GoApache Cassandra and Go
Apache Cassandra and Go
 
Developing Useful APIs
Developing Useful APIsDeveloping Useful APIs
Developing Useful APIs
 
Cassandra rapid prototyping with achilles
Cassandra rapid prototyping with achillesCassandra rapid prototyping with achilles
Cassandra rapid prototyping with achilles
 
Paris js extensions
Paris js extensionsParis js extensions
Paris js extensions
 
Android dev toolbox
Android dev toolboxAndroid dev toolbox
Android dev toolbox
 
CouchDB-Lucene
CouchDB-LuceneCouchDB-Lucene
CouchDB-Lucene
 
TechDays 2016 - Developing websites using asp.net core mvc6 and entity framew...
TechDays 2016 - Developing websites using asp.net core mvc6 and entity framew...TechDays 2016 - Developing websites using asp.net core mvc6 and entity framew...
TechDays 2016 - Developing websites using asp.net core mvc6 and entity framew...
 
Getting the most out of Java [Nordic Coding-2010]
Getting the most out of Java [Nordic Coding-2010]Getting the most out of Java [Nordic Coding-2010]
Getting the most out of Java [Nordic Coding-2010]
 
C# 6 and 7 and Futures 20180607
C# 6 and 7 and Futures 20180607C# 6 and 7 and Futures 20180607
C# 6 and 7 and Futures 20180607
 
Open XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand DechouxOpen XKE - Big Data, Big Mess par Bertrand Dechoux
Open XKE - Big Data, Big Mess par Bertrand Dechoux
 
Sencha Roadshow 2017: Modernizing the Ext JS Class System and Tooling
Sencha Roadshow 2017: Modernizing the Ext JS Class System and ToolingSencha Roadshow 2017: Modernizing the Ext JS Class System and Tooling
Sencha Roadshow 2017: Modernizing the Ext JS Class System and Tooling
 
jQuery introduction
jQuery introductionjQuery introduction
jQuery introduction
 
外部環境への依存をテストする
外部環境への依存をテストする外部環境への依存をテストする
外部環境への依存をテストする
 
Anti patterns
Anti patternsAnti patterns
Anti patterns
 
Annotation processing in android
Annotation processing in androidAnnotation processing in android
Annotation processing in android
 
Akka with Scala
Akka with ScalaAkka with Scala
Akka with Scala
 
Clean coding-practices
Clean coding-practicesClean coding-practices
Clean coding-practices
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 

Plus de Flink Forward

Plus de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Dernier (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Michael Häusler – Everyday flink

  • 2. • Big data use cases at ResearchGate • Frameworks • Maintainability • Performance • Having fun TOPICS
  • 3.
  • 4.
  • 5.
  • 6. BIG DATA PROCESSING AT RESEARCHGATE Early 2011 First Hadoop Use Case Author Analysis
  • 9. Author Analysis First version – processes enriching live database Runtime – few weeks
  • 10. Author Analysis Second version – MapReduce on Hadoop cluster Runtime – few hours (incl. import and export)
  • 11. Author Analysis • Major effort – many things to consider • Smarter algorithms – better clustering • Efficiency – better runtime complexity • Distributed computing • Integration
  • 12. Author Analysis • One decision was easy due to limited choice: Frameworks • Mainly MapReduce • Hadoop for special use cases
  • 14. Hive • „Realtime“ ResearchGate is mostly NoSQL • Hive brought back analytical SQL Immense growth on the Hadoop cluster: • Users • Data • Jobs
  • 15. Thinking about Frameworks MapReduce • General purpose language • Very low level • Not for everyone Hive • Domain specific language • Very high level • Not for every use case
  • 16. Having fun with Hadoop? Frameworkitis is the disease that a framework wants to do too much for you or it does it in a way that you don't want but you can't change it. Erich Gamma
  • 17. Having fun with Hadoop? Simple things should be simple, complex things should be possible. Alan Kay
  • 19. How to Evaluate a Framework? Obvious criteria • Features • Performance & Scalability • Robustness & Stability • Maturity & Community Not so obvious • Is it fun to solve simple, everyday problems?
  • 20. Comparing today Hive (0.14.0, Tez) MapReduce (Hadoop 2.6, Yarn) Flink (0.9.1, Yarn) All inputs and outputs are Avro
  • 21. Simple Use Case: Top 5 Coauthors
  • 22. Simple Use Case: Top 5 Coauthors publication = { "publicationUid": 7, "title": "Foo", "authorships": [ { "authorUid": 23, "authorName": "Jane" }, { "authorUid": 25, "authorName": "John" } ] } authorAccountMapping = { "authorUid": 23, "accountId": 42 } (7, 23) (7, 25) (7, "AC:42") (7, "AU:25") topCoauthorStats = { "authorKey": "AC:42", "topCoauthors": [ { "coauthorKey": "AU:23", "coauthorCount": 1 } ] }
  • 23. Hive ADD JAR hdfs:///user/haeusler/hive-udfs-0.19.0.jar; CREATE TEMPORARY FUNCTION TOP_K as 'net.researchgate.hive.udf.UDAFMaxBy'; DROP TABLE IF EXISTS haeusler.top_coauthors_2015_10_01_avro; CREATE TABLE haeusler.top_coauthors_2015_10_01_avro STORED AS AVRO AS SELECT coauthors.authorKey1 AS authorKey, TOP_K( NAMED_STRUCT( 'coauthorKey', coauthors.authorKey2, 'coauthorCount', coauthors.count ), coauthors.count, 5 ) AS topCoAuthors FROM ( SELECT publication_author_keys_1.authorKey AS authorKey1, publication_author_keys_2.authorKey AS authorKey2, COUNT(*) AS count FROM ( SELECT pae.publicationUid, COALESCE( CONCAT('AC:', aam.accountId), CONCAT('AU:', pae.authorUid) ) AS authorKey FROM ( SELECT p.publicationUid, pa.authorship.authorUid FROM platform_mongo_refind.publications_2015_10_01_avro p LATERAL VIEW EXPLODE(p.authorships) pa AS authorship WHERE pa.authorship.authorUid IS NOT NULL ) pae LEFT OUTER JOIN platform_mongo_refind.author_account_mappings_2015_10_01_avro aam ON pae.authorUid = aam.authorUid ) publication_author_keys_1 JOIN ( SELECT pae.publicationUid, COALESCE( CONCAT('account:', aam.accountId), CONCAT('author:', pae.authorUid) ) AS authorKey FROM ( SELECT p.publicationUid, pa.authorship.authorUid FROM platform_mongo_refind.publications_2015_10_01_avro p LATERAL VIEW EXPLODE(p.authorships) pa AS authorship WHERE pa.authorship.authorUid IS NOT NULL ) pae LEFT OUTER JOIN platform_mongo_refind.author_account_mappings_2015_10_01_avro aam ON pae.authorUid = aam.authorUid ) publication_author_keys_2 ON publication_author_keys_1.publicationUid = publication_author_keys_2.publicationUid WHERE publication_author_keys_1.authorKey <> publication_author_keys_2.authorKey GROUP BY publication_author_keys_1.authorKey, publication_author_keys_2.authorKey ) coauthors GROUP BY coauthors.authorKey1; SELECT pae.publicationUid, COALESCE( CONCAT('AC:', aam.accountId), CONCAT('AU:', pae.authorUid) ) AS authorKey FROM ( SELECT p.publicationUid, pa.authorship.authorUid FROM publications_2015_10_01_avro p LATERAL VIEW EXPLODE(p.authorships) pa AS authorship WHERE pa.authorship.authorUid IS NOT NULL ) pae LEFT OUTER JOIN author_account_mappings_2015_10_01_avro aam ON pae.authorUid = aam.authorUid CREATE TEMPORARY FUNCTION TOP_K as 'net.researchgate.hive.udf.UDAFMaxBy'; SELECT coauthors.authorKey1 AS authorKey, TOP_K( NAMED_STRUCT( 'coauthorKey', coauthors.authorKey2, 'coauthorCount', coauthors.count ), coauthors.count, 5 ) AS topCoAuthors
  • 24. Hive package net.researchgate.authorstats.hive; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException; import org.apache.hadoop.hive.ql.metadata.HiveException; import org.apache.hadoop.hive.ql.parse.SemanticException; import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver; import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator; import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory; import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils; import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector; import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector; import org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector; import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantIntObjectInspector; import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo; import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo; import org.apache.hadoop.io.IntWritable; import java.util.ArrayList; import java.util.Arrays; import java.util.Comparator; import java.util.List; import java.util.PriorityQueue; /** * Returns top n values sorted by keys. * <p/> * The output is an array of values */ @Description(name = "top_k", value = "_FUNC_(value, key, n) - Returns the top n values with the maximum keys", extended = "Example:n" + "> SELECT top_k(value, key, 3) FROM src;n" + "[3, 2, 1]n" + "The return value is an array of values which correspond to the maximum keys" ) public class UDAFTopK extends AbstractGenericUDAFResolver { // class static variables static final Log LOG = LogFactory.getLog(UDAFTopK.class.getName()); private static void ensurePrimitive(int paramIndex, TypeInfo parameter) throws UDFArgumentTypeException { if (parameter.getCategory() != ObjectInspector.Category.PRIMITIVE) { throw new UDFArgumentTypeException(paramIndex, "Only primitive type arguments are accepted but " + parameter.getTypeName() + " was passed as parameter " + Integer.toString(paramIndex + 1) + "."); } } private static void ensureInt(int paramIndex, TypeInfo parameter) throws UDFArgumentTypeException { ensurePrimitive(paramIndex, parameter); PrimitiveTypeInfo pti = (PrimitiveTypeInfo) parameter; switch (pti.getPrimitiveCategory()) { case INT: return; default: throw new IllegalStateException("Unhandled primitive"); } } private static void ensureNumberOfArguments(int n, TypeInfo[] parameters) throws SemanticException { if (parameters.length != n) { throw new UDFArgumentTypeException(parameters.length - 1, "Please specify exactly " + Integer.toString(n) + " arguments."); } } @Override public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException { ensureNumberOfArguments(3, parameters); //argument 0 can be any ensurePrimitive(1, parameters[1]); ensureInt(2, parameters[2]); return new TopKUDAFEvaluator(); } public static class TopKUDAFEvaluator extends GenericUDAFEvaluator { static final Log LOG = LogFactory.getLog(TopKUDAFEvaluator.class.getName()); public static class IntermObjectInspector { public StandardStructObjectInspector topSoi; public PrimitiveObjectInspector noi; public StandardListObjectInspector loi; public 
StandardStructObjectInspector soi; public ObjectInspector oiValue; public PrimitiveObjectInspector oiKey; public IntermObjectInspector(StandardStructObjectInspector topSoi) throws HiveException { this.topSoi = topSoi; this.noi = (PrimitiveObjectInspector) topSoi.getStructFieldRef("n").getFieldObjectInspector(); this.loi = (StandardListObjectInspector) topSoi.getStructFieldRef("data").getFieldObjectInspector(); soi = (StandardStructObjectInspector) loi.getListElementObjectInspector(); oiValue = soi.getStructFieldRef("value").getFieldObjectInspector(); oiKey = (PrimitiveObjectInspector) soi.getStructFieldRef("key").getFieldObjectInspector(); } } private transient ObjectInspector oiValue; private transient PrimitiveObjectInspector oiKey; private transient IntermObjectInspector ioi; private transient int topN; /** * PARTIAL1: from original data to partial aggregation data: * iterate() and terminatePartial() will be called. * <p/> * <p/> * PARTIAL2: from partial aggregation data to partial aggregation data: * merge() and terminatePartial() will be called. * <p/> * FINAL: from partial aggregation to full aggregation: * merge() and terminate() will be called. * <p/> * <p/> * COMPLETE: from original data directly to full aggregation: * iterate() and terminate() will be called. */ private static StandardStructObjectInspector getTerminatePartialOutputType(ObjectInspector oiValueMaybeLazy, PrimitiveObjectInspector oiKeyMaybeLazy) throws HiveException { StandardListObjectInspector loi = ObjectInspectorFactory.getStandardListObjectInspector(getTerminatePartialOutputElementType(oiValueMaybeLazy, oiKeyMaybeLazy)); PrimitiveObjectInspector oiN = PrimitiveObjectInspectorFactory.writableIntObjectInspector; ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>(); foi.add(oiN); foi.add(loi); ArrayList<String> fnames = new ArrayList<String>(); fnames.add("n"); fnames.add("data"); return ObjectInspectorFactory.getStandardStructObjectInspector(fnames, foi); } private static StandardStructObjectInspector getTerminatePartialOutputElementType(ObjectInspector oiValueMaybeLazy, PrimitiveObjectInspector oiKeyMaybeLazy) throws HiveException { ObjectInspector oiValue = TypeUtils.makeStrict(oiValueMaybeLazy); PrimitiveObjectInspector oiKey = TypeUtils.primitiveMakeStrict(oiKeyMaybeLazy); ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>(); foi.add(oiValue); foi.add(oiKey); ArrayList<String> fnames = new ArrayList<String>(); fnames.add("value"); fnames.add("key"); return ObjectInspectorFactory.getStandardStructObjectInspector(fnames, foi); } private static StandardListObjectInspector getCompleteOutputType(IntermObjectInspector ioi) { return ObjectInspectorFactory.getStandardListObjectInspector(ioi.oiValue); } private static int getTopNValue(PrimitiveObjectInspector parameter) throws HiveException { if (parameter instanceof WritableConstantIntObjectInspector) { WritableConstantIntObjectInspector nvOI = (WritableConstantIntObjectInspector) parameter; int numTop = nvOI.getWritableConstantValue().get(); return numTop; } else { throw new HiveException("The third parameter: number of max values returned must be a constant int but the parameter was of type " + parameter.getClass().getName()); } } @Override public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException { super.init(m, parameters); if (m == Mode.PARTIAL1) { //for iterate assert (parameters.length == 3); oiValue = parameters[0]; oiKey = (PrimitiveObjectInspector) parameters[1]; topN = 
getTopNValue((PrimitiveObjectInspector) parameters[2]); //create type R = list(struct(keyType,valueType)) ioi = new IntermObjectInspector(getTerminatePartialOutputType(oiValue, oiKey)); //for terminate partial return ioi.topSoi;//call this type R } else if (m == Mode.PARTIAL2) { ioi = new IntermObjectInspector((StandardStructObjectInspector) parameters[0]); //type R (see above) //for merge and terminate partial return ioi.topSoi;//type R } else if (m == Mode.COMPLETE) { assert (parameters.length == 3); //for iterate oiValue = parameters[0]; oiKey = (PrimitiveObjectInspector) parameters[1]; topN = getTopNValue((PrimitiveObjectInspector) parameters[2]); ioi = new IntermObjectInspector(getTerminatePartialOutputType(oiValue, oiKey));//type R (see above) //for terminate return getCompleteOutputType(ioi); } else if (m == Mode.FINAL) { //for merge ioi = new IntermObjectInspector((StandardStructObjectInspector) parameters[0]); //type R (see above) //for terminate //type O = list(valueType) return getCompleteOutputType(ioi); } throw new IllegalStateException("Unknown mode"); } @Override public Object terminatePartial(AggregationBuffer agg) throws HiveException { StdAgg stdAgg = (StdAgg) agg; return stdAgg.serialize(ioi); } @Override public Object terminate(AggregationBuffer agg) throws HiveException { StdAgg stdAgg = (StdAgg) agg; if (stdAgg == null) { return null; } return stdAgg.terminate(ioi.oiKey); } @Override public void merge(AggregationBuffer agg, Object partial) throws HiveException { if (partial == null) { return; } StdAgg stdAgg = (StdAgg) agg; stdAgg.merge(ioi, partial); } @Override public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException { assert (parameters.length == 3); if (parameters[0] == null || parameters[1] == null || parameters[2] == null) { return; } StdAgg stdAgg = (StdAgg) agg; stdAgg.setTopN(topN); stdAgg.add(parameters, oiValue, oiKey, ioi); } // Aggregation buffer definition and manipulation methods @AggregationType(estimable = false) static class StdAgg extends AbstractAggregationBuffer { public static class KeyValuePair { public Object key; public Object value; public KeyValuePair(Object key, Object value) { this.key = key; this.value = value; } } public static class KeyValueComparator implements Comparator<KeyValuePair> { public PrimitiveObjectInspector getKeyObjectInspector() { return keyObjectInspector; } public void setKeyObjectInspector(PrimitiveObjectInspector keyObjectInspector) { this.keyObjectInspector = keyObjectInspector; } PrimitiveObjectInspector keyObjectInspector; @Override public int compare(KeyValuePair o1, KeyValuePair o2) { if (keyObjectInspector == null) { throw new IllegalStateException("Key object inspector has to be initialized."); } //the heap will store the min element on top return ObjectInspectorUtils.compare(o1.key, keyObjectInspector, o2.key, keyObjectInspector); } } public PriorityQueue<KeyValuePair> queue; int topN; public void setTopN(int topN) { this.topN = topN; } public int getTopN() { return topN; } public void reset() { queue = new PriorityQueue<KeyValuePair>(10, new KeyValueComparator()); } public void add(Object[] parameters, ObjectInspector oiValue, PrimitiveObjectInspector oiKey, IntermObjectInspector ioi) { assert (parameters.length == 3); Object paramValue = parameters[0]; Object paramKey = parameters[1]; if (paramValue == null || paramKey == null) { return; } Object stdValue = ObjectInspectorUtils.copyToStandardObject(paramValue, oiValue, ObjectInspectorUtils.ObjectInspectorCopyOption.WRITABLE); 
Object stdKey = ObjectInspectorUtils.copyToStandardObject(paramKey, oiKey, ObjectInspectorUtils.ObjectInspectorCopyOption.WRITABLE); addToQueue(stdKey, stdValue, ioi.oiKey); } public void addToQueue(Object key, Object value, PrimitiveObjectInspector oiKey) { final PrimitiveObjectInspector keyObjectInspector = oiKey; KeyValueComparator comparator = ((KeyValueComparator) queue.comparator()); comparator.setKeyObjectInspector(keyObjectInspector); queue.add(new KeyValuePair(key, value)); if (queue.size() > topN) { queue.remove(); } comparator.setKeyObjectInspector(null); } private KeyValuePair[] copyQueueToArray() { int n = queue.size(); KeyValuePair[] buffer = new KeyValuePair[n]; int i = 0; for (KeyValuePair pair : queue) { buffer[i] = pair; i++; } return buffer; } public List<Object> terminate(final PrimitiveObjectInspector keyObjectInspector) { KeyValuePair[] buffer = copyQueueToArray(); Arrays.sort(buffer, new Comparator<KeyValuePair>() { public int compare(KeyValuePair o1, KeyValuePair o2) { return ObjectInspectorUtils.compare(o2.key, keyObjectInspector, o1.key, keyObjectInspector); } }); //copy the values to ArrayList ArrayList<Object> result = new ArrayList<Object>(); for (int j = 0; j < buffer.length; j++) { result.add(buffer[j].value); } return result; } public Object serialize(IntermObjectInspector ioi) { StandardStructObjectInspector topLevelSoi = ioi.topSoi; Object topLevelObj = topLevelSoi.create(); StandardListObjectInspector loi = ioi.loi; StandardStructObjectInspector soi = ioi.soi; int n = queue.size(); Object loiObj = loi.create(n); int i = 0; for (KeyValuePair pair : queue) { Object soiObj = soi.create(); soi.setStructFieldData(soiObj, soi.getStructFieldRef("value"), pair.value); soi.setStructFieldData(soiObj, soi.getStructFieldRef("key"), pair.key); loi.set(loiObj, i, soiObj); i += 1; } topLevelSoi.setStructFieldData(topLevelObj, topLevelSoi.getStructFieldRef("n"), new IntWritable(topN)); topLevelSoi.setStructFieldData(topLevelObj, topLevelSoi.getStructFieldRef("data"), loiObj); return topLevelObj; } public void merge(IntermObjectInspector ioi, Object partial) { List<Object> nestedValues = ioi.topSoi.getStructFieldsDataAsList(partial); topN = (Integer) (ioi.noi.getPrimitiveJavaObject(nestedValues.get(0))); StandardListObjectInspector loi = ioi.loi; StandardStructObjectInspector soi = ioi.soi; PrimitiveObjectInspector oiKey = ioi.oiKey; Object data = nestedValues.get(1); int n = loi.getListLength(data); int i = 0; while (i < n) { Object sValue = loi.getListElement(data, i); List<Object> innerValues = soi.getStructFieldsDataAsList(sValue); Object primValue = innerValues.get(0); Object primKey = innerValues.get(1); addToQueue(primKey, primValue, oiKey); i += 1; } } } ; @Override public AggregationBuffer getNewAggregationBuffer() throws HiveException { StdAgg result = new StdAgg(); reset(result); return result; } @Override public void reset(AggregationBuffer agg) throws HiveException { StdAgg stdAgg = (StdAgg) agg; stdAgg.reset(); } } }
  • 25. Hive
• Join and group by are easy
• Common subexpressions are not optimized
• Dealing with denormalized data can be tricky
• UDFs are implemented at a low level and need to be deployed
• UDAFs (aggregation functions) require expert knowledge
• Dealing with generic UDFs is no fun (see the sketch below)
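Even a trivial generic UDF illustrates the last two points. The following is a minimal sketch (not from the talk; the class name and key format are invented for illustration) that only prefixes an author uid, yet still has to negotiate ObjectInspectors in initialize() and evaluate(), and still has to be packaged into a jar and registered with ADD JAR / CREATE TEMPORARY FUNCTION before it can be used:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

// Hypothetical example UDF (not part of the talk): author_key(authorUid) -> "AU:<authorUid>"
public class UDFAuthorKey extends GenericUDF {

    private StringObjectInspector inputOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Argument types are only known at query compile time, via ObjectInspectors.
        if (arguments.length != 1 || !(arguments[0] instanceof StringObjectInspector)) {
            throw new UDFArgumentException("author_key(string) expects exactly one string argument");
        }
        inputOI = (StringObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object arg = arguments[0].get();
        if (arg == null) {
            return null;
        }
        // Values also arrive behind ObjectInspectors and must be unwrapped explicitly.
        return new Text("AU:" + inputOI.getPrimitiveJavaObject(arg));
    }

    @Override
    public String getDisplayString(String[] children) {
        return "author_key(" + children[0] + ")";
    }
}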
  • 26. MapReduce
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return BinaryData.compare(b1, s1, l1, b2, s2, l2, pair);
}

public int compare(AvroKey<Pair<Long, Long>> x, AvroKey<Pair<Long, Long>> y) {
    return ReflectData.get().compare(x.datum(), y.datum(), pair);
}
  • 27. MapReduce
• Pure map and reduce is very restrictive
• Map-side joins require knowledge of the distributed cache (see the sketch below)
• Both map- and reduce-side joins require assumptions about data sizes
• Constant type juggling
• Hard to glue together
• Hard to test
• Implementing secondary sorting in an AvroMapper is no fun
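The distributed-cache point is easy to get wrong. Below is a hypothetical, minimal sketch (not from the talk) of a map-side join in the Hadoop 2.x mapreduce API; the class name and file layout are invented, and it assumes the author/account mapping fits in memory, both inputs are tab-separated text, and the mapping file was registered on the driver with job.addCacheFile(...):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AuthorKeyJoinMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private final Map<Long, Long> authorToAccount = new HashMap<Long, Long>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small side of the join once per mapper
        // (reading the registered path directly for brevity).
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cacheFiles[0]))));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                authorToAccount.put(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: "<publicationUid>\t<authorUid>"
        String[] parts = value.toString().split("\t");
        long publicationUid = Long.parseLong(parts[0]);
        long authorUid = Long.parseLong(parts[1]);
        Long accountId = authorToAccount.get(authorUid);
        String authorKey = (accountId != null) ? "AC:" + accountId : "AU:" + authorUid;
        context.write(new LongWritable(publicationUid), new Text(authorKey));
    }
}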
  • 28. Flink
public static void buildJob(ExecutionEnvironment env,
                            DataSet<Publication> publications,
                            DataSet<AuthorAccountMapping> authorAccountMappings,
                            OutputFormat<TopCoauthorStats> topCoauthorStats) {
    publications.flatMap(new FlatMapFunction<Publication, Tuple2<Long, Long>>() {
        @Override
        public void flatMap(Publication publication, Collector<Tuple2<Long, Long>> publicationAuthors) throws Exception {
            if (publication.getAuthorships() == null) {
                return;
            }
            for (Authorship authorship : publication.getAuthorships()) {
                if (authorship.getAuthorUid() == null) {
                    continue;
                }
                publicationAuthors.collect(new Tuple2<>(publication.getPublicationUid(), authorship.getAuthorUid()));
            }
        }
    }).coGroup(authorAccountMappings).where(1) // ...
  • 29. Flink for Simple Use Cases
• Fluent API
• Rich set of transformations (see the sketch below)
• Support for Tuples and POJOs
• With some discipline, separation of business logic is possible
• Java 7 API still requires some boilerplate
• No elasticity yet
• Fastest and most fun to implement
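The slide's snippet stops at the coGroup. As a hypothetical illustration of the first two bullets (not the speaker's actual code; the class name and sample data are invented), a per-author top 5 over already-counted coauthor pairs can be expressed in a few chained calls of the 0.9 Java DataSet API:

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class TopCoauthorsSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Invented sample data standing in for the real intermediate result:
        // (authorKey, coauthorKey, count)
        DataSet<Tuple3<String, String, Long>> coauthorCounts = env.fromElements(
                new Tuple3<String, String, Long>("AC:42", "AU:23", 3L),
                new Tuple3<String, String, Long>("AC:42", "AU:25", 1L));

        DataSet<Tuple3<String, String, Long>> topCoauthors = coauthorCounts
                .groupBy(0)                        // group by authorKey
                .sortGroup(2, Order.DESCENDING)    // sort each group by coauthor count
                .first(5);                         // keep the five largest per group

        topCoauthors.print();
    }
}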
  • 30. Performance Comparison
[Bar chart: execution time (0:00:00 to 0:36:00) for Hive (Tez), MapReduce, and Flink at two input sizes, 50 and 100]
Performance and fun – every day

Editor's notes

  Both successful, but in different ways (same note repeated for slides 1–13)