Ken Krugler | President, Scale Unlimited
Faster Workflows, Faster
The Twitter Pitch
• Cascading is a solid, established workflow API
• Good for complex custom ETL workflows
• Flink is a new streaming dataflow engine
• 50% better performance by leveraging memory
Perils of Comparisons
• Performance comparisons are clickbait for devs
• “You won’t believe the speed of Flink!”
• Really, really hard to do well
• I did “moderate” tuning of existing code base
• Somewhat complex workflow in EMR
• Dataset bigger than memory (100M…1B records)
TL;DR
• Flink gets faster by minimizing disk I/O
• Map-Reduce job always has write/read at job break
• Can also spill map output, reduce merge-sort
• Flink has no job boundaries
• And no map-side spills
• So only reduce merge-sort is extra I/O
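As a rough way to see why this matters, the bullets above can be turned into a toy counting model. This is only a sketch: the class and method names are made up for illustration, it counts passes of intermediate data through disk rather than bytes or time, and it assumes each grouping step in the workflow maps to one MapReduce job.

```java
final class DiskPassModel {
    /** MapReduce: one write/read at every job break, plus optional spills per job. */
    static int mapReducePasses(int groupingSteps, boolean mapSpills, boolean reduceSpills) {
        int passes = Math.max(0, groupingSteps - 1); // job breaks between chained jobs
        if (mapSpills) passes += groupingSteps;      // map output spill, once per job
        if (reduceSpills) passes += groupingSteps;   // reduce-side merge-sort, once per job
        return passes;
    }

    /** Flink: no job boundaries and no map-side spills, so only the merge-sort can hit disk. */
    static int flinkPasses(int groupingSteps, boolean reduceSpills) {
        return reduceSpills ? groupingSteps : 0;
    }

    public static void main(String[] args) {
        int steps = 4; // hypothetical workflow with four grouping steps
        System.out.println("MapReduce worst case: " + mapReducePasses(steps, true, true));
        System.out.println("Flink worst case:     " + flinkPasses(steps, true));
    }
}
```

The model says nothing about how large each pass is; it only shows which categories of disk I/O remain once job boundaries and map-side spills are gone.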
TOC
• A very short intro to Cascading
• An even shorter intro to Flink
• An example of converting a workflow
• More in-depth results
In the beginning…
• There was Hadoop, and it was good
• But life is too short to write M-R jobs with K-V data
• Then along came Cascading…
Cascading
What is Cascading?
• A thin Java library on top of Hadoop
• An open source project (Apache License, 8 years old)
• An API for defining and running ETL workflows
30,000ft View
• Records (Tuples) flow through Pipes
• Pipes connect Operations
• You do Operations on Tuples
• Tuples flow into Pipes from Source Taps
• Tuples flow from Pipes into Sink Taps
• This is a data processing workflow (Flow)
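The vocabulary above can be illustrated with a plain-Java analogy. This sketch deliberately does not use the Cascading API: the class `ToyFlow`, the tab-separated input, and the three operations are all assumptions chosen for illustration, with a Java stream standing in for the Pipe and in-memory lists standing in for the Taps.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class ToyFlow {
    /** Runs the toy flow: source tap -> pipe of operations -> sink tap. */
    static List<String> runFlow(List<String> sourceTap) {
        // Operations: each one does something to the tuple passing through.
        Function<String, String[]> parse = line -> line.split("\t");
        Predicate<String[]> hasTwoFields = tuple -> tuple.length == 2;
        Function<String[], String> format = tuple -> tuple[1] + ":" + tuple[0];

        // The stream plays the role of the pipe connecting the operations;
        // the collected list plays the role of the sink tap.
        return sourceTap.stream()
                .map(parse)
                .filter(hasTwoFields)
                .map(format)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sink = runFlow(List.of("1.2.3.4\tUS", "bad line", "5.6.7.8\tDE"));
        System.out.println(sink); // prints [US:1.2.3.4, DE:5.6.7.8]
    }
}
```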
Java API to Define Flow
Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d.]+)\\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);
// logDataPipe is not shown on this slide; it is assumed to supply "Log IP" and "Status" fields
Pipe logAnalysisPipe = new CoGroup(logDataPipe, // left-side pipe
    new Fields("Log IP"), // left-side field for joining
    ipDataPipe, // right-side pipe
    new Fields("Data IP"), // right-side field for joining
    new LeftJoin()); // type of join to do
logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));
Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");
FlowDef flowDef = new FlowDef().setName("log analysis flow")
    .addSource(logDataPipe, logDataTap).addSource(ipDataPipe, ipDataTap)
    .addTailSink(logAnalysisPipe, outputTap);
Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
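To make clear what this flow computes, here is a small in-memory equivalent in plain Java. It is a sketch for illustration only and does not use Cascading: the class `LogAnalysisSketch`, the `{ip, status}` record layout, and the tab-joined result key are assumptions, and the null-country filter is applied before counting rather than after, which gives the same result.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LogAnalysisSketch {
    /**
     * Each log record is {ip, status}; ipToCountry stands in for ip-map.tsv.
     * Returns counts keyed by "country<TAB>status".
     */
    static Map<String, Long> countByCountryAndStatus(List<String[]> logRecords,
                                                     Map<String, String> ipToCountry) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] record : logRecords) {
            // Left join on IP: the country is null when the IP has no match.
            String country = ipToCountry.get(record[0]);
            if (country == null) {
                continue; // drop rows whose country is null
            }
            // Group by (Country, Status) and count.
            counts.merge(country + "\t" + record[1], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> logs = List.of(
                new String[] {"1.1.1.1", "200"},
                new String[] {"1.1.1.1", "404"},
                new String[] {"2.2.2.2", "200"},
                new String[] {"1.1.1.1", "200"},
                new String[] {"9.9.9.9", "200"}); // no entry in the IP map
        Map<String, String> ipMap = Map.of("1.1.1.1", "US", "2.2.2.2", "DE");
        countByCountryAndStatus(logs, ipMap)
                .forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```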
Visualizing the DAG
Things I Like
• “Stream Shaping” - easy to add, drop fields
• Field consistency checking in DAG
• Building blocks - Operations, SubAssemblies
• Flexible planner - MR, local, Tez
Time for a Change
• I’ve used Cascading for 100s of projects
• And made a lot of money consulting on ETL
• But … it was getting kind of boring
Flink
Elevator Pitch
• High throughput/low latency stream processing
• Also supports batch (bounded streams)
• Runs locally, stand-alone, or in YARN
• Super-awesome team
Versus Spark? Sigh…OK
• Very similar in many ways
• Natively streaming, vs. natively batch
• Not as mature, smaller community/ecosystem
Similar to Cascading
• You define a DAG with Java (or Scala) code
• You have data sources and sinks
• Data flows through streams to operators
• The planner turns this into a bunch of tasks
It’s Faster, But…
• I don’t want to rewrite my code
• I use lots of custom Cascading schemes
• I don’t really know Scala
• And POJOs ad nauseam are no fun
• Same for Tuple21<Integer, String, String, …>
Scala is the New APL
val input = env.readFileStream(fileName,100)
.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
.timeWindowAll(Time.of(60, TimeUnit.SECONDS))
.trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
.fold(Set[String]()){(r,i) => { r + i}}
.map{x => (new Timestamp(System.currentTimeMillis()), x.size)}
Tuple21 Hell
public void reduce(Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
        Collector<Tuple3<String, String, Double>> out) {
    for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
        Tuple3<String, String, Integer> idWC = tuple.f0;
        Tuple2<String, Integer> idTW = tuple.f1;
        out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
    }
}
Cascading-Flink
Cascading 3 Planner
• Converts the Cascading DAG into a Flink DAG
• Around 5K lines of code
• DAG it plans looks like Cascading local mode
Boundaries for Data Sets
• Speed == no spill to disk
• Task CPU is the same
• Other than serde time
How Painful?
• Use the FlinkFlowConnector
• Flink Flow planner to convert DAG to job
• Uber jar vs. classic Hadoop jar
• Grungy details of submitting jobs to EMR cluster
Wikiwords Workflow
• Find association between terms and categories
• For every page in Wikipedia, for every term
• Find distance from term to intra-wiki links
• Then calculate statistics to find “strong association”
• Probability is unusually high that a term is close to a link
Timing Test Details
• EMR cluster with 5 i2.xlarge slaves
• ~1 billion input records (term, article ref, distance)
• Hadoop MapReduce took 148 minutes
• Flink took 98 minutes
• So 1.5x faster - nice but not great
• Mostly due to spillage in many boundaries
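The 1.5x figure follows directly from the two run times. A minimal check, with the class and method names chosen only for illustration:

```java
import java.util.Locale;

final class SpeedupCheck {
    /** How many times faster the new run is than the baseline. */
    static double speedup(double baselineMinutes, double newMinutes) {
        return baselineMinutes / newMinutes;
    }

    /** Fraction of the baseline wall-clock time that was saved. */
    static double fractionSaved(double baselineMinutes, double newMinutes) {
        return (baselineMinutes - newMinutes) / baselineMinutes;
    }

    public static void main(String[] args) {
        double mapReduce = 148; // minutes, Hadoop MapReduce run from the timing test
        double flink = 98;      // minutes, Flink run from the timing test
        System.out.println(String.format(Locale.ROOT,
                "%.2fx faster, %.0f%% less wall-clock time",
                speedup(mapReduce, flink), 100 * fractionSaved(mapReduce, flink)));
        // prints: 1.51x faster, 34% less wall-clock time
    }
}
```

So a 1.5x speedup is the same result as roughly one third less run time.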
Summary
If You’re a Java ETL Dev
• And you have to deal with batch big data
• Then the Cascading API is a good fit
• And using Flink typically gives better performance
• While still using a standard Hadoop/YARN cluster
Status of Cascading-Flink
• Still young, but surprisingly robust
• Doesn’t support Full or RightOuter HashJoins
• Pending optimizations
• Tuple serialization
• Flink improvements
Better Planning…
• Defer late-stage join
• Avoid premature resource usage
More questions?
• Feel free to contact me
• http://www.scaleunlimited.com/contact/
• ken@scaleunlimited.com
• Check out Cascading, Flink, and Cascading-Flink
• http://www.cascading.org
• http://flink.apache.org
• http://github.com/dataArtisans/cascading-flink

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 

Dernier (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Faster Workflows, Faster

  • 1. Ken Krugler | President, Scale Unlimited Faster Workflows, Faster
  • 2. The Twitter Pitch • Cascading is a solid, established workflow API •Good for complex custom ETL workflows • Flink is a new streaming dataflow engine •50% better performance by leveraging memory
  • 3. Perils of Comparisons • Performance comparisons are clickbait for devs •“You won’t believe the speed of Flink!” • Really, really hard to do well • I did “moderate” tuning of existing code base •Somewhat complex workflow in EMR •Dataset bigger than memory (100M…1B records)
  • 4. TL;DR • Flink gets faster by minimizing disk I/O •Map-Reduce job always has write/read at job break •Can also spill map output, reduce merge-sort • Flink has no job boundaries •And no map-side spills •So only reduce merge-sort is extra I/O
  • 5. TOC • A very short intro to Cascading •An even shorter intro to Flink • An example of converting a workflow • More in-depth results
  • 6. In the beginning… • There was Hadoop, and it was good •But life is too short to write M-R jobs with K-V data • Then along came Cascading…
  • 8. What is Cascading? • A thin Java library on top of Hadoop •An open source project (APL, 8 years old) • An API for defining and running ETL workflows
  • 9. 30,000ft View • Records (Tuples) flow through Pipes
  • 10. 30,000ft View • Pipes connect Operations
  • 11. 30,000ft View • You do Operations on Tuples
  • 12. 30,000ft View • Tuples flow into Pipes from Source Taps
  • 13. 30,000ft View • Tuples flow from Pipes into Sink Taps
  • 14. 30,000ft View • This is a data processing workflow (Flow)
  • 15. Java API to Define Flow
        Pipe ipDataPipe = new Pipe("ip data pipe");
        RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d.]+)\\t(.*)");
        ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);
        Pipe logAnalysisPipe = new CoGroup(
            logDataPipe,             // left-side pipe
            new Fields("Log IP"),    // left-side field for joining
            ipDataPipe,              // right-side pipe
            new Fields("Data IP"),   // right-side field for joining
            new LeftJoin());         // type of join to do
        logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
        logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
        logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));
        Tap logDataTap = new Hfs(new TextLine(), "access.log");
        Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
        Tap outputTap = new Hfs(new TextLine(), "results");
        FlowDef flowDef = new FlowDef().setName("log analysis flow")
            .addSource(logDataPipe, logDataTap)
            .addSource(ipDataPipe, ipDataTap)
            .addTailSink(logAnalysisPipe, outputTap);
        Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
  • 17. Things I Like • “Stream Shaping” - easy to add, drop fields •Field consistency checking in DAG • Building blocks - Operations, SubAssemblies • Flexible planner - MR, local, Tez
  • 18. Time for a Change • I’ve used Cascading for 100s of projects •And made a lot of money consulting on ETL • But … it was getting kind of boring
  • 19. Flink
  • 20. Elevator Pitch • High throughput/low latency stream processing •Also supports batch (bounded streams) • Runs locally, stand-alone, or in YARN • Super-awesome team
  • 21. Versus Spark? Sigh…OK • Very similar in many ways •Natively streaming, vs. natively batch • Not as mature, smaller community/ecosystem
  • 22. Similar to Cascading • You define a DAG with Java (or Scala) code •You have data sources and sinks • Data flows through streams to operators • The planner turns this into a bunch of tasks
  • 23. It’s Faster, But… • I don’t want to rewrite my code •I use lots of custom Cascading schemes • I don’t really know Scala •And POJOs ad nauseam are no fun •Same for Tuple21<Integer, String, String, …>
  • 24. Scala is the New APL
        val input = env.readFileStream(fileName, 100)
          .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
          .timeWindowAll(Time.of(60, TimeUnit.SECONDS))
          .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
          .fold(Set[String]()) { (r, i) => { r + i } }
          .map { x => (new Timestamp(System.currentTimeMillis()), x.size) }
  • 25. Tuple21 Hell
        public void reduce(
                Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
                Collector<Tuple3<String, String, Double>> out) {
            for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
                Tuple3<String, String, Integer> idWC = tuple.f0;
                Tuple2<String, Integer> idTW = tuple.f1;
                out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
            }
        }
  • 27. Cascading 3 Planner • Converts the Cascading DAG into a Flink DAG •Around 5K lines of code • DAG it plans looks like Cascading local mode
  • 28. Boundaries for Data Sets •Speed == no spill to disk •Task CPU is the same •Other than serde time
  • 29. How Painful? • Use the FlinkFlowConnector •Flink Flow planner to convert DAG to job • Uber jar vs. classic Hadoop jar • Grungy details of submitting jobs to EMR cluster
  • 30. Wikiwords Workflow • Find association between terms and categories •For every page in Wikipedia, for every term •Find distance from term to intra-wiki links • Then calc statistics to find “strong association” •Prob unusually high that term is close to link
  • 31. Timing Test Details • EMR cluster with 5 i2.xlarge slaves •~1 billion input records (term, article ref, distance) • Hadoop MapReduce took 148 minutes • Flink took 98 minutes •So 1.5x faster - nice but not great •Mostly due to spillage in many boundaries
  • 33. If You’re a Java ETL Dev • And you have to deal with batch big data •Then the Cascading API is a good fit • And using Flink typically gives better performance • While still using a standard Hadoop/YARN cluster
  • 34. Status of Cascading-Flink • Still young, but surprisingly robust •Doesn’t support Full or RightOuter HashJoins • Pending optimizations •Tuple serialization •Flink improvements
  • 35. Better Planning… •Defer late-stage join •Avoid premature resource usage
  • 36. More questions? • Feel free to contact me •http://www.scaleunlimited.com/contact/ •ken@scaleunlimited.com • Check out Cascading, Flink, and Cascading-Flink •http://www.cascading.org •http://flink.apache.org •http://github.com/dataArtisans/cascading-flink
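
The timing figures on slide 31 (148 minutes on Hadoop MapReduce, 98 minutes on Flink) can be checked with a short calculation. This is only an illustrative sketch: the class and method names are hypothetical and do not come from the talk, and the only inputs are the two run times quoted on the slide.

```java
public class SpeedupCheck {
    // Ratio of the baseline run time to the candidate run time (higher is better).
    static double speedup(double baselineMinutes, double candidateMinutes) {
        return baselineMinutes / candidateMinutes;
    }

    // Fraction of wall-clock time saved relative to the baseline.
    static double timeSaved(double baselineMinutes, double candidateMinutes) {
        return (baselineMinutes - candidateMinutes) / baselineMinutes;
    }

    public static void main(String[] args) {
        double mapReduceMinutes = 148.0; // from the timing slide
        double flinkMinutes = 98.0;      // from the timing slide
        System.out.printf("speedup: %.2fx%n", speedup(mapReduceMinutes, flinkMinutes));
        System.out.printf("time saved: %.0f%%%n", 100 * timeSaved(mapReduceMinutes, flinkMinutes));
    }
}
```

The ratio works out to about 1.51x, which matches the "1.5x faster" on the slide and the "50% better performance" in the pitch. Expressed as wall-clock time, that is roughly a one-third reduction (about 34%), not a halving.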

Editor's Notes

  1. Before we go further, where did I get that 50% faster number? Did you really tune both systems appropriately? And what’s a reasonable amount of tuning? What version did you use? Oh, there’s a new release that’s way faster. Both are fast-moving targets. Need to test with data at scale. Systems like Flink & Spark make it harder, because if it fits in memory then it’s much faster. But for smallish data sets, who cares if it goes from 15 minutes to 5 minutes, or even 1 minute? Now for a number of ML algorithms that is significant, since you’re iterating. But I’m talking about classic ETL stuff, so you have to do runs that take hours, where data spills to disk, and you’re stressing the system. Workflow was 12 Hadoop jobs, with moderate parallelism. And by bigger than memory, I mean that many of the tasks in the flow were processing 100M -> 1B records. To me, that represents a more typical ETL workflow vs. one with a big reduction early in the workflow. Many newer systems focus on datasets that fit in memory, where they really fly (e.g. ML).
  2. You have to ultimately read the source data, and write the sink data. If the jobs are CPU-bound, then Flink is no faster. The code doesn’t magically get more efficient because Flink is “newer”. Because we’re talking about batch workflows here (not streaming) any group/join has to accumulate all the data in memory.
  3. I've been to many conferences, where I’m at a talk that isn’t right for me. But stuck too close to the front. I'm going to wet my whistle, I won't be offended if you get up and walk out
  4. At my previous startup (Krugle) we were using Hadoop to process open source code. Unless you’re just coding up the Hello World of big data - aka word count.
  5. Not an Apache project
  6. A Tuple is just a list of values.
  7. A pipe has a list of field names & types (like the header row in a CSV file)
  8. I prefer a more fluid builder approach. Cascading has a start on this, but after using Flink I think I now know what I want it to look like.
  9. Without this approach, you’re defining many POJOs (Crunch) or using field indexes to generic arrays of data. You know if a field is missing in a downstream operation when the graph is built, versus at run-time. Encourages code re-use. You can run Cascading workflows locally, or using map-reduce, or (now) on Tez.
  10. Been using Cascading for more than 6 years. So I started checking out streaming options - Storm, then Samza, looked at Kafka streaming.
  11. Apache project, of course. Flink is relatively new - incubator about 2 years ago, graduated to TLP 6 months later.
  12. I can’t say I’ve used either enough to feel good about summarizing. Similar in many ways, focusing on leveraging lots of RAM, iterative calculations, Scala/Java mix, streaming & batch modes. Both are moving targets, so comparisons are hard - e.g. in the past Flink had better memory management, but Spark’s Project Tungsten has improved things a lot.
  13. Remember all those nice building blocks from Cascading? Stream shaping means defining a new POJO class for each such change.
  14. When I go back to look at non-trivial code, I’m like WTF???
  15. Tuple of tuple of tuple… So then I found out that there was this project called Cascading-Flink
  16. It’s all a single step
  17. The graph is the complete flow. Green is where we’re reading from/writing to HDFS, yellow is where data has to collect (group/join operations). So the only explicit read/write ops happen at top/bottom, unlike Hadoop where there are twelve jobs, each with an HDFS write-then-read boundary. Group/join are where data accumulates in memory, and might have to spill to disk if it gets too big. So the performance win is from avoiding disk I/O (writing/reading). Processing time is going to be roughly the same, other than that fewer spills also means less serialization/deserialization.
  18. Pretty much a one line change - honestly. We leverage a cascading.utils library that hides some of this from us. I know, many people say don’t use a Hadoop jar, but it’s proven the most portable/reliable approach for us. Used Hadoop 2.7 in EMR.
  19. Out of all of the terms that are “close to” a link to a page (across all Wikipedia pages), this term has an unusually high probability. And once I’ve got that, plus the categories a page belongs to, I can calculate this same associative value for terms to categories. And with that data in hand, I can extract all the terms from some arbitrary page, sum the category association strengths for any that are known. This gives me a set of categories with scores for an arbitrary chunk of text. That’s the theory, I’m still trying this out to see if it’s really effective.
  20. And time spent doing slower/more general serialization/deserialization using Kryo.
  21. I’ve run into some issues but devs have been very responsive in fixing them. Any improvements in Flink that help DataSet (batch) will help Cascading.
  22. This would be a change in Flink, not Cascading-Flink.
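
The I/O argument in note 17 can be pictured with a toy count of how many extra full passes over intermediate data the job boundaries force. This is a rough sketch under loud assumptions, none of which come from the talk: the jobs are treated as a simple linear chain, each boundary between two consecutive jobs costs exactly one write and one read of the intermediate data, and map-side spills and reduce merge-sorts are ignored. The class and method names are hypothetical.

```java
public class BoundaryIoModel {
    // Extra passes over intermediate data caused by job boundaries.
    // Assumption: jobs form a linear chain, and every boundary between two
    // consecutive jobs writes the data once and reads it back once.
    static int boundaryPasses(int jobsInChain) {
        if (jobsInChain < 1) {
            throw new IllegalArgumentException("need at least one job");
        }
        return 2 * (jobsInChain - 1);
    }

    public static void main(String[] args) {
        int mapReduceJobs = 12; // the workflow ran as 12 Hadoop jobs
        int flinkSteps = 1;     // planned as a single step, so no job boundaries
        System.out.println("MapReduce boundary passes: " + boundaryPasses(mapReduceJobs));
        System.out.println("Flink boundary passes: " + boundaryPasses(flinkSteps));
    }
}
```

Under these assumptions the 12-job chain pays for 22 extra passes over intermediate data, while the single-step plan pays for none. The real workflow is a DAG and both engines can still spill at group/join points, so the model illustrates only where the savings come from, not how large they are.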