Faster Workflows, Faster
1. Ken Krugler | President, Scale Unlimited
2. The Twitter Pitch
• Cascading is a solid, established workflow API
  • Good for complex custom ETL workflows
• Flink is a new streaming dataflow engine
  • 50% better performance by leveraging memory
3. Perils of Comparisons
• Performance comparisons are clickbait for devs
  • “You won’t believe the speed of Flink!”
• Really, really hard to do well
• I did “moderate” tuning of existing code base
  • Somewhat complex workflow in EMR
  • Dataset bigger than memory (100M…1B records)
4. TL;DR
• Flink gets faster by minimizing disk I/O
  • Map-Reduce job always has write/read at job break
  • Can also spill map output, reduce merge-sort
• Flink has no job boundaries
  • And no map-side spills
  • So only reduce merge-sort is extra I/O
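The TL;DR above can be put into back-of-envelope arithmetic: a chain of MapReduce jobs pays an HDFS write plus a read at every job boundary, while a single Flink job only touches HDFS at the source and sink. A minimal sketch (the 12-job count comes from the workflow described in the notes; treating every pass over the data as equal cost is a simplifying assumption):

```java
public class BoundaryIo {
    // HDFS passes for a chain of MapReduce jobs: beyond the initial
    // source read and final sink write, each of the (jobs - 1) job
    // boundaries costs one intermediate write plus one read.
    static int mapReducePasses(int jobs) {
        int sourceAndSink = 2;     // read input once, write output once
        int boundaries = jobs - 1; // write-then-read between adjacent jobs
        return sourceAndSink + 2 * boundaries;
    }

    // A single Flink job has no job boundaries, so only source + sink.
    static int flinkPasses() {
        return 2;
    }

    public static void main(String[] args) {
        int jobs = 12; // the Wikiwords workflow ran as 12 Hadoop jobs
        System.out.println("MapReduce HDFS passes: " + mapReducePasses(jobs)); // 24
        System.out.println("Flink HDFS passes:     " + flinkPasses());         // 2
    }
}
```

This ignores map-side spills and reduce merge-sorts, which is why the measured speedup later in the deck is 1.5x rather than 12x.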
5. TOC
• A very short intro to Cascading
  • An even shorter intro to Flink
• An example of converting a workflow
• More in-depth results
6. In the beginning…
• There was Hadoop, and it was good
  • But life is too short to write M-R jobs with K-V data
• Then along came Cascading…
8. What is Cascading?
• A thin Java library on top of Hadoop
  • An open source project (APL, 8 years old)
• An API for defining and running ETL workflows
15. Java API to Define Flow
Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d.]+)\\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);
Pipe logAnalysisPipe = new CoGroup( logDataPipe, // left-side pipe
new Fields("Log IP"), // left-side field for joining
ipDataPipe, // right-side pipe
new Fields("Data IP"), // right-side field for joining
new LeftJoin()); // type of join to do
logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));
Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");
FlowDef flowDef = new FlowDef().setName("log analysis flow")
.addSource(logDataPipe, logDataTap).addSource(ipDataPipe, ipDataTap)
.addTailSink(logAnalysisPipe, outputTap);
Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
17. Things I Like
• “Stream Shaping” - easy to add, drop fields
  • Field consistency checking in DAG
• Building blocks - Operations, SubAssemblies
• Flexible planner - MR, local, Tez
18. Time for a Change
• I’ve used Cascading for 100s of projects
  • And made a lot of money consulting on ETL
• But … it was getting kind of boring
20. Elevator Pitch
• High throughput/low latency stream processing
  • Also supports batch (bounded streams)
• Runs locally, stand-alone, or in YARN
• Super-awesome team
21. Versus Spark? Sigh…OK
• Very similar in many ways
  • Natively streaming, vs. natively batch
• Not as mature, smaller community/ecosystem
22. Similar to Cascading
• You define a DAG with Java (or Scala) code
  • You have data sources and sinks
• Data flows through streams to operators
• The planner turns this into a bunch of tasks
23. It’s Faster, But…
• I don’t want to rewrite my code
  • I use lots of custom Cascading schemes
• I don’t really know Scala
  • And POJOs ad nauseam are no fun
  • Same for Tuple21&lt;Integer, String, String, …&gt;
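The complaint about positional tuples is easiest to see next to a named-fields model. A toy sketch in plain Java (this is not Cascading's actual Tuple/Fields API, just an illustration of why names beat positions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NamedTuple {
    private final Map<String, Object> values = new LinkedHashMap<>();

    public NamedTuple set(String field, Object value) {
        values.put(field, value);
        return this;
    }

    public Object get(String field) {
        if (!values.containsKey(field)) {
            // Fail fast with the field name, instead of a silent
            // positional mix-up like tuple.f7 vs. tuple.f8.
            throw new IllegalArgumentException("Unknown field: " + field);
        }
        return values.get(field);
    }

    public static void main(String[] args) {
        NamedTuple t = new NamedTuple().set("Country", "US").set("Status", 200);
        System.out.println(t.get("Country")); // self-describing access by name
    }
}
```

Cascading goes further than this sketch: because field names live in the pipe definitions, a missing field is caught when the DAG is built, not at run time.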
24. Scala is the New APL
val input = env.readFileStream(fileName, 100)
  .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .timeWindowAll(Time.of(60, TimeUnit.SECONDS))
  .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
  .fold(Set[String]()) { (r, i) => r + i }
  .map { x => (new Timestamp(System.currentTimeMillis()), x.size) }
27. Cascading 3 Planner
• Converts the Cascading DAG into a Flink DAG
  • Around 5K lines of code
• DAG it plans looks like Cascading local mode
28. Boundaries for Data Sets
• Speed == no spill to disk
• Task CPU is the same
  • Other than serde time
29. How Painful?
• Use the FlinkFlowConnector
  • Flink Flow planner to convert DAG to job
• Uber jar vs. classic Hadoop jar
• Grungy details of submitting jobs to EMR cluster
30. Wikiwords Workflow
• Find association between terms and categories
  • For every page in Wikipedia, for every term
  • Find distance from term to intra-wiki links
• Then calc statistics to find “strong association”
  • Prob unusually high that term is close to link
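The slide doesn't spell out the statistic, so here is one hedged interpretation (the method name and formula are my assumption, not from the talk): compare a term's observed rate of landing within some distance of a link against the corpus-wide baseline rate.

```java
public class Association {
    // Ratio of a term's observed near-link rate to the corpus-wide
    // baseline rate; a score well above 1.0 marks the "unusually high
    // probability" the slide describes as a strong association.
    static double associationScore(long termNearLink, long termTotal,
                                   long allNearLink, long allTotal) {
        double observed = (double) termNearLink / termTotal;
        double baseline = (double) allNearLink / allTotal;
        return observed / baseline;
    }

    public static void main(String[] args) {
        // Toy counts: the term appears near a link 80 of 100 times,
        // while the corpus baseline is 200K of 1M occurrences.
        double score = associationScore(80, 100, 200_000, 1_000_000);
        System.out.println(score); // 0.8 / 0.2 = 4.0 -> strong association
    }
}
```

A real implementation would also need a significance test so rare terms with a handful of lucky hits don't score high, but that detail isn't in the deck.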
31. Timing Test Details
• EMR cluster with 5 i2.xlarge slaves
  • ~1 billion input records (term, article ref, distance)
• Hadoop MapReduce took 148 minutes
• Flink took 98 minutes
  • So 1.5x faster - nice but not great
  • Mostly due to spillage in many boundaries
33. If You’re a Java ETL Dev
• And you have to deal with batch big data
  • Then the Cascading API is a good fit
• And using Flink typically gives better performance
• While still using a standard Hadoop/YARN cluster
34. Status of Cascading-Flink
• Still young, but surprisingly robust
  • Doesn’t support Full or RightOuter HashJoins
• Pending optimizations
  • Tuple serialization
  • Flink improvements
36. More questions?
• Feel free to contact me
  • http://www.scaleunlimited.com/contact/
  • ken@scaleunlimited.com
• Check out Cascading, Flink, and Cascading-Flink
  • http://www.cascading.org
  • http://flink.apache.org
  • http://github.com/dataArtisans/cascading-flink
Editor's notes
Before we go further, where did I get that 50% faster number?
Did you really tune both systems appropriately? And what’s a reasonable amount of tuning?
What version did you use? Oh, there’s a new release that’s way faster. Both are fast-moving targets.
Need to test with data at scale. Systems like Flink & Spark make it harder, because if it fits in memory then it’s much faster
But for smallish data sets, who cares if it goes from 15 minutes to 5 minutes, or even 1 minute?
Now for a number of ML algorithms that is significant, since you’re iterating.
But I’m talking about classic ETL stuff, so you have to do runs that take hours, where data spills to disk, and you’re stressing the system.
Workflow was 12 Hadoop jobs, with moderate parallelism
And by bigger than memory, I mean that many of the tasks in the flow were processing 100M -> 1B records.
To me, represents a more typical ETL vs. big reduction early in workflow.
Many newer systems focus on datasets that fit in memory, where they really fly (e.g. ML)
You have to ultimately read the source data, and write the sink data.
If the jobs are CPU-bound, then Flink is no faster. The code doesn’t magically get more efficient because Flink is “newer”.
Because we’re talking about batch workflows here (not streaming) any group/join has to accumulate all the data in memory.
I've been to many conferences where I'm at a talk that isn't right for me.
But I was stuck too close to the front to leave.
So while I wet my whistle, I won't be offended if you get up and walk out.
At my previous startup (Krugle) we were using Hadoop to process open source code.
Unless you’re just coding up the Hello World of big data - aka word count.
Not an Apache project
A Tuple is just a list of values.
A pipe has a list of field names & types (like the header row in a CSV file)
I prefer a more fluid builder approach.
Cascading has a start on this, but after using Flink I think I now know what I want it to look like.
Without this approach, you’re defining many POJOs (Crunch) or using field indexes into generic arrays of data
You know if a field is missing in a downstream operation when the graph is built, versus at run-time
Encourages code re-use
You can run Cascading workflows locally, or using map-reduce, or (now) on Tez.
Been using Cascading for more than 6 years.
So I started checking out streaming options - Storm, then Samza, looked at Kafka streaming.
Apache project, of course
Flink is relatively new - incubator about 2 years ago, graduated to TLP 6 months later.
I can’t say I’ve used either enough to feel good about summarizing.
Similar in many ways, focusing on leveraging lots of RAM, iterative calculations, Scala/Java mix, streaming & batch modes
Both are moving targets, so comparisons are hard - e.g. in the past Flink had better memory management, but Spark’s Project Tungsten has improved things a lot.
Remember all those nice building blocks from Cascading?
Stream shaping means defining a new POJO class for each such change
When I go back to look at non-trivial code, I’m like WTF???
Tuple of tuple of tuple…
So then I found out that there was this project called Cascading-Flink
It’s all a single step
The graph is the complete flow.
Green is where we’re reading from/writing to HDFS, yellow are where data has to collect (group/join operations)
So the only explicit read/write ops happen at top/bottom, unlike Hadoop where there are twelve jobs, each with an HDFS write-then-read boundary.
Group/join are where data accumulates in memory, and might have to spill to disk if it gets too big.
So the performance win is from avoiding disk I/O (writing/reading).
Processing time is going to be roughly the same, though fewer spills also mean less serialization/deserialization.
Pretty much a one line change - honestly. We leverage a cascading.utils library that hides some of this from us.
I know, many people say don’t use a Hadoop jar, but it’s proven the most portable/reliable approach for us.
Used Hadoop 2.7 in EMR
Out of all of the terms that are “close to” a link to a page (across all Wikipedia pages), this term has an unusually high probability.
And once I’ve got that, plus the categories a page belongs to, I can calculate this same associative value for terms to categories.
And with that data in hand, I can extract all the terms from some arbitrary page, sum the category association strengths for any that are known.
This gives me a set of categories with scores for an arbitrary chunk of text.
That’s the theory, I’m still trying this out to see if it’s really effective.
And time spent doing slower/more general serialization/deserialization using Kryo
I’ve run into some issues but devs have been very responsive in fixing them.
Any improvements in Flink that help DataSet (batch) will help Cascading
This would be a change in Flink, not Cascading-Flink