Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster.
2. Agenda
•Introduction
• How it works
•Data Processing
•Advanced Processing
•Monitoring
•Testing
•Best Practices
•Cascading GUI
Trend Micro Confidential
3. Introduction
•Hadoop coding is non-trivial
•Hadoop expects one class for the Map step and one
class for the Reduce step
•What if you need multiple Map and Reduce steps in your application?
Who coordinates what can be run in parallel?
•What if you need to do non-Hadoop logic between
Hadoop steps?
•Chain the Operations into data processing workflows
5. Introduction
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );
// the "right hand side" assembly head
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );
// joins the lhs and rhs
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
join = new Every( join, new SomeAggregator() );
// the tail of the assembly
join = new Each( join, new SomeFunction() );
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "join", source, sink, join );
// execute the flow, block until complete
flow.complete();
6. How it works
•Pipe Assemblies become Flows
•Translates a DAG of operations to a DAG of
MapReduce jobs
•All MapReduce jobs in Flow scheduled in
dependency order
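The dependency-order scheduling described above can be sketched in plain Java. This is only an illustration of topological ordering, not Cascading's planner: the class DependencyOrder, its schedule() method, and the job names are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Cascading class): orders jobs so each runs after its inputs exist. */
final class DependencyOrder {

    /**
     * @param dependsOn maps every job name to the jobs whose output it reads;
     *                  every job must appear as a key, possibly with an empty list
     * @return job names ordered so that dependencies always come first
     */
    static List<String> schedule(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new LinkedHashMap<>();          // unmet dependencies per job
        Map<String, List<String>> dependents = new LinkedHashMap<>();  // reverse edges
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String upstream : e.getValue()) {
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Jobs with no unmet dependencies are ready to run (and could run in parallel).
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            order.add(job);
            for (String next : dependents.getOrDefault(job, Collections.<String>emptyList())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalArgumentException("dependency cycle or unknown job");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("lhs", Collections.<String>emptyList());
        deps.put("rhs", Collections.<String>emptyList());
        deps.put("join", java.util.Arrays.asList("lhs", "rhs"));
        System.out.println(schedule(deps));
    }
}
```

Jobs that sit in the ready queue at the same time have no dependency on each other, which is where parallel submission becomes possible.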
8. Data Processing
•Tuple
•A single ‘row’ of data being processed
•Each column is named
•Can access data by name or position
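To make "access by name or position" concrete, here is a minimal plain-Java sketch. NamedRow is a hypothetical stand-in written for this example; it is not Cascading's Tuple class and does not reproduce its API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a row of named columns (not Cascading's own classes). */
final class NamedRow {
    private final List<String> fieldNames;
    private final List<Object> values;

    NamedRow(List<String> fieldNames, List<Object> values) {
        if (fieldNames.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field name");
        }
        this.fieldNames = new ArrayList<>(fieldNames);
        this.values = new ArrayList<>(values);
    }

    /** Access by position, counting from zero. */
    Object get(int position) {
        return values.get(position);
    }

    /** Access by name: look up the column's position, then read the value. */
    Object get(String name) {
        int position = fieldNames.indexOf(name);
        if (position < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(position);
    }

    public static void main(String[] args) {
        NamedRow row = new NamedRow(
            Arrays.asList("url", "count"),
            Arrays.<Object>asList("http://example.com", 3));
        System.out.println(row.get("url") + " " + row.get(1));
    }
}
```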
9. Data Processing
•Tap
•Abstraction on top of Hadoop files
•Allows you to define own parser for files
•Scheme
•TextLine
•TextDelimited
•SequenceFile
•WritableSequenceFile
•Example:
Hfs input = new Hfs( new TextLine(), a_hdfsDirectory + "/" + name );
10. Data Processing
• Tap
•Lfs
•Dfs
•Hfs
•MultiSourceTap
•MultiSinkTap
•TemplateTap
•GlobHfs
•S3fs (deprecated)
11. Data Processing
• TemplateTap
TemplateTap can be used to write tuple streams
out to subdirectories based on the values in
the Tuple instance.
12. Data Processing
• TemplateTap
TextDelimited scheme = new TextDelimited( new Fields( "year",
"month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path );
String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
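The "%s-%s" template uses java.util.Formatter syntax, so its effect on a row can be previewed with String.format. The sketch below is an assumption-laden illustration: TemplatePaths and subdirectory() are hypothetical names, and it assumes the row's values are passed to the formatter in field order, which is not confirmed here for the real TemplateTap.

```java
/** Hypothetical helper: previews the subdirectory a Formatter-style template would produce. */
final class TemplatePaths {

    /**
     * @param basePath parent directory of the sink (illustrative value only)
     * @param template Formatter-style template such as "%s-%s"
     * @param values   the row's values, in field order
     */
    static String subdirectory(String basePath, String template, Object... values) {
        // String.format ignores arguments beyond the last format specifier,
        // so "%s-%s" only consumes the first two values.
        return basePath + "/" + String.format(template, values);
    }

    public static void main(String[] args) {
        // year, month, entry: only year and month end up in the directory name
        System.out.println(subdirectory("output", "%s-%s", "2011", "03", "some entry"));
    }
}
```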
16. Data Processing
• Pipe
• a base class for core processing model types
• Each
• for each “tuple” in data do this to it
• GroupBy
• similar to a ‘group by’ in SQL
• CoGroup
• joins of tuple streams together
• Every
• applies an Aggregator (like count, or sum) or Buffer (a sliding
window) Operation to every group of Tuples that pass through
it.
• SubAssembly
• allows for nesting reusable pipe assemblies into a Pipe class
17. Data Processing
• CoGroup
• InnerJoin
• OuterJoin
• LeftJoin
• RightJoin
• MixedJoin
// fields arriving on the lhs and rhs streams
Fields lhsFields = new Fields( "url", "word", "count" );
Fields rhsFields = new Fields( "url", "sentence", "count" );
// join on "url"; declare new names so the result has no duplicate field names
Fields common = new Fields( "url" );
Fields declared = new Fields( "url1", "word", "wd_count", "url2", "sentence", "snt_count" );
Pipe join = new CoGroup( lhs, common, rhs, common, declared, new InnerJoin() );
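What an inner join on a common field produces can be sketched without Cascading. The class below is hypothetical and uses a simple in-memory hash join on column 0; it only illustrates why the joined row has six columns and therefore needs six declared field names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory illustration of an inner join (not Cascading's CoGroup). */
final class InnerJoinSketch {

    /** Joins rows on column 0; each result row is the lhs row followed by the matching rhs row. */
    static List<List<String>> innerJoin(List<List<String>> lhs, List<List<String>> rhs) {
        // Index the rhs rows by their join key.
        Map<String, List<List<String>>> rhsByKey = new HashMap<>();
        for (List<String> row : rhs) {
            rhsByKey.computeIfAbsent(row.get(0), k -> new ArrayList<>()).add(row);
        }
        // Stream the lhs rows and emit one combined row per match; unmatched rows are dropped.
        List<List<String>> joined = new ArrayList<>();
        for (List<String> left : lhs) {
            List<List<String>> matches =
                rhsByKey.getOrDefault(left.get(0), Collections.<List<String>>emptyList());
            for (List<String> right : matches) {
                List<String> out = new ArrayList<>(left);
                out.addAll(right);
                joined.add(out);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<List<String>> lhs = Arrays.asList(
            Arrays.asList("u1", "cat", "2"),
            Arrays.asList("u2", "dog", "1"));
        List<List<String>> rhs = Arrays.asList(
            Arrays.asList("u1", "a cat sat", "5"));
        // Columns line up with: url1, word, wd_count, url2, sentence, snt_count
        System.out.println(innerJoin(lhs, rhs));
    }
}
```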
18. Data Processing
•Operation
•Define what to do on the data
•Each operations allow logic on the row, such as
parsing dates, creating new attributes etc.
•Every operations allow you to iterate over the
‘group’ of rows to do non-trivial operations.
19. Data Processing
•Function
•Identity Function
•Debug Function
•Sample and Limit Functions
•Insert Function
•Text Functions
•Regular Expression Operations
•Java Expression Operations
•"first-name" is a valid field name for use
with Cascading, but the expression
first-name.trim() will fail, because field names used
in an expression must be valid Java variable names.
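A quick way to spot field names that cannot be used in a Java expression is to check whether they are plain Java identifiers. The helper below is hypothetical and uses only java.lang.Character; it does not rule out reserved words such as "class".

```java
/** Hypothetical check: can this field name be written as a variable in a Java expression? */
final class FieldNameCheck {

    /** True when every character is legal in a Java identifier (keywords are not excluded). */
    static boolean isPlainIdentifier(String name) {
        if (name.isEmpty() || !Character.isJavaIdentifierStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!Character.isJavaIdentifierPart(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // In Java source, "first-name" reads as "first minus name", not as one variable.
        System.out.println(isPlainIdentifier("first-name"));
        System.out.println(isPlainIdentifier("first_name"));
    }
}
```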
22. Data Processing
•Buffer
•It is very similar to the typical Reducer
interface
•It is very useful when header or footer values
need to be inserted into a grouping, or if values
need to be inserted into the middle of the
group values
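The header and footer use case can be sketched in plain Java. The class below is hypothetical and does not implement Cascading's Buffer interface; it only shows the shape of the logic: the operation receives an iterator over one group and decides what to emit before, between, and after the values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of Buffer-style processing over one group of values. */
final class HeaderFooterSketch {

    /** Emits a header, then every value of the group, then a footer carrying the value count. */
    static List<String> wrapGroup(String groupKey, Iterator<String> values) {
        List<String> out = new ArrayList<>();
        out.add("header:" + groupKey);
        int count = 0;
        // The operation itself is responsible for draining the iterator.
        while (values.hasNext()) {
            out.add(values.next());
            count++;
        }
        out.add("footer:" + count);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wrapGroup("doc1", Arrays.asList("a", "b").iterator()));
    }
}
```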
25. Data Processing
•Flow
•To create a Flow, it must be planned through
the FlowConnector object. The connect()
method is used to create new Flow instances
based on a set of sink Taps, source Taps, and a
pipe assembly.
Flow flow = new FlowConnector(new Properties()).connect( "flow-name",
source, sink, pipe );
flow.complete();
26. Data Processing
•MapReduceFlow
•a Flow subclass that supports custom
MapReduce jobs pre-configured via the
JobConf object.
• ProcessFlow
• a Flow subclass that supports custom Riffle
jobs.
27. Data Processing
•Cascades
•Groups of Flows are called Cascades
•Custom MapReduce jobs can participate in
a Cascade
Cascade cascade = cascadeConnector.connect(flow1, flow2, flow3);
cascade.complete();
28. Advanced Processing
•Stream Assertions
•Unit and Regression tests for Flows
•Planner can remove ‘strict’, ‘validating’, or all
assertions
29. Advanced Processing
•Failure Traps
•Catch data causing Operations or Assertions to
fail
•Allows processes to continue without data loss
30. Advanced Processing
•Partial Aggregation instead of Combiners
•trade Memory for IO gains by caching values
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly = new SumBy( assembly, groupingFields, valueField,
sumField, long.class );
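The memory-for-IO trade can be sketched in plain Java. This is not Cascading's SumBy implementation: PartialSum is a hypothetical class, and it assumes a bounded cache with least-recently-used eviction. Partial sums leave the cache when it is full, so the same key may be emitted more than once and the final combine step adds those partials together.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of map-side partial sums held in a bounded LRU cache. */
final class PartialSum {
    private final int capacity;
    // accessOrder=true keeps the least recently used key first in iteration order
    private final LinkedHashMap<String, Long> cache = new LinkedHashMap<>(16, 0.75f, true);
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

    PartialSum(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Adds a value to the cached partial sum, evicting the least recently used key if full. */
    void add(String key, long value) {
        Long current = cache.get(key);
        if (current != null) {
            cache.put(key, current + value);
            return;
        }
        if (cache.size() >= capacity) {
            Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
            Map.Entry<String, Long> eldest = it.next();
            emitted.add(new AbstractMap.SimpleEntry<>(eldest.getKey(), eldest.getValue()));
            it.remove();
        }
        cache.put(key, value);
    }

    /** Emits whatever is still cached and returns every partial emitted so far. */
    List<Map.Entry<String, Long>> flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        }
        cache.clear();
        return new ArrayList<>(emitted);
    }

    /** Reduce-side step: adds up the partial sums for each key. */
    static Map<String, Long> combine(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> e : partials) {
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        PartialSum partial = new PartialSum(2);
        partial.add("2011-01-01", 10);
        partial.add("2011-01-02", 5);
        partial.add("2011-01-01", 7);
        System.out.println(combine(partial.flush()));
    }
}
```

A larger capacity means fewer partials cross the map/reduce boundary at the cost of more memory per task.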
33. Testing
•Use ClusterTestCase if you want to launch an
embedded Hadoop cluster inside your TestCase
•A few validation and Hadoop functions are
provided
•Doesn’t support Hadoop 0.21 testing library
34. Cascading GUI
•Yahoo Pipes
Pipes is a powerful composition tool to aggregate,
manipulate, and mashup content from around the
web.
35. Cascading GUI
•WireIt
WireIt is an open-source JavaScript library to create
web wirable interfaces for dataflow applications,
visual programming languages, graphical modeling,
or graph editors.
Cascading and its extensions have their own Maven/Ivy jar repository.
The 1.2 release runs against Hadoop 0.19.x and 0.20.x, including Amazon Elastic MapReduce, and 0.21.
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files.
At one level Cascading is a MapReduce query planner, just like Pig, except that the Cascading API is for public consumption and fully extensible. In Pig you typically interact with the Pig Latin text syntax. With Cascading, you can layer your own syntax on top of the API.
Cascading's primary programming model is similar to Pig's, but with a Java API.
Given a data set on which you want to run a number of group-bys (group by key1, generate value1, ... group by keyN, generate valueN), Pig would optimize from N to a smaller number (e.g. 1) of reduce runs.
Oozie workflows are actions arranged in a control dependency DAG (directed acyclic graph). Cascading runs as a client from the command line. Oozie is a server system (like the Hadoop JobTracker) to which you submit workflow jobs and later check the status.
By providing a clean API to the core Cascading model, tools like Jython, Groovy, and JRuby can be used instead to define complex processing flows.
The MapReduce Job Planner is an internal feature of Cascading.
Every job is delimited by a temporary file that is the sink of one job and then the source of the next job. The temporary file is deleted whether the flow succeeds or fails. However, this is configurable.
If two or more Flow instances have no dependencies, they will be submitted together so they can execute in parallel.
DAG: directed acyclic graph. An internal graph makes each Flow a 'vertex' and each file an 'edge'. When a vertex has all its incoming edges (files) available, it will be scheduled on the cluster (topological order).
By default, if any outputs from a Flow are newer than the inputs, the Flow is skipped.
I can't customize the combiner and partitioner.
DOT is a plain text graph description language. 7 tools can parse the DOT file.
To see how your Flows are partitioned, call the Flow#writeDOT() method. This will write a DOT file.
The writeDOT API isn't useful for logging.
All Taps must have a Scheme associated with them. If the Tap is about where the data is, and how to get it, the Scheme is about what the data is.
TextLine - reads and writes raw text files and returns Tuples with two field names by default, "offset" and "line".
TextDelimited - delimited text (csv, tsv, etc).
SequenceFile - based on the Hadoop Sequence file, which is a binary format.
WritableSequenceFile - like the SequenceFile Scheme, except it was designed to read and write key and/or value Hadoop Writable objects directly.
MultiSourceTap
The cascading.tap.MultiSourceTap is used to tie multiple Tap instances into a single Tap for use as an input source. The only restriction is that all the Tap instances passed to a new MultiSourceTap share the same Scheme classes (not necessarily the same Scheme instance).
MultiSinkTap
The cascading.tap.MultiSinkTap is used to tie multiple Tap instances into a single Tap for use as an output sink. During runtime, for every Tuple output by the pipe assembly, each child tap of the MultiSinkTap will sink the Tuple.
TemplateTap
TemplateTap can be used to write tuple streams out to subdirectories based on the values in the Tuple instance. The constructor takes an Hfs Tap and a Formatter format syntax String. This allows Tuple values at given positions to be used as directory names. Note that Hadoop can only sink to directories, and all files in those directories are "part-xxxxx" files. openTapsThreshold limits the number of open files to be output to. This value defaults to 300 files. Each time the threshold is exceeded, 10% of the least recently used open files will be closed.
TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path );
String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
GlobHfs (extends MultiSourceTap)
The cascading.tap.GlobHfs Tap accepts Hadoop style 'file globbing' expression patterns. This allows for multiple paths to be used as a single source, where all paths match the given pattern.
Note: the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem) changed. Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.
SinkMode.KEEP - This is the default behavior. If the resource exists, attempting to write to it will fail.
SinkMode.REPLACE - This allows Cascading to delete the file immediately after the Flow is started.
SinkMode.UPDATE - Allows for new Tap types that have the concept of update or append, for example updating records in a database. It is up to the Tap to decide how to implement its "update" semantics. When Cascading sees the update mode, it knows not to attempt to delete the resource first and not to fail because it already exists.
Avro is a data serialization system. Avro provides functionality similar to systems such as Thrift and Protocol Buffers.
Cascading.SimpleDB - integration with Amazon SimpleDB.
It is not required that an Every follow either GroupBy or CoGroup; an Each may follow immediately after. But an Every may not follow an Each. For example: DISTINCT.
The Each pipe may only apply Functions and Filters to the tuple stream, as these operations may only operate on one Tuple at a time.
The Every pipe may only apply Aggregators and Buffers to the tuple stream, as these operations may only operate on groups of tuples, one grouping at a time.
GroupBy supports ordering.
Self joins are supported.
In practice this would fail since the result Tuple has duplicate field names.
A Mixed join is where 3 or more tuple streams are joined, and each pair must be joined differently. See the cascading.pipe.cogroup.MixedJoin class for more details.
When joining two streams via a CoGroup Pipe, attempt to place the largest of the streams in the left most argument to the CoGroup. Joining multiple streams requires some accumulation of values before the join operator can begin, but the left most stream will not be accumulated. This should improve the performance of most joins.
Operation is a superclass of Function, Filter, Aggregator, Buffer, and Assertion.
Function and Filter are Each operations.
Aggregator and Buffer are Every operations.
Custom operations usually extend the BaseOperation class.
Identity Function: discard unused fields, rename all fields, rename a single field.
Debug: DebugLevel enum values are NONE, DEFAULT, or VERBOSE.
FlowConnector.setDebugLevel( properties, DebugLevel.NONE );
Sample: the cascading.operation.filter.Sample filter allows a percentage of tuples to pass.
Limit: the cascading.operation.filter.Limit filter allows a set number of Tuples to pass.
Insert: used when some missing parameter or value, like a date String for the current date, needs to be inserted.
Text Functions: DateParser, DateFormatter.
Regular Expression Operations: RegexParser, RegexSplitter.
Java Expression Operations: ExpressionFunction, ExpressionFilter.
ExpressionFilter filter = new ExpressionFilter( "status != 200", Integer.TYPE );
Some characters will cause compilation errors.
In a custom Operation (Function, Filter, Aggregator, or Buffer), do not store operation state in class fields. For example, if implementing a custom 'counter' Aggregator, do not create a field named 'count' and increment it on every Aggregator.aggregate() call. There is no guarantee your Operation will be called from a single thread in a JVM.
There is a context in which you can record the aggregation value. It's the same as in Hadoop.
A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type.
It differs by the fact that an Iterator is provided and it is the responsibility of the operate(cascading.flow.FlowProcess, BufferCall) method to iterate over all the input arguments returned by this Iterator, if any.
Header, footer.
document_id, term, term_count_in_document, total_terms_in_document
A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type.
AggregateBy is a SubAssembly.
Input and output schemas are verified before running the flow.
The start() method is an asynchronous call.
A Properties object can be set on the FlowConnector, just as you would set a Hadoop JobConf.
Riffle is a lightweight Java library for executing collections of dependent processes as a single process. This library provides Java Annotations for tagging classes and methods supporting required life-cycle stages:
import riffle.process.DependencyIncoming;
import riffle.process.DependencyOutgoing;
import riffle.process.ProcessCleanup;
import riffle.process.ProcessComplete;
import riffle.process.ProcessPrepare;
import riffle.process.ProcessStart;
import riffle.process.ProcessStop;
Assertions aren’t pipes.
When running tests against regression data, it makes sense to use strict assertions. This regression data should be small and represent many of the edge cases the processing assembly must support robustly. When running tests in staging, or with data that may vary in quality since it is from an unmanaged source, using validating assertions makes more sense. Then there are obvious cases where assertions just get in the way and slow down processing, and it would be nice to just bypass them.
Traps were not designed as a filtering mechanism
Since version 1.2.
Cascading does not support the so-called MapReduce Combiners. Combiners are very powerful in that they reduce the IO between the Mappers and Reducers. Why send all your Mapper data to the Reducers when you can compute some values Map side and combine them in the Reducer? But Combiners are limited to Associative and Commutative functions only, like 'sum' and 'max'. And in order to work, values emitted from the Map task must be serialized, sorted (deserialized and compared), deserialized again and operated on, where again the results are serialized and sorted. Combiners trade CPU for gains in IO.
Cascading takes a different approach by providing a mechanism to perform partial aggregations Map side and also combine them Reduce side. But Cascading chooses to trade Memory for IO gains by caching values (up to a threshold). This approach bypasses the unnecessary serialization, deserialization, and sorting steps. It also allows for any aggregate function to be implemented, not just Associative and Commutative ones.
Class AggregateBy is a SubAssembly that serves two roles for handling aggregate operations. Subclasses: AverageBy, CountBy, SumBy.
ClusterTestCase: MiniDFSCluster, MiniMRCluster, FileSystem.
Functions: copyFromLocal, getFileSystem, getJobConf, getProperties.
Limit will get half of the records in version 1.1.
WireIt supports Firefox 3.5 and above. It doesn’t work on Firefox 3.0.
WireIt is released under the MIT License.