“Enterprise Data Workflows with Cascading”
• Cascading is an application framework that lets developers, data analysts, and data scientists develop robust data analytics and data management applications on Apache Hadoop.
• Cascading is a query API and query planner used for defining, sharing, and executing data processing workflows on a distributed data grid or cluster based on Apache Hadoop.
• Cascading greatly reduces the complexity of Hadoop application development, job creation, and job scheduling.
• Cascading was developed to allow organizations to rapidly develop complex data processing applications.
Cascading API
Tap
• LFS
• DFS
• HFS
• MultiSourceTap
• MultiSinkTap
• TemplateTap
• GlobHfs
• S3fs (deprecated)
Scheme
• TextLine
• TextDelimited
• SequenceFile
• WritableSequenceFile
Sink modes
• SinkMode.KEEP
• SinkMode.REPLACE
• SinkMode.UPDATE
Tuple
• A single ‘row’ of data being processed
• Each column is named
• Can access data by name or position
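The idea of a row whose columns can be read either by name or by position can be sketched in plain Java. This is a toy model only: the class `NamedRow` and its methods are hypothetical and are not Cascading's own Tuple classes.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a named row; NOT a Cascading class.
final class NamedRow {
    private final List<String> names;   // column names, in order
    private final List<Object> values;  // column values, in the same order

    NamedRow(List<String> names, List<Object> values) {
        if (names.size() != values.size()) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        this.names = names;
        this.values = values;
    }

    // Access by position (0-based).
    Object get(int position) {
        return values.get(position);
    }

    // Access by column name.
    Object get(String name) {
        int position = names.indexOf(name);
        if (position < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(position);
    }

    public static void main(String[] args) {
        NamedRow row = new NamedRow(
            Arrays.asList("id", "name"),
            Arrays.<Object>asList(1, "alice"));
        System.out.println(row.get(0));       // prints 1
        System.out.println(row.get("name"));  // prints alice
    }
}
```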
How to create a Tap?
Fields fields = new Fields( "field_1", "field_2", "...", "field_n" );
TextDelimited scheme = new TextDelimited( fields, "\t" );
Hfs input = new Hfs( scheme, "path/to/hdfs_file", SinkMode.REPLACE );
Pipe Assemblies
• Pipe assemblies define what work is done against a tuple stream; at runtime, tuple streams are read from Tap sources and written to Tap sinks.
• Pipe assemblies may have multiple sources and multiple sinks, and they can define splits, merges, and joins to control how the tuple streams interact.
Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Function
[Applicable to Each operation]
• Identity Function
• Debug Function
• Sample and Limit Functions
• Insert Function
• Text Functions
• Regular Expression Operations
• Java Expression Operations
• Note: field names used in a Java expression must be valid Java variable names. "first-name" is a valid field name for use with Cascading, but the expression first-name.trim() will fail.
• Custom Functions
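The restriction on field names in Java expressions can be checked before an expression is built. The sketch below is plain Java and assumes nothing about Cascading itself: `FieldNameCheck` and `isJavaIdentifier` are hypothetical names, and the check ignores reserved words such as `class`.

```java
// Hypothetical helper, not part of Cascading.
final class FieldNameCheck {
    // True if the name is syntactically usable as a Java variable name.
    // Reserved words are not rejected in this sketch.
    static boolean isJavaIdentifier(String name) {
        if (name.isEmpty() || !Character.isJavaIdentifierStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!Character.isJavaIdentifierPart(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifier("first-name")); // prints false
        System.out.println(isJavaIdentifier("first_name")); // prints true
    }
}
```

Renaming a field such as "first-name" to "first_name" before the expression is applied avoids the failure.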
Filter
[Applicable to Each operation]
• And
• Or
• Not
• Xor
• NotNull
• Null
• RegexFilter
• Custom Filter
Aggregator
[Applicable to Every operation]
• Average
• Count
• First
• Last
• Max
• Min
• Sum
• Custom Aggregator
Buffer
• It is very similar to the typical Reducer interface
• It is very useful when header or footer values need to be inserted into a
grouping, or if values need to be inserted into the middle of the group
values
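The header and footer use case can be illustrated with a reducer-style function in plain Java. This is only a sketch of the idea: `HeaderFooterBuffer` and `operate` are hypothetical names, the BEGIN/END markers are invented for illustration, and none of it is the Cascading Buffer interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Reducer-style sketch; NOT the Cascading Buffer interface.
final class HeaderFooterBuffer {
    // Called once per grouping with an iterator over that group's values.
    static List<String> operate(String groupKey, Iterator<String> values) {
        List<String> out = new ArrayList<>();
        out.add("BEGIN " + groupKey);                 // header row
        int count = 0;
        while (values.hasNext()) {
            out.add(values.next());                   // pass each value through
            count++;
        }
        out.add("END " + groupKey + " (" + count + " values)"); // footer row
        return out;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("x", "y");
        for (String row : operate("k", group.iterator())) {
            System.out.println(row);
        }
    }
}
```

Because the function sees the whole group through one iterator, it can emit rows before, between, or after the group's values, which a per-value aggregator cannot do.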
CoGroup
• InnerJoin
• OuterJoin
• LeftJoin
• RightJoin
• MixedJoin
Flow
• To create a Flow, it must be planned through the FlowConnector object.
• The connect() method is used to create new Flow instances based on a set of sink Taps, source Taps, and a pipe assembly.
Cascades
• A group of Flows is called a Cascade
• Custom MapReduce jobs can participate in a Cascade
Flow flow = new FlowConnector(new Properties()).connect( "flow-name", source, sink, pipe );
flow.complete();
CascadeConnector cascadeConnector = new CascadeConnector();
Cascade cascade = cascadeConnector.connect(flow1, flow2, flow3);
cascade.complete();
LHS = [0,a] [1,b] [2,c] RHS = [0,A] [2,C] [3,D]
InnerJoin: [0,a,0,A] [2,c,2,C]
OuterJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C] [null,null,3,D]
LeftJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C]
RightJoin: [0,a,0,A] [2,c,2,C] [null,null,3,D]
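The four result sets above can be reproduced with a small plain-Java sketch. It is an illustration of the join semantics only, not Cascading's joiner classes: `JoinSketch`, `Kind`, and `join` are hypothetical names, and the sketch assumes at most one value per key on each side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrates join semantics only; NOT Cascading's joiner classes.
final class JoinSketch {
    enum Kind { INNER, OUTER, LEFT, RIGHT }

    // Assumes at most one value per key on each side.
    static List<String> join(Map<Integer, String> lhs, Map<Integer, String> rhs, Kind kind) {
        TreeSet<Integer> keys = new TreeSet<>(lhs.keySet());
        keys.addAll(rhs.keySet());
        List<String> out = new ArrayList<>();
        for (Integer key : keys) {
            boolean inLeft = lhs.containsKey(key);
            boolean inRight = rhs.containsKey(key);
            boolean keep;
            switch (kind) {
                case INNER: keep = inLeft && inRight; break;
                case LEFT:  keep = inLeft; break;
                case RIGHT: keep = inRight; break;
                default:    keep = true; break; // OUTER keeps every key
            }
            if (!keep) {
                continue;
            }
            // A missing side is padded with nulls, as in the table above.
            String left = inLeft ? key + "," + lhs.get(key) : "null,null";
            String right = inRight ? key + "," + rhs.get(key) : "null,null";
            out.add("[" + left + "," + right + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> lhs = new TreeMap<>();
        lhs.put(0, "a"); lhs.put(1, "b"); lhs.put(2, "c");
        Map<Integer, String> rhs = new TreeMap<>();
        rhs.put(0, "A"); rhs.put(2, "C"); rhs.put(3, "D");
        for (Kind kind : Kind.values()) {
            System.out.println(kind + ": " + join(lhs, rhs, kind));
        }
    }
}
```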
Flow Connector
• HadoopFlowConnector
• LocalFlowConnector
Monitoring
Implement FlowListener interface
• onStarting
• onStopping
• onCompleted
• onThrowable
Testing
• Use ClusterTestCase to launch an embedded Hadoop cluster inside your test case
• A few validation and Hadoop functions are provided
• The Hadoop 0.21 testing library is not supported
• Use CascadingTestCase with JUnit
Field Algebra
Field sets are constant values on the Fields class and can be used in many places where a Fields instance is expected. They are:
• Fields.ALL
• Fields.RESULTS
• Fields.REPLACE
• Fields.SWAP
• Fields.ARGS
• Fields.GROUP
• Fields.VALUES
• Fields.UNKNOWN
Pipe assembly = new Each( assembly, Fields.ALL, function, Fields.RESULTS );
Word Counting the Hadoop Way with Cascading
As “word counting” has become the customary “Hello, World!” application for programmers who are new to Hadoop and the MapReduce paradigm, let’s look at an example in Cascading on Hadoop:
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps (tab-delimited text with a header line)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
A few notes on the overall logic of the preceding code listing:
• The code creates source and sink Taps for the input and output data, then applies a “RegexSplitGenerator” (which splits the text into words at spaces and punctuation) inside an “Each” pipe (the “docPipe”) to turn the document text into a token stream.
• A “GroupBy” groups the stream by token and an “Every” with “Count” counts the occurrences of each token; the taps and pipes are then connected together, as in a “cascade”, via the “flowDef” object.
• Finally, a DOT file is generated to depict the Cascading flow graphically.
The DOT file can be loaded into OmniGraffle or Visio and is helpful for troubleshooting MapReduce workflows in Cascading.
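To see what the split, group, and count steps compute without a Hadoop cluster, the same logic can be approximated in plain Java. This is a sketch rather than Cascading code: `WordCountSketch` and `count` are hypothetical names, the delimiter pattern assumes the listing splits on space, brackets, parentheses, comma, and period, and skipping empty tokens is a simplification made here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java approximation of the word count; NOT Cascading code.
final class WordCountSketch {
    // Assumed delimiter class: space, [, ], (, ), comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // "Each" step: split one line into tokens.
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // simplification: adjacent delimiters yield empty tokens
                }
                // "GroupBy" plus "Every/Count" step: one counter per distinct token.
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList(
            "a rain shadow (is) a dry area.",
            "a rain, shadow"));
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```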
http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/
Data Processing with Cascading Java API on Apache Hadoop

  • 1. “Enterprise Data Workflows with Cascading”
      • Cascading is an application framework that lets developers, data analysts, and data scientists build robust data analytics and data management applications on Apache Hadoop.
      • Cascading is a query API and query planner for defining, sharing, and executing data processing workflows on a distributed data grid or cluster based on Apache Hadoop.
      • Cascading greatly simplifies Hadoop application development, job creation, and job scheduling.
      • Cascading was developed to let organizations rapidly develop complex data processing applications.
  • 2. Cascading API
      Tap
      • Lfs
      • Dfs
      • Hfs
      • MultiSourceTap
      • MultiSinkTap
      • TemplateTap
      • GlobHfs
      • S3fs (deprecated)
      Scheme
      • TextLine
      • TextDelimited
      • SequenceFile
      • WritableSequenceFile
      Tap types
      • SinkMode.KEEP
      • SinkMode.REPLACE
      • SinkMode.UPDATE
      Tuple
      • A single “row” of data being processed
      • Each column is named
      • Data can be accessed by name or by position
      How to create a Tap?
      Fields fields = new Fields("field_1", "field_2", "...", "field_n");
      TextDelimited scheme = new TextDelimited(fields, "\t");
      Hfs input = new Hfs(scheme, "path/to/hdfs_file", SinkMode.REPLACE);
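The Tuple slide says a row can be read by column name or by position. As a rough illustration of that idea only, here is a plain-Java sketch that does not use Cascading at all; `NamedRow` and its `get` methods are hypothetical names invented for this example, not part of the Cascading API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a row of values paired with its field names.
// This is NOT Cascading's Tuple/Fields API; it only mimics the idea.
final class NamedRow {
    private final List<String> fieldNames;
    private final List<Object> values;

    NamedRow(List<String> fieldNames, List<Object> values) {
        if (fieldNames.size() != values.size()) {
            throw new IllegalArgumentException("one value per field is required");
        }
        this.fieldNames = fieldNames;
        this.values = values;
    }

    // Positional access: 0-based column index.
    Object get(int position) {
        return values.get(position);
    }

    // Named access: resolve the field name to a position first.
    Object get(String fieldName) {
        int position = fieldNames.indexOf(fieldName);
        if (position < 0) {
            throw new IllegalArgumentException("unknown field: " + fieldName);
        }
        return values.get(position);
    }

    public static void main(String[] args) {
        NamedRow row = new NamedRow(
            Arrays.asList("id", "name"),
            Arrays.<Object>asList(7, "alice"));
        System.out.println(row.get("name")); // access by name
        System.out.println(row.get(0));      // access by position
    }
}
```

Both lookups end up reading from the same list of values; the name is only an alias for a column index.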
  • 3. Pipe Assemblies
      • Pipe assemblies define what work should be done against a tuple stream. At runtime, tuple streams are read from Tap sources and written to Tap sinks.
      • Pipe assemblies may have multiple sources and multiple sinks, and they can define splits, merges, and joins to control how the tuple streams interact.
      Assemblies
      • Pipe
      • Each
      • GroupBy
      • CoGroup
      • Every
      • SubAssembly
      Function [applicable to the Each operation]
      • Identity Function
      • Debug Function
      • Sample and Limit Functions
      • Insert Function
      • Text Functions
      • Regular Expression Operations
      • Java Expression Operations
        • "first-name" is a valid field name for use with Cascading, but the expression first-name.trim() will fail, because first-name is not a valid Java identifier.
      • Custom Functions
      Flow flow = new FlowConnector(new Properties()).connect( "flow-name", source, sink, pipe );
      flow.complete();
  • 4. Filter [applicable to the Each operation]
      • And
      • Or
      • Not
      • Xor
      • NotNull
      • Null
      • RegexFilter
      • Custom Filter
      Aggregator [applicable to the Every operation]
      • Average
      • Count
      • First
      • Last
      • Max
      • Min
      • Sum
      • Custom Aggregator
      Buffer
      • It is very similar to the typical Reducer interface.
      • It is very useful when header or footer values need to be inserted into a grouping, or when values need to be inserted into the middle of the group values.
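To make the Buffer description more concrete, the following plain-Java sketch mimics the Reducer-like shape: one call per group, an iterator over that group's values, and freedom to emit extra rows such as a header and a footer. It does not use the Cascading `Buffer` interface; `BufferSketch`, `operate`, and the HEADER/FOOTER row format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration only; this is not Cascading's Buffer interface.
final class BufferSketch {
    // Called once per group: receives the grouping key and an iterator over
    // the group's values, and may emit any number of output rows.
    static List<String> operate(String groupKey, Iterator<String> values) {
        List<String> out = new ArrayList<>();
        out.add("HEADER " + groupKey);          // row inserted before the group
        int rows = 0;
        while (values.hasNext()) {
            out.add(values.next());             // pass each value through
            rows++;
        }
        out.add("FOOTER " + groupKey + " rows=" + rows); // row inserted after
        return out;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("a", "b");
        // prints [HEADER g1, a, b, FOOTER g1 rows=2]
        System.out.println(operate("g1", group.iterator()));
    }
}
```

An Aggregator, by contrast, folds a group down to a result value; the point of the Buffer shape is that it sees the whole group and decides what to emit and where.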
  • 5. CoGroup
      • InnerJoin
      • OuterJoin
      • LeftJoin
      • RightJoin
      • MixedJoin
      Flow
      • To create a Flow, it must be planned through the FlowConnector object.
      • The connect() method creates new Flow instances from a set of sink Taps, source Taps, and a pipe assembly.
      Cascades
      • A group of Flows is called a Cascade.
      • Custom MapReduce jobs can participate in a Cascade.
      Flow flow = new FlowConnector(new Properties()).connect( "flow-name", source, sink, pipe );
      flow.complete();
      CascadeConnector cascadeConnector = new CascadeConnector();
      Cascade cascade = cascadeConnector.connect(flow1, flow2, flow3);
      cascade.complete();
      Join example
      LHS = [0,a] [1,b] [2,c]
      RHS = [0,A] [2,C] [3,D]
      InnerJoin: [0,a,0,A] [2,c,2,C]
      OuterJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C] [null,null,3,D]
      LeftJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C]
      RightJoin: [0,a,0,A] [2,c,2,C] [null,null,3,D]
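The LHS/RHS results on slide 5 can be reproduced with a few lines of plain Java. The sketch below is not Cascading code and does not use CoGroup or its joiner classes; `JoinSketch` and the two boolean flags are hypothetical, and it assumes at most one row per key on each side, which holds for the slide's data. MixedJoin is not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java illustration of the four join results; not Cascading's CoGroup.
final class JoinSketch {
    // Joins [key, value] rows on the key. The flags decide whether rows
    // without a partner on the other side are kept, padded with nulls.
    static List<List<Object>> join(Map<Integer, String> lhs, Map<Integer, String> rhs,
                                   boolean keepUnmatchedLhs, boolean keepUnmatchedRhs) {
        TreeSet<Integer> keys = new TreeSet<>(lhs.keySet());
        keys.addAll(rhs.keySet());
        List<List<Object>> out = new ArrayList<>();
        for (Integer key : keys) {
            boolean inLhs = lhs.containsKey(key);
            boolean inRhs = rhs.containsKey(key);
            if (inLhs && inRhs) {
                out.add(Arrays.<Object>asList(key, lhs.get(key), key, rhs.get(key)));
            } else if (inLhs && keepUnmatchedLhs) {
                out.add(Arrays.<Object>asList(key, lhs.get(key), null, null));
            } else if (inRhs && keepUnmatchedRhs) {
                out.add(Arrays.<Object>asList(null, null, key, rhs.get(key)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> lhs = new TreeMap<>();
        lhs.put(0, "a"); lhs.put(1, "b"); lhs.put(2, "c");
        Map<Integer, String> rhs = new TreeMap<>();
        rhs.put(0, "A"); rhs.put(2, "C"); rhs.put(3, "D");

        System.out.println("InnerJoin: " + join(lhs, rhs, false, false));
        System.out.println("OuterJoin: " + join(lhs, rhs, true, true));
        System.out.println("LeftJoin:  " + join(lhs, rhs, true, false));
        System.out.println("RightJoin: " + join(lhs, rhs, false, true));
    }
}
```

The four join types differ only in which unmatched side they keep: neither (inner), both (outer), only the left, or only the right.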
  • 6. Flow Connector
      • HadoopFlowConnector
      • LocalFlowConnector
      Monitoring
      Implement the FlowListener interface:
      • onStarting
      • onStopping
      • onCompleted
      • onThrowable
      Testing
      • Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase.
      • A few validation and Hadoop functions are provided.
      • The Hadoop 0.21 testing library is not supported.
      • CascadingTestCase can be used with JUnit.
  • 7. Field Algebra
      Field sets are constant values on the Fields class and can be used in many places where a Fields instance is expected. They are:
      • Fields.ALL
      • Fields.RESULTS
      • Fields.REPLACE
      • Fields.SWAP
      • Fields.ARGS
      • Fields.GROUP
      • Fields.VALUES
      • Fields.UNKNOWN
      Pipe assembly = new Each( assembly, Fields.ALL, function, Fields.RESULTS );
  • 8. Word Counting the Hadoop Way with Cascading
      As “word counting” has become the customary “Hello, World!” application for programmers who are new to Hadoop and the MapReduce paradigm, let’s have a look at an example in Cascading on Hadoop:
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
      // specify a regex operation to split the "document" text lines into a token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );
      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  • 9. A few notes on the overall logic of the preceding code listing:
      • The code creates Tap objects for the input and output of data, then uses a “RegexSplitGenerator” (which applies a regex to split the text on word boundaries) inside an iterator (the “Each” object). The Each returns another Pipe (the “docPipe”) that splits the document text into a token stream.
      • A “GroupBy” groups the stream by token and an “Every” with “Count” counts the occurrences of each token. The pipes are then connected together, like in a “cascade”, via the “flowDef” object.
      • Finally, a DOT file is generated to depict the Cascading flow graphically. The DOT file can be loaded into OmniGraffle or Visio and is really helpful for troubleshooting MapReduce workflows in Cascading.
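For readers who want to see the split, group, and count steps without a Hadoop cluster, the same logic can be sketched in plain Java. This is not Cascading code: `WordCountSketch` and `count` are hypothetical names, the delimiter set is an assumption about what the slide's regex is meant to match (space, square brackets, parentheses, comma, period), and skipping empty tokens is an addition made here to keep the counts tidy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java illustration of split -> group -> count; not Cascading code.
final class WordCountSketch {
    // Assumed delimiter set: space, [, ], (, ), comma, and period.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    static Map<String, Long> count(List<String> lines) {
        // TreeMap keeps tokens sorted, so the output order is deterministic.
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // "Each" step: split one text line into a stream of tokens.
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // adjacent delimiters yield empty tokens; skipped here
                }
                // "GroupBy" + "Every(Count)" step: one running count per token.
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow, rain (shadow) rain.");
        System.out.println(count(lines)); // prints {rain=3, shadow=2}
    }
}
```

In the Cascading version the grouping and counting are distributed across the cluster, whereas this sketch keeps everything in one in-memory map.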
  • 10. ??