Cascading
 Nathan Marz
  BackType
What is Cascading?

Cascading is a Java library that makes development of
   complex Hadoop MapReduce workflows easy
Why Hadoop?


• Process large amounts of data in a scalable,
  fault-tolerant way
Why Cascading?
    Tool                How you feel
    Hadoop MapReduce    [image not preserved]
    Cascading           [image not preserved]
Tuples
Cascading represents all data as “Tuples”

       (“the man sat”,   25)
       (“hello dolly”,   42)
       (“say hello”,      1)
       (“the woman sat”, 10)
Tuples
Tuples are named, ordered fields

       [“sentence”, “value”]
       (“the man sat”,   25)
       (“hello dolly”,   42)
       (“say hello”,      1)
       (“the woman sat”, 10)
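
As a rough illustration of "named, ordered fields", the sketch below models the slide's data in plain Python. It is not the Cascading API: `FIELDS`, `TUPLES`, and `get` are names made up for this example, and the only point is that a value can be reached either by position or by field name.

```python
# Plain-Python model of the slide's data; not the Cascading API.
FIELDS = ("sentence", "value")  # field names, in declaration order

TUPLES = [
    ("the man sat", 25),
    ("hello dolly", 42),
    ("say hello", 1),
    ("the woman sat", 10),
]


def get(tup, name, fields=FIELDS):
    """Look up a value by field name via the field's position."""
    return tup[fields.index(name)]


first = TUPLES[0]
by_position = first[1]         # second field, by index
by_name = get(first, "value")  # same field, by name
```

Both lookups return the same value because the name "value" simply stands for position 1 in the field list.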
Flow
  A flow is a sequence of manipulations on pipes of tuple streams
• A flow compiles to one or more MapReduce jobs
• Inputs and outputs are called “Taps”
• Each Tap produces or receives a pipe of tuples with the same format
• A flow can have multiple inputs and multiple outputs
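
The idea of a flow (source Tap, a sequence of manipulations, sink Tap) can be sketched in plain Python. Everything here is hypothetical and runs in a single process: `run_flow`, `keep_large`, and `upper_case` are invented for illustration, and nothing is compiled to MapReduce jobs the way a real Cascading flow would be.

```python
# Hypothetical names throughout; this models the idea, not Cascading's classes.
def run_flow(source_tap, operations, sink_tap):
    """Read tuples from the source, apply each operation in order, write to the sink."""
    stream = iter(source_tap)
    for operation in operations:
        stream = operation(stream)
    for tup in stream:
        sink_tap.append(tup)
    return sink_tap


def keep_large(stream):
    """Example operation: drop tuples whose value is below 10."""
    return (t for t in stream if t[1] >= 10)


def upper_case(stream):
    """Example operation: upper-case the sentence, keep the value."""
    return ((sentence.upper(), value) for sentence, value in stream)


source = [("the man sat", 25), ("say hello", 1)]  # stands in for a source Tap
sink = run_flow(source, [keep_large, upper_case], [])  # the list stands in for a sink Tap
```

Each operation takes a stream of tuples and returns a new stream, which is what lets them be chained in any order.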
Example

[“sentence”, “value”]  ->  [“word”, “sum”]

      Get the sum of the values for each word
Example
[“sentence”, “value”]
      Split(“sentence”) -> “word”
[“word”, “value”]
      GroupBy(“word”)
[“word”, list<[“value”]>]
      Sum(“value”) -> “sum”
[“word”, “sum”]
Example
      Split(“sentence”) -> “word”

[“sentence”, “value”]     [“word”, “value”]
(“the man sat”,   25)     (“the”,   25)
                          (“man”,   25)
                          (“sat”,   25)
(“hello dolly”,   42)     (“hello”, 42)
                          (“dolly”, 42)
(“say hello”,      1)     (“say”,    1)
                          (“hello”,  1)
(“the woman sat”, 10)     (“the”,   10)
                          (“woman”, 10)
                          (“sat”,   10)
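
A minimal sketch of the Split step in plain Python, assuming words are separated by whitespace. `split_sentence` is a hypothetical helper, not a Cascading function; it emits one (word, value) tuple per word and copies the sentence's value onto each.

```python
def split_sentence(tuples):
    """Emit one (word, value) tuple per whitespace-separated word."""
    for sentence, value in tuples:
        for word in sentence.split():
            yield (word, value)


rows = [
    ("the man sat", 25),
    ("hello dolly", 42),
    ("say hello", 1),
    ("the woman sat", 10),
]
words = list(split_sentence(rows))
```

Four input tuples become ten output tuples, in the same order as the right-hand column of the slide.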
Example
      GroupBy(“word”)

[“word”, “value”]     [“word”, list<[“value”]>]
(“the”,   25)         (“the”,   [25, 10])
(“man”,   25)         (“man”,   [25])
(“sat”,   25)         (“sat”,   [25, 10])
(“hello”, 42)         (“hello”, [42, 1])
(“dolly”, 42)         (“dolly”, [42])
(“say”,    1)         (“say”,   [1])
(“hello”,  1)         (“woman”, [10])
(“the”,   10)
(“woman”, 10)
(“sat”,   10)
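
The GroupBy step can be sketched with a dictionary that collects values per word. This is an in-memory stand-in under the assumption that all tuples fit in one process; `group_by_word` is a hypothetical name, and on Hadoop the grouping would instead happen across machines.

```python
def group_by_word(tuples):
    """Collect the values of all tuples that share the same word."""
    groups = {}
    for word, value in tuples:
        groups.setdefault(word, []).append(value)
    return groups


words = [
    ("the", 25), ("man", 25), ("sat", 25), ("hello", 42), ("dolly", 42),
    ("say", 1), ("hello", 1), ("the", 10), ("woman", 10), ("sat", 10),
]
grouped = group_by_word(words)
```

Ten input tuples collapse into seven groups, one per distinct word.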
Example
      Sum(“value”) -> “sum”

[“word”, list<[“value”]>]     [“word”, “sum”]
(“the”,   [25, 10])           (“the”,   35)
(“man”,   [25])               (“man”,   25)
(“sat”,   [25, 10])           (“sat”,   35)
(“hello”, [42, 1])            (“hello”, 43)
(“dolly”, [42])               (“dolly”, 42)
(“say”,   [1])                (“say”,    1)
(“woman”, [10])               (“woman”, 10)
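
The final Sum step reduces each group's list of values to a single number. The sketch below is plain Python with a hypothetical `sum_values` helper and takes the grouped data from the slide as its input.

```python
def sum_values(grouped):
    """Turn each (word, [values]) group into a (word, sum) tuple."""
    return [(word, sum(values)) for word, values in grouped.items()]


grouped = {
    "the": [25, 10],
    "man": [25],
    "sat": [25, 10],
    "hello": [42, 1],
    "dolly": [42],
    "say": [1],
    "woman": [10],
}
sums = sum_values(grouped)
```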
More functionality

• Inner and outer joins natively supported
• Seamlessly branch and merge pipes of
  tuples
• Integrate diverse data sources
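
To make the difference between an inner and an outer join concrete, here is a small in-memory sketch. It is not how Cascading implements joins: `join` is a hypothetical helper that covers only the inner and left outer cases, and the `kinds` input is invented for the example.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) tuples on key; how is 'inner' or 'left'."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    out = []
    for key, left_value in left:
        matches = index.get(key)
        if matches:
            out.extend((key, left_value, right_value) for right_value in matches)
        elif how == "left":
            # Left outer join: keep unmatched left tuples, padded with None.
            out.append((key, left_value, None))
    return out


sums = [("the", 35), ("man", 25), ("dolly", 42)]
kinds = [("the", "article"), ("man", "noun")]  # hypothetical second input
inner = join(sums, kinds)
outer = join(sums, kinds, how="left")
```

The inner join drops “dolly” because it has no match in the second input, while the left outer join keeps it with an empty right-hand value.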
Why not Pig?

• Pig is a custom language for writing
  MapReduce workflows
• Because it’s a custom language, intermixing
  “plain logic” in between flows is painful
• Not nearly as flexible as Cascading for
  custom needs
Learn more


• Tutorial: http://blog.rapleaf.com/dev/?p=33
• Website: http://www.cascading.org
Questions?
