SlideShare une entreprise Scribd logo
1  sur  16
Cascading
 Nathan Marz
  BackType
What is Cascading?

Cascading is a Java library that makes development of
   complex Hadoop MapReduce workflows easy
Why Hadoop?


• Process large amounts of data in a scalable,
  fault-tolerant way
Why Cascading?
    Tool           How you feel


Hadoop MapReduce




  Cascading
Tuples
Cascading represents all data as “Tuples”

       (“the man sat” , 25)
       (“hello dolly”  , 42)
       (“say hello”    ,1 )
       (“the woman sat”, 10)
Tuples
Tuples are named, ordered fields

     [“sentence”, “value”]
     (“the man sat” , 25)
     (“hello dolly”  , 42)
     (“say hello”    ,1 )
     (“the woman sat”, 10)
Flow
  A flow is a sequence of manipulations on
           pipes of tuple streams
• Flow compiles to one or more MapReduce
  jobs
• Inputs and outputs called “Taps”.
• Each Tap produces or receives a pipe of
  tuples with the same format
• Multiple inputs, multiple outputs
Example

[“sentence”, “value”]         [“word”, “sum”]



      Get the sum of the values for each word
Example
  [“sentence”, “value”]
               Split(“sentence”) -> “word”
   [“word”, “value”]
               GroupBy(“word”)
[“word”, list<[“value”]>]
              Sum(“value”) -> “sum”

     [“word”, “sum”]
Example
             Split(“sentence”) -> “word”

[“sentence”, “value”]          [“word”, “value”]
                               (“the”   , 25)
(“the man sat” , 25)           (“man” , 25)
(“hello dolly”  , 42)          (“sat”    , 25)
(“say hello”    ,1 )           (“hello” , 42)
(“the woman sat”, 10)          (“dolly” , 42)
                               (“say”     ,1 )
                               (“hello” , 1 )
                               (“the”    , 10)
                               (“woman” , 10)
                               (“sat”     , 10)
Example
                   GroupBy(“word”)

[“word”, “value”]            [“word”, list<[“value”]>]
(“the”   , 25)
(“man” , 25)                  (“the”   , [25, 10])
(“sat”    , 25)               (“man” , [25]       )
(“hello” , 42)                (“sat”    , [25, 10])
(“dolly” , 42)                (“hello” , [42, 1] )
(“say”     ,1 )               (“dolly” , [42]      )
(“hello” , 1 )                (“say”     , [1]    )
(“the”    , 10)               (“woman” , [10]     )
(“woman” , 10)
(“sat”     , 10)
Example
                Sum(“value”) -> “sum”

[“word”, list<[“value”]>]        [“word”, “sum”]

(“the”   , [25, 10])          (“the”   , 35)
(“man” , [25]       )         (“man” , 25)
(“sat”    , [25, 10])         (“sat”    , 35)
(“hello” , [42, 1] )          (“hello” , 43)
(“dolly” , [42]      )        (“dolly” , 42)
(“say”     , [1]    )         (“say”     ,1 )
(“woman” , [10]     )         (“woman” , 10)
More functionality

• Inner and outer joins natively supported
• Seamlessly branch and merge pipes of
  tuples
• Integrate diverse data sources
Why not Pig?

• Pig is a custom language for writing
  MapReduce workflows
• Because it’s a custom language, intermixing
  “plain logic” in between flows is painful
• Not nearly as flexible as Cascading for
  custom needs
Learn more


• Tutorial: http://blog.rapleaf.com/dev/?p=33
• Website: http://www.cascading.org
Questions?

Contenu connexe

En vedette

Animales en peligro de extincion
Animales en peligro de extincionAnimales en peligro de extincion
Animales en peligro de extincion
losdonkey
 
Ahead Week 1 Key Slides
Ahead Week 1 Key SlidesAhead Week 1 Key Slides
Ahead Week 1 Key Slides
altonbaird
 
02 epidemio enf reum
02 epidemio enf reum02 epidemio enf reum
02 epidemio enf reum
iloaeza_89
 
Wakefield customer insight project
Wakefield customer insight projectWakefield customer insight project
Wakefield customer insight project
localinsight
 
Setting up Your LinkedIn Account
Setting up Your LinkedIn AccountSetting up Your LinkedIn Account
Setting up Your LinkedIn Account
NET:101
 

En vedette (17)

Lab safety 12_10_13
Lab safety 12_10_13Lab safety 12_10_13
Lab safety 12_10_13
 
Animales en peligro de extincion
Animales en peligro de extincionAnimales en peligro de extincion
Animales en peligro de extincion
 
I love free_nsta2010
I love free_nsta2010I love free_nsta2010
I love free_nsta2010
 
Periodismo chiquinquireño
Periodismo chiquinquireñoPeriodismo chiquinquireño
Periodismo chiquinquireño
 
Ahead Week 1 Key Slides
Ahead Week 1 Key SlidesAhead Week 1 Key Slides
Ahead Week 1 Key Slides
 
Chistesvarios8
Chistesvarios8Chistesvarios8
Chistesvarios8
 
A replication study of the top performing systems in SemEval twitter sentimen...
A replication study of the top performing systems in SemEval twitter sentimen...A replication study of the top performing systems in SemEval twitter sentimen...
A replication study of the top performing systems in SemEval twitter sentimen...
 
Social media ROI
Social media ROISocial media ROI
Social media ROI
 
02 epidemio enf reum
02 epidemio enf reum02 epidemio enf reum
02 epidemio enf reum
 
Wakefield customer insight project
Wakefield customer insight projectWakefield customer insight project
Wakefield customer insight project
 
PNUTS
PNUTSPNUTS
PNUTS
 
certificate
certificatecertificate
certificate
 
Setting up Your LinkedIn Account
Setting up Your LinkedIn AccountSetting up Your LinkedIn Account
Setting up Your LinkedIn Account
 
Power tecnologia
Power tecnologiaPower tecnologia
Power tecnologia
 
Aprendiendo sobre las emociones de los pacientes mediante obras artísticas
Aprendiendo sobre las emociones de los pacientes mediante obras artísticasAprendiendo sobre las emociones de los pacientes mediante obras artísticas
Aprendiendo sobre las emociones de los pacientes mediante obras artísticas
 
Dr. Bart Cammaerts - The Mediation of Dissensus
Dr. Bart Cammaerts - The Mediation of DissensusDr. Bart Cammaerts - The Mediation of Dissensus
Dr. Bart Cammaerts - The Mediation of Dissensus
 
Presentasi moment
Presentasi momentPresentasi moment
Presentasi moment
 

Plus de nathanmarz

Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
nathanmarz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
nathanmarz
 

Plus de nathanmarz (17)

Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineering
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrong
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
 
Storm
StormStorm
Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
ElephantDB
ElephantDBElephantDB
ElephantDB
 
Become Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackTypeBecome Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackType
 
The Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data SystemsThe Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data Systems
 
Clojure at BackType
Clojure at BackTypeClojure at BackType
Clojure at BackType
 
Cascalog workshop
Cascalog workshopCascalog workshop
Cascalog workshop
 
Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loop
 
Cascalog at Hadoop Day
Cascalog at Hadoop DayCascalog at Hadoop Day
Cascalog at Hadoop Day
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Cascalog
CascalogCascalog
Cascalog
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Cascading

  • 2. What is Cascading? Cascading is a Java library that makes development of complex Hadoop MapReduce workflows easy
  • 3. Why Hadoop? • Process large amounts of data in a scalable, fault-tolerant way
  • 4. Why Cascading? Tool How you feel Hadoop MapReduce Cascading
  • 5. Tuples Cascading represents all data as “Tuples” (“the man sat” , 25) (“hello dolly” , 42) (“say hello” ,1 ) (“the woman sat”, 10)
  • 6. Tuples Tuples are named, ordered fields [“sentence”, “value”] (“the man sat” , 25) (“hello dolly” , 42) (“say hello” ,1 ) (“the woman sat”, 10)
  • 7. Flow A flow is a sequence of manipulations on pipes of tuple streams • Flow compiles to one or more MapReduce jobs • Inputs and outputs called “Taps”. • Each Tap produces or receives a pipe of tuples with the same format • Multiple inputs, multiple outputs
  • 8. Example [“sentence”, “value”] [“word”, “sum”] Get the sum of the values for each word
  • 9. Example [“sentence”, “value”] Split(“sentence”) -> “word” [“word”, “value”] GroupBy(“word”) [“word”, list<[“value”]>] Sum(“value”) -> “sum” [“word”, “sum”]
  • 10. Example Split(“sentence”) -> “word” [“sentence”, “value”] [“word”, “value”] (“the” , 25) (“the man sat” , 25) (“man” , 25) (“hello dolly” , 42) (“sat” , 25) (“say hello” ,1 ) (“hello” , 42) (“the woman sat”, 10) (“dolly” , 42) (“say” ,1 ) (“hello” , 1 ) (“the” , 10) (“woman” , 10) (“sat” , 10)
  • 11. Example GroupBy(“word”) [“word”, “value”] [“word”, list<[“value”]>] (“the” , 25) (“man” , 25) (“the” , [25, 10]) (“sat” , 25) (“man” , [25] ) (“hello” , 42) (“sat” , [25, 10]) (“dolly” , 42) (“hello” , [42, 1] ) (“say” ,1 ) (“dolly” , [42] ) (“hello” , 1 ) (“say” , [1] ) (“the” , 10) (“woman” , [10] ) (“woman” , 10) (“sat” , 10)
  • 12. Example Sum(“value”) -> “sum” [“word”, list<[“value”]>] [“word”, “sum”] (“the” , [25, 10]) (“the” , 35) (“man” , [25] ) (“man” , 25) (“sat” , [25, 10]) (“sat” , 35) (“hello” , [42, 1] ) (“hello” , 43) (“dolly” , [42] ) (“dolly” , 42) (“say” , [1] ) (“say” ,1 ) (“woman” , [10] ) (“woman” , 10)
  • 13. More functionality • Inner and outer joins natively supported • Seamlessly branch and merge pipes of tuples • Integrate diverse data sources
  • 14. Why not Pig? • Pig is a custom language for writing MapReduce workflows • Because it’s a custom language, intermixing “plain logic” in between flows is painful • Not nearly as flexible as Cascading for custom needs
  • 15. Learn more • Tutorial: http://blog.rapleaf.com/dev/?p=33 • Website: http://www.cascading.org

Notes de l'éditeur