The Next Generation of Data Processing & Open Source
James Malone, Google Product Manager, Apache Beam PPMC
Eric Schmidt, Google Developer Relations
Agenda
1. The Last Generation - Common historical challenges in large-scale data processing
2. The Next Generation - How large-scale data processing should work
3. Apache Beam - A solution for next generation data processing
4. Why Beam matters - A gaming example to show the power of the Beam model
5. Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds
6. Things to Remember - Recap and how you can get involved
01 The Last Generation
Common historical challenges in large-scale data processing
Setting up infrastructure
The workflow: decide on a tool, read docs, get infrastructure, set up tools, tune tools, get specialists, productionize. You start out optimistic and end up frustrated.
Programming models
A batch use case goes through a batch model and batch engine to produce batch output; a streaming use case goes through a separate streaming model and streaming engine to produce streaming output; the two outputs then have to be joined. You start out optimistic and end up frustrated.
Data pipeline portability
Each data model and data pipeline is tied to a single execution engine, so moving to a different engine means rebuilding the pipeline. You start out happy and end up frustrated.
Infrastructure is a pain
Models are disconnected
Pipelines are not portable
02 The Next Generation
How data processing should work
Infrastructure goes from a pain to an afterthought
Models go from disconnected to unified
Pipelines go from not portable to portable
Setting up infrastructure
The workflow: skim docs, decide on a product, start the service. You start out optimistic and end up happy.
A flexible (unified) model
Batch and streaming use cases share a single unified model, run on one or more runners, and produce one output. You start out optimistic and end up happy.
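As a minimal sketch of what this unified model looks like in the Beam Java SDK (API names follow a recent Beam release rather than the 2016 incubating one; the class name and file path are made up for illustration), the transforms are written once and a bounded or unbounded source can be plugged in front of them:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class UnifiedModelSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded (batch) input; an unbounded source such as PubsubIO or KafkaIO
    // could be swapped in here and the transform below would stay the same.
    PCollection<String> events =
        p.apply(TextIO.read().from("gs://example-bucket/events-*.txt"));

    // Same model and same transform for both the batch and streaming use cases.
    PCollection<KV<String, Long>> counts = events.apply(Count.perElement());

    p.run();
  }
}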
Portable data pipelines
A single data model and data pipeline can run on any of several execution engines. You start out happy and end up happier.
Why does this matter?
More time can be dedicated to examining data for actionable insights, and less time is spent wrangling the code, infrastructure, and tools used to process data. The balance shifts away from cloud setup and customization toward hands-on work with data.
03 Apache Beam (incubating)
A solution for next generation data processing
What is Apache Beam?
1. The unified (stream + batch) Beam programming model (formerly the Dataflow model)
2. Java and Python SDKs
3. Runners for existing distributed processing backends
   a. Apache Flink (thanks to dataArtisans)
   b. Apache Spark (thanks to Cloudera & PayPal)
   c. Google Cloud Dataflow (fast, no-ops)
   d. Local (in-process) runner for testing
+ Future runners for Beam - Apache Gearpump, Apache Apex, MapReduce, and others!
The Apache Beam vision
1. End users: who want to write pipelines
in a language that’s familiar.
2. SDK writers: who want to make Beam
concepts available in new languages.
3. Runner writers: who have a distributed
processing environment and want to
support Beam pipelines.
(Diagram) Beam Java, Beam Python, and other language SDKs sit on top of the Beam Model for pipeline construction; Apache Flink, Apache Spark, and Google Cloud Dataflow handle execution through the Beam Model Fn Runners layer.
Joining several threads into Beam
Google's internal technologies (MapReduce, Colossus, BigTable, Dremel, Megastore, Flume, Spanner, PubSub, Millwheel) and the Cloud Dataflow and Cloud Dataproc products are the threads that lead into Apache Beam.
Creating an Apache Beam community
Collaborate - Beam is becoming a community-driven
effort with participation from many organizations and
contributors
Grow - We want to grow the Beam ecosystem and
community with active, open involvement so Beam is a
part of the larger OSS ecosystem
Learn - We (Google) are also learning a lot as this is
our first data-related Apache contribution ;-)
Apache Beam Roadmap
02/01/2016 - Enter Apache Incubator
Early 2016 - Design for use cases, begin refactoring
02/25/2016 - 1st commit to ASF repository
Mid 2016 - Additional refactoring, non-production uses
06/14/2016 - 1st incubating release
June 2016 - Python SDK moves to Beam
Late 2016 - Multiple runners execute Beam pipelines
End 2016 - Beam pipelines run on many runners in production uses
04 Why Beam Matters
An example to show the power of the Beam model
Apache Beam - A next generation model
Improved abstractions let you focus on your business logic.
Batch and stream processing are both first-class citizens -- no need to choose.
Clearly separates event time from processing time.
Processing time vs. event time
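To make the distinction concrete, here is a hedged Java sketch (rawEvents, the log format, and the field order are assumptions for illustration, not from the talk; the usual Beam SDK imports are omitted): event time travels with the data and is attached to each element, while processing time is simply whatever moment the pipeline happens to observe the element.

// Attach event time taken from the record itself; processing time is the
// wall-clock moment the runner sees the element. Assumed log format:
// "user,score,eventTimeMillis".
PCollection<String> timestamped = rawEvents.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        long eventTimeMillis = Long.parseLong(c.element().split(",")[2]);
        c.outputWithTimestamp(c.element(), new org.joda.time.Instant(eventTimeMillis));
      }
    }));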
Beam model - asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam model - what is being computed?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam model - what is being computed?
The Beam model - where in event time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply(Sum.integersPerKey());
The Beam model - where in event time?
The Beam model - when in processing time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AtWatermark()))
.apply(Sum.integersPerKey());
The Beam model - when in processing time?
The Beam model - how do refinements relate?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AtWatermark()
        .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
        .withLateFirings(AtCount(1)))
    .accumulatingFiredPanes())
.apply(Sum.integersPerKey());
The Beam model - how do refinements relate?
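The snippet above uses the deck's shorthand trigger names. As a hedged sketch against the actual Beam Java windowing API (imports from org.apache.beam.sdk.transforms.windowing assumed), here is the same pipeline switched to discarding mode, where each firing reports only the elements that arrived since the previous pane instead of a running total; the 30-minute allowed lateness is an illustrative value the slides leave out.

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))  // illustrative value
        .discardingFiredPanes())                            // deltas instead of running totals
    .apply(Sum.integersPerKey());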
Customizing what where when how
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Apache Beam - the ecosystem
http://beam.incubator.apache.org/capability-matrix
05 Demo
Let's run a Beam pipeline on 3 engines in 2 separate locations
What we just did
Created 1 Beam pipeline.
Ran that one pipeline on three execution engines in two places:
● Google Cloud Platform
  ○ Google Cloud Dataflow
  ○ Apache Spark on Google Cloud Dataproc
● Local
  ○ Apache Beam local runner
  ○ Apache Flink
100% portability, 0 problems.
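As a hedged sketch of how that portability shows up in the Java SDK (assuming the usual Beam imports; exact runner names vary across Beam releases), the pipeline code never names an engine, and the runner is picked at launch time through PipelineOptions:

// The pipeline itself is engine-agnostic; the runner is selected at launch,
// e.g. --runner=DirectRunner, --runner=FlinkRunner, --runner=SparkRunner, or
// --runner=DataflowRunner (exact names depend on the Beam release).
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
// ...apply the same transforms shown earlier...
p.run();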
06 Things to remember
Recap and how you can get involved
Apache Beam is designed to provide portable pipelines with a unified programming model
Get involved with Apache Beam
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Join the Apache Beam Slack channel
https://apachebeam.slack.com
Follow @ApacheBeam on Twitter
A special thank you
A special thank you to Frances Perry and Tyler Akidau for sharing Apache
Beam content which was used in this presentation.
Thank you
