Welcome
Chicago Data Engineering Meetup
- Our First Event – November 2018
- Objectives
- Every 2 months
- Format
- sharing experiences (open for volunteers)
- new tools / demos
- Open for suggestions
Agenda
01 Who I am
02 QuantumBlack
03 Today’s topic: Spark UDF Performance
04 Background
05 Benchmarking – Live demo
06 Conclusion and Our Approach
07 Q&A
Who I am
01
All content copyright © 2017 QuantumBlack, a McKinsey company
Client case studies
Experience across several industry sectors,
including telecoms, retail, financial services and
pharmaceuticals.
Financial sector – Advanced Analytics
projects for Fraud detection in Internet Banking
and Credit Risk Modelling.
Telecommunications – Petabyte scale
environment, delivering several use cases,
including: real-time failure detection using CDR
data, customer profiling and marketing
campaigns.
Manufacturing – data wrangling in a failure
detection project for computer parts
manufacturing in Europe.
Pharmaceuticals – Site selection optimisation
for a top pharma player.
Telematics (Car insurance) – machine learning
model that estimates the probability of crashing
for each driver, based on data obtained from
on-board unit boxes installed in cars, containing
geo-location positions, speed and acceleration
of ~2 million drivers over a 2-year period.
Complex feature creation at terabyte scale,
using external data sources such as weather,
street and traffic data.
Education
Guilherme has a BSc in Data Processing from
Mackenzie University and specialisations in
Machine Learning and Business Intelligence.
Role
Big Data technology expert based in Chicago.
Works with clients to translate business
hypotheses into data requirements and
technology solutions.
Expertise
Provides technical data engineering oversight
on projects and advises other data engineers
on architecture definition and performance
optimization for large-scale data wrangling.
Professional experience
Prior to joining QuantumBlack, Guilherme
specialised for over 18 years in Data
Warehouse and Business Intelligence projects
in large-scale environments. More recently, he has
6 years' experience in Big Data projects and
architecture, many of them at petabyte scale, as
well as real-time projects.
Previously led big data projects at Hortonworks,
SAP and large financial institutions.
BIOGRAPHY
Guilherme Braccialli
Principal Data Engineer, QuantumBlack,
Chicago
QuantumBlack
02
QB exploit data, analytics and
design to help our clients be the
best they can be
We were born and proven in
Formula One, where the smallest
margins are the difference
between winning and losing and
data has emerged as a
fundamental element of
competitive advantage
QuantumBlack
In elite sport the
smallest edge makes
the difference,
and the best teams
exploit this to outlearn
their rivals
Since then, we have applied our proven
methodology across multiple sectors
• Advanced Industries
‒ Aerospace
‒ Automotive
‒ Semi-Conductors
‒ Urban Infrastructure
• Financial Services
‒ Asset Management
‒ Payment Networks
‒ Private Banking
‒ Retail Banking
• Health & Wellbeing
‒ Hospitals
‒ Medical Devices
‒ Pharmaceutical
• Natural Resources
‒ Oil & Gas
‒ Mining
‒ Renewable Energy
‒ Utilities
• Sports
‒ Basketball
‒ Baseball
‒ Formula One
‒ Soccer
Spark UDF Performance
03
- Share our learnings
- Running Spark at scale
- Practical Examples
- Live demo (code)
Background
04
Why we use Spark
BACKGROUND
• Open Source
‒ We are a consulting company; we enable our clients to use Advanced Analytics
‒ We don’t sell an out-of-the-box solution or licensing
‒ Clients can run it anywhere, because we use open source tools
• Scalable
‒ We deal with big data volumes
‒ Multiple TBs of data
‒ Spark has several options for running in distributed mode (Hadoop, Kubernetes, Standalone)
• Flexibility and Integration
‒ Supports multiple languages: Python, SQL, Scala, Java, R
‒ Batch, Streaming, Graph, Machine Learning
‒ Easy to integrate with Data Scientist code in a single data pipeline
Where we run
BACKGROUND
• In the Cloud
‒ AWS (EMR)
‒ Azure (HDInsight)
‒ Google Cloud (Dataproc)
‒ Databricks (AWS or Azure)
• On-premises
‒ Some clients have their own internal Hadoop cluster on premises
Why PySpark / Performance implications
BACKGROUND
• PySpark is the best choice for integrating Data Engineering and Data Science work in a single data pipeline
• Same performance for DataFrame operations (PySpark is a wrapper that runs native Scala code)
• Performance hit when we use UDFs (execution goes Scala → Python → Scala)
• Pandas UDFs (Vectorized UDFs) + Arrow
‒ Announced Oct/2017 (see the posts below), shipped in Spark 2.3
https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
‒ but… where are the Scala numbers?
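The two Python UDF flavours on this slide can be sketched as follows. This is a minimal illustration, not the code from the live demo: the names `plus_one`, `plus_one_vectorized` and `build_udfs` are made up for this example, and `build_udfs` assumes pyspark, pandas and pyarrow are installed. It uses the Spark 2.3-era `PandasUDFType.SCALAR` style; newer Spark versions prefer Python type hints.

```python
def plus_one(x):
    """Row-at-a-time logic: Spark calls this once per row with a plain Python value."""
    return None if x is None else x + 1.0


def plus_one_vectorized(values):
    """Batch logic: Spark calls this once per Arrow batch with a pandas.Series."""
    return values + 1.0


def build_udfs():
    """Wrap both functions as Spark UDFs (hypothetical helper for this sketch)."""
    # Imported lazily so the plain functions above can be used without Spark.
    from pyspark.sql.functions import PandasUDFType, pandas_udf, udf
    from pyspark.sql.types import DoubleType

    # Regular UDF: every row is serialized to a Python worker and back.
    row_udf = udf(plus_one, DoubleType())
    # Pandas UDF: rows travel in Arrow batches, so Python is called far less often.
    vectorized_udf = pandas_udf(plus_one_vectorized, DoubleType(), PandasUDFType.SCALAR)
    return row_udf, vectorized_udf
```

With a DataFrame `df` that has a numeric column `v`, `df.withColumn("out", row_udf(df.v))` and `df.withColumn("out", vectorized_udf(df.v))` should produce the same values. In both cases the data still leaves the JVM, which is why the Scala comparison matters.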
Benchmarking – Live Demo
05
Databricks Notebook (try it on the Community Edition)
LIVE DEMO
https://bit.ly/2E4ehIm
Conclusion and Our Approach
06
Best of both worlds: PySpark with Scala performance
CONCLUSION AND OUR APPROACH
• Conclusion
‒ PySpark Pandas (Vectorized) UDFs can be faster than regular PySpark UDFs, but not ALWAYS
‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs
• Our Approach
‒ We use PySpark UDFs when the data volume is small, or for quick insights on sample data
‒ Built an internal library with re-usable Scala UDFs
‒ Created Python wrappers to call Scala UDFs
‒ Demo
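The approach above (Scala UDFs behind Python wrappers) can be sketched roughly like this. The internal library is not public, so everything named here is hypothetical: the class `com.example.udfs.PlusOne`, the SQL name `scala_plus_one` and the helper functions. The sketch assumes the Scala class implements Spark's Java `UDF1` interface and that its jar is on the cluster classpath; `spark.udf.registerJavaFunction` is the PySpark call that exposes such a class as a SQL function.

```python
def register_scala_udf(spark, name, class_name, return_type=None):
    """Register a JVM UDF class under a SQL function name and return that name.

    `class_name` is assumed to implement one of Spark's Java UDF interfaces
    (for example org.apache.spark.sql.api.java.UDF1).
    """
    spark.udf.registerJavaFunction(name, class_name, return_type)
    return name


def call_expression(name, *column_names):
    """Build the SQL text that invokes a registered UDF on the given columns."""
    return "{}({})".format(name, ", ".join(column_names))


def scala_plus_one(column_name):
    """Python-facing wrapper (hypothetical): returns a Column evaluated on the JVM."""
    # Imported lazily so the string helper above can be used without Spark.
    from pyspark.sql.functions import expr

    return expr(call_expression("scala_plus_one", column_name))
```

After `register_scala_udf(spark, "scala_plus_one", "com.example.udfs.PlusOne", "double")`, a call such as `df.withColumn("out", scala_plus_one("v"))` reads like ordinary PySpark, but the function body runs as Scala code inside the executor JVM, with no round trip to Python workers.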
Q&A
07
Thank you!
- Would you like to share your
experiences at upcoming events?
and…
- We are hiring!!!

Meetup Spark UDF performance

  • 1. Welcome Chicago Data Engineering Meetup - Our First Event – November 2018 - Objectives - Every 2 months - Format - sharing experiences (open for volunteers) - new tools / demos - Open for suggestions
  • 2. Agenda
    01 Who I am
    02 QuantumBlack
    03 Today’s topic: Spark UDF Performance
    04 Background
    05 Benchmarking – Live demo
    06 Conclusion and Our Approach
    07 Q&A
  • 4. BIOGRAPHY: Guilherme Braccialli, Principal Data Engineer, QuantumBlack, Chicago
    All content copyright © 2017 QuantumBlack, a McKinsey company
    • Client case studies
      Experience across several industry sectors, including telecoms, retail, financial services and pharmaceuticals.
      ‒ Financial sector: Advanced Analytics projects for fraud detection in Internet banking and credit risk modelling.
      ‒ Telecommunications: petabyte-scale environment, delivering several use cases, including real-time failure detection using CDR data, customer profiling and marketing campaigns.
      ‒ Manufacturing: data wrangling in a failure detection project for computer parts manufacturing in Europe.
      ‒ Pharmaceuticals: site selection optimisation for a top pharma player.
      ‒ Telematics (car insurance): machine learning model that estimates the probability of crashing for each driver, based on data obtained from on-board unit boxes installed in cars, containing geo-location positions, speed and acceleration of ~2 million drivers over a 2-year period. Complex feature creation at terabyte scale using external data sources such as weather, street and traffic data.
    • Education
      Guilherme has a BSc in Data Processing from Mackenzie University and specialisations in Machine Learning and Business Intelligence.
    • Role
      Big Data technology expert based in Chicago. Works with clients to translate business hypotheses into data requirements and technology solutions.
    • Expertise
      Provides technical data engineering oversight on projects and advises other data engineers on architecture definition and performance optimization for large-scale data wrangling.
    • Professional experience
      Prior to joining QuantumBlack, Guilherme specialised for over 18 years in Data Warehouse and Business Intelligence projects in large-scale environments. More recently, 6 years of experience in Big Data projects and architecture, many of them at petabyte scale, as well as real-time projects. Previously led big data projects at Hortonworks, SAP and large financial institutions.
  • 6. QuantumBlack: QB exploit data, analytics and design to help our clients be the best they can be. We were born and proven in Formula One, where the smallest margins are the difference between winning and losing, and data has emerged as a fundamental element of competitive advantage.
  • 7. In elite sport the smallest edge makes the difference, and the best teams exploit this to outlearn their rivals
  • 8. Since then, we have applied our proven methodology across multiple sectors
    • Advanced Industries ‒ Aerospace, Automotive, Semi-Conductors, Urban Infrastructure
    • Financial Services ‒ Asset Management, Payment Networks, Private Banking, Retail Banking
    • Health & Wellbeing ‒ Hospitals, Medical Devices, Pharmaceutical
    • Natural Resources ‒ Oil & Gas, Mining, Renewable Energy, Utilities
    • Sports ‒ Basketball, Baseball, Formula One, Soccer
  • 9. 03 Spark UDF Performance
    - Share our learnings
    - Running Spark at scale
    - Practical examples
    - Live demo (code)
  • 11. Why we use Spark (BACKGROUND)
    • Open Source
      ‒ We are a consulting company; we enable our clients to use Advanced Analytics
      ‒ We don’t sell an out-of-the-box solution or licensing
      ‒ Clients can run it anywhere, because we use open source tools
    • Scalable
      ‒ We deal with big data volumes
      ‒ Multiple TBs of data
      ‒ Spark has several options for running in distributed mode (Hadoop, Kubernetes, Standalone)
    • Flexibility and Integration
      ‒ Supports multiple languages: Python, SQL, Scala, Java, R
      ‒ Batch, Streaming, Graph, Machine Learning
      ‒ Easy to integrate with Data Scientist code in a single data pipeline
  • 12. Where we run (BACKGROUND)
    • In the Cloud
      ‒ AWS (EMR)
      ‒ Azure (HDInsight)
      ‒ Google Cloud (Dataproc)
      ‒ Databricks (AWS or Azure)
    • On-premises
      ‒ Some clients have their own internal Hadoop cluster on premises
  • 13. Why PySpark / Performance implications (BACKGROUND)
    • PySpark is the best choice for integrating Data Engineering and Data Scientist work in one data pipeline
    • Same performance for DataFrame operations (PySpark is a wrapper that runs native Scala code)
    • Performance hit when we use UDFs (execution goes Scala → Python → Scala)
    • Pandas UDFs (Vectorized UDFs) + Arrow
      ‒ Nov/2017 – Spark 2.3
        https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/
        https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
      ‒ but… where are the Scala numbers?
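The two Python UDF flavours named on this slide can be sketched as below. This is a minimal illustration under stated assumptions, not the code from the live demo: `plus_one` and `build_udfs` are hypothetical names, the `double` return type is chosen only for the example, and PySpark is imported inside `build_udfs` so that the plain function can be checked without a Spark installation.

```python
def plus_one(x):
    """Add one to a single number or, element-wise, to a pandas Series."""
    return x + 1


def build_udfs():
    """Wrap plus_one as a row-at-a-time UDF and as a vectorized (pandas) UDF.

    Assumes PySpark 2.3+ with pandas and PyArrow available on the workers.
    """
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import pandas_udf, udf

    # Row-at-a-time: the function is called once per value, and every value
    # is serialized between the JVM and a Python worker.
    row_udf = udf(plus_one, "double")

    # Vectorized: the function is called once per Arrow batch and receives
    # a pandas Series, so the per-row call overhead goes away.
    vectorized_udf = pandas_udf(plus_one, "double")
    return row_udf, vectorized_udf
```

Both wrappers could then be applied to a double column, for example `df.select(row_udf("v"), vectorized_udf("v"))`, and timed against each other. Either way the data still leaves the JVM, which is why neither variant is expected to match a Scala UDF.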
  • 15. Databricks Notebook – try it on the Community Edition (LIVE DEMO)
    https://bit.ly/2E4ehIm
  • 16. Conclusion and Our Approach 06
  • 17. Best of both worlds: PySpark with Scala performance (CONCLUSION AND OUR APPROACH)
    • Conclusion
      ‒ PySpark Pandas UDFs (Vectorized UDFs) can be faster than plain PySpark UDFs, but not ALWAYS
      ‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs
    • Our Approach
      ‒ We use PySpark UDFs when the data volume is not big, or for quick insights on sample data
      ‒ Built an internal library of reusable Scala UDFs
      ‒ Created Python wrappers to call the Scala UDFs
      ‒ Demo
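One way a Python wrapper around a Scala UDF might look is sketched below. The internal library is not shown in the slides, so everything here is an assumption: the helper names are hypothetical, and the approach uses PySpark's public `spark.udf.registerJavaFunction` call, which expects a JVM class implementing one of Spark's `org.apache.spark.sql.api.java.UDF1` to `UDF22` interfaces to be on the Spark classpath (for instance via `--jars`).

```python
def call_expression(udf_name, *column_names):
    """Build the SQL text that invokes a registered UDF on the given columns."""
    # Backtick-quote column names; embedded backticks are escaped by doubling.
    quoted = ["`{}`".format(name.replace("`", "``")) for name in column_names]
    return "{}({})".format(udf_name, ", ".join(quoted))


def register_scala_udf(spark, udf_name, class_name, return_type=None):
    """Register a compiled JVM UDF class under a SQL function name.

    class_name is HYPOTHETICAL here (e.g. "com.example.udfs.PlusOne"); the
    class must implement one of Spark's Java UDF interfaces.
    """
    spark.udf.registerJavaFunction(udf_name, class_name, return_type)


def scala_udf(udf_name, *column_names):
    """Return a Column that calls the registered UDF inside the JVM."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import expr

    return expr(call_expression(udf_name, *column_names))
```

After `register_scala_udf(spark, "plus_one", "com.example.udfs.PlusOne", "double")`, a call such as `df.select(scala_udf("plus_one", "v"))` stays in Python on the driver side only. The rows themselves are processed by the JVM, so no data is shipped to Python workers.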
  • 19. Thank you!
    - Would you like to share your experiences at upcoming events?
    and…
    - We are hiring!!!