Submit Search
Upload
Mule soft mar 2017 Parquet Arrow
•
Download as PPTX, PDF
•
1 like
•
479 views
Julien Le Dem
Follow
The future of column-oriented data processing with Arrow and Parquet
Read less
Read more
Software
Slideshow view
Report
Share
Slideshow view
Report
Share
1 of 20
Download now
Recommended
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format, stop now and use Parquet 😛
Julien Le Dem
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
Julien Le Dem
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
Julien Le Dem
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Julien Le Dem
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
Julien Le Dem
Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)
Camuel Gilyadov
Recommended
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format, stop now and use Parquet 😛
Julien Le Dem
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
Julien Le Dem
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
Julien Le Dem
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Julien Le Dem
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
Julien Le Dem
Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)
Camuel Gilyadov
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017
techmaddy
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
Julien Le Dem
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
John Mulhall
From flat files to deconstructed database
From flat files to deconstructed database
Julien Le Dem
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
John Mulhall
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
Uwe Korn
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
DataWorks Summit/Hadoop Summit
NoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
HUG France - Apache Drill
HUG France - Apache Drill
MapR Technologies
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
DataWorks Summit/Hadoop Summit
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
DataWorks Summit/Hadoop Summit
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
The HDF-EOS Tools and Information Center
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
DataWorks Summit
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
DataWorks Summit
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Esther Vasiete
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
More Related Content
What's hot
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017
techmaddy
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
Julien Le Dem
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
John Mulhall
From flat files to deconstructed database
From flat files to deconstructed database
Julien Le Dem
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
John Mulhall
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
Uwe Korn
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
DataWorks Summit/Hadoop Summit
NoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
HUG France - Apache Drill
HUG France - Apache Drill
MapR Technologies
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
DataWorks Summit/Hadoop Summit
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
DataWorks Summit/Hadoop Summit
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
The HDF-EOS Tools and Information Center
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
DataWorks Summit
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
DataWorks Summit
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Esther Vasiete
What's hot
(20)
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
From flat files to deconstructed database
From flat files to deconstructed database
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
NoSQL Needs SomeSQL
NoSQL Needs SomeSQL
HUG France - Apache Drill
HUG France - Apache Drill
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Similar to Mule soft mar 2017 Parquet Arrow
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
Dremio Corporation
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Databricks
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
DataWorks Summit
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
BPM and SOA Are Going Mobile: An Architectural Perspective
BPM and SOA Are Going Mobile: An Architectural Perspective
Guido Schmutz
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
Evolve The Adobe Digital Marketing Community
Apache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Precisely
Mastering Cloud Data Cost Control: A FinOps Approach
Mastering Cloud Data Cost Control: A FinOps Approach
Denodo
Lunch and Learn ANZ: Mastering Cloud Data Cost Control: A FinOps Approach
Lunch and Learn ANZ: Mastering Cloud Data Cost Control: A FinOps Approach
Denodo
DataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney
Graphical DSL with Sirius: how to simplify the creation of custom modeling tools
Graphical DSL with Sirius: how to simplify the creation of custom modeling tools
Etienne Juliot
Why Select 3DPrinterOS as your answer for 3D Printer Deployment
Why Select 3DPrinterOS as your answer for 3D Printer Deployment
Laris Orman
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
Amazon Web Services
Is OLAP Dead?: Can Next Gen Tools Take Over?
Is OLAP Dead?: Can Next Gen Tools Take Over?
Senturus
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
Amazon Web Services
Cincom Smalltalk 2017 Roadmap
Cincom Smalltalk 2017 Roadmap
ESUG
IBM DS8880 and IBM Z - Integrated by Design
IBM DS8880 and IBM Z - Integrated by Design
Stefan Lein
Similar to Mule soft mar 2017 Parquet Arrow
(20)
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
BPM and SOA Are Going Mobile: An Architectural Perspective
BPM and SOA Are Going Mobile: An Architectural Perspective
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
Apache Arrow - An Overview
Apache Arrow - An Overview
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Mastering Cloud Data Cost Control: A FinOps Approach
Mastering Cloud Data Cost Control: A FinOps Approach
Lunch and Learn ANZ: Mastering Cloud Data Cost Control: A FinOps Approach
Lunch and Learn ANZ: Mastering Cloud Data Cost Control: A FinOps Approach
DataFrames: The Extended Cut
DataFrames: The Extended Cut
Graphical DSL with Sirius: how to simplify the creation of custom modeling tools
Graphical DSL with Sirius: how to simplify the creation of custom modeling tools
Why Select 3DPrinterOS as your answer for 3D Printer Deployment
Why Select 3DPrinterOS as your answer for 3D Printer Deployment
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
Is OLAP Dead?: Can Next Gen Tools Take Over?
Is OLAP Dead?: Can Next Gen Tools Take Over?
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
Cincom Smalltalk 2017 Roadmap
Cincom Smalltalk 2017 Roadmap
IBM DS8880 and IBM Z - Integrated by Design
IBM DS8880 and IBM Z - Integrated by Design
More from Julien Le Dem
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
Julien Le Dem
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
Julien Le Dem
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
Julien Le Dem
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
Julien Le Dem
Sql on everything with drill
Sql on everything with drill
Julien Le Dem
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
Parquet Twitter Seattle open house
Parquet Twitter Seattle open house
Julien Le Dem
Parquet overview
Parquet overview
Julien Le Dem
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
Julien Le Dem
Embedding Pig in scripting languages
Embedding Pig in scripting languages
Julien Le Dem
More from Julien Le Dem
(14)
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
Sql on everything with drill
Sql on everything with drill
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Parquet Twitter Seattle open house
Parquet Twitter Seattle open house
Parquet overview
Parquet overview
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
Embedding Pig in scripting languages
Embedding Pig in scripting languages
Recently uploaded
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
Devintelle Consulting Service Pvt Ltd Odoo OpenERP
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
FerryKemperman
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
BradBedford3
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
kalichargn70th171
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Drew Moseley
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
Sujith Sukumaran
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
Hr365.us smith
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
Envertis Software Solutions
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
OnePlan Solutions
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
Christoph Pohl
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
confluent
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
rcbcrtm
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Cizo Technology Services
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
Marharyta Nedzelska
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
preethippts
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
Dinusha Kumarasiri
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
Wave PLM
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
andrehoraa
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Matt Ray
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
vyaparkranti
Recently uploaded
(20)
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
Mule soft mar 2017 Parquet Arrow
1.
© 2017 Dremio
Corporation @DremioHQ The future of column-oriented data processing with Arrow and Parquet Julien Le Dem, Principal Architect Dremio, VP Apache Parquet
2.
© 2017 Dremio
Corporation @DremioHQ • Architect at @DremioHQ • Formerly Tech Lead at Twitter on Data Platforms. • Creator of Parquet • Apache member • Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet Julien Le Dem @J_ Julien
3.
© 2017 Dremio
Corporation @DremioHQ Agenda • Community Driven Standard – Interoperability and Ecosystem • Benefits of Columnar representation – On disk (Apache Parquet) – In memory (Apache Arrow) • Future of columnar
4.
© 2017 Dremio
Corporation @DremioHQ Community Driven Standard • Parquet: Common need for on disk columnar. • Arrow: Common need for in memory columnar. • Arrow building on the success of Parquet. • Benefits: – Share the effort – Create an ecosystem • Standard from the start
5.
© 2017 Dremio
Corporation @DremioHQ Arrow goals • Well-documented and cross language compatible • Designed to take advantage of the modern CPU • Embeddable in execution engines, storage layers, etc. • Interoperable
6.
© 2017 Dremio
Corporation @DremioHQ Interoperability and Ecosystem Before With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on marshalling • Duplication and unnecessary conversions • All systems utilize the same memory format • No overhead for cross-system communication • Shared functionality (Parquet-to-Arrow reader) High Performance Sharing & InterchangeHigh Interoperability cost:
7.
© 2017 Dremio
Corporation @DremioHQ Benefits of Columnar formats @EmrgencyKittens
8.
© 2017 Dremio
Corporation @DremioHQ Columnar layout Logical table representation Row layout Column layout
9.
© 2017 Dremio
Corporation @DremioHQ On Disk and in Memory • Different trade offs – On disk: Storage. • Accessed by multiple queries. • Priority to I/O reduction (but still needs good CPU throughput). • Mostly Streaming access. – In memory: Transient. • Specific to one query execution. • Priority to CPU throughput (but still needs good I/O). • Streaming and Random access.
10.
© 2017 Dremio
Corporation @DremioHQ Parquet on disk columnar format • Nested data structures • Compact format: – type aware encodings – better compression • Optimized I/O: – Projection push down (column pruning) – Predicate push down (filters based on stats)
11.
© 2017 Dremio
Corporation @DremioHQ Access only the data you need a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + = Columnar Statistics Read only the data you need!
12.
© 2017 Dremio
Corporation @DremioHQ Arrow in memory columnar format • Nested Data Structures • Maximize CPU throughput – Pipelining – SIMD – cache locality • Scatter/gather I/O
13.
© 2017 Dremio
Corporation @DremioHQ Focus on CPU Efficiency Arrow Memory Buffer • Cache Locality • Super-scalar & vectorized operation • Minimal Structure Overhead • Constant value access – With minimal structure overhead • Operate directly on columnar data
14.
© 2017 Dremio
Corporation @DremioHQ Arrow Messages, RPC & IPC
15.
© 2017 Dremio
Corporation @DremioHQ Common Message Pattern • Schema Negotiation – Logical Description of structure – Identification of dictionary encoded Nodes • Dictionary Batch – Dictionary ID, Values • Record Batch – Batches of records up to 64K – Leaf nodes up to 2B values Schema Negotiation Dictionary Batch Record Batch Record Batch Record Batch 1..N Batches 0..N Batches
16.
© 2017 Dremio
Corporation @DremioHQ Multi-system IPC SQL engine Python process User defined function SQL Operator 1 SQL Operator 2 reads reads
17.
© 2017 Dremio
Corporation @DremioHQ Summary and Future
18.
© 2017 Dremio
Corporation @DremioHQ Current activity: • Spark Integration (SPARK-13534) • Dictionary encoding (ARROW-542) • Time related types finalization (ARROW-617) • Arrow REST API • Bindings: – C, Ruby (ARROW-631) – JavaScript (ARROW-541)
19.
© 2017 Dremio
Corporation @DremioHQ Some results - PySpark Integration: 53x speedup (IBM spark work on SPARK-13534) http://s.apache.org/arrowstrata1 - Streaming Arrow Performance 7.75GB/s data movement http://s.apache.org/arrowstrata2 - Arrow Parquet C++ Integration 4GB/s reads http://s.apache.org/arrowstrata3 - Pandas Integration 9.71GB/s http://s.apache.org/arrowstrata4
20.
© 2017 Dremio
Corporation @DremioHQ Get Involved • Join the community – dev@{arrow,parquet}.apache.org – Slack: • https://apachearrowslackin.herokuapp.com/ • https://parquet-slack-invite.herokuapp.com/ – http://{arrow,parquet}.apache.org – Follow @Apache{Parquet,Arrow}
Download now