© 2018 Dremio Corporation @DremioHQ
Using LLVM to accelerate processing of data in Apache Arrow
DataWorks Summit, San Jose
June 21, 2018
Siddharth Teotia
Who?
Siddharth Teotia
@siddcoder
loonytek
Quora
• Software Engineer @ Dremio
• Committer - Apache Arrow
• Formerly at Oracle (Database Engine team)
Agenda
• Introduction to Apache Arrow
• Arrow in Practice: Introduction to Dremio
• Why Runtime Code Generation in Databases?
• Commonly used Runtime Code Generation Techniques
• Runtime Code Generation Requirements
• Introduction to LLVM
• LLVM in Dremio
Apache Arrow Project
• Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Designed to work with any programming language
3. Flexible data model that handles both flat and nested types
• Developers from 13+ major open source projects involved:
– Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
Arrow goals
• Columnar in-memory representation optimized for efficient use of processor cache through data locality.
• Designed to take advantage of modern CPU characteristics by implementing algorithms that leverage hardware acceleration.
• Interoperability for high-speed exchange between data systems.
• Embeddable in execution engines, storage layers, etc.
• Well-documented and cross-language compatible.
High Performance Interface for Data Exchange
Without a common format:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. Parquet-to-Arrow reader)
Apache Arrow Adoption
Focus on CPU Efficiency
Traditional Memory Buffer (row format) vs. Arrow Memory Buffer (columnar format)
• Maximize CPU throughput
– SIMD
– Cache Locality
• Vectorized operations.
• Constant value access
– With minimal structure overhead
• Use efficient lightweight compression schemes on a per column basis.
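As a rough illustration of the two layouts, the sketch below sums one field over a row-format table and over a columnar table. The record fields and function names are made up for this example, and the structs are a simplification rather than Arrow's actual buffer layout.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row format: all fields of one record are stored next to each other.
struct RowRecord {
  int64_t id;
  double price;
  int32_t quantity;
};

// Columnar format: each field lives in its own contiguous buffer.
struct ColumnarTable {
  std::vector<int64_t> id;
  std::vector<double> price;
  std::vector<int32_t> quantity;
};

// Summing one field over rows strides across whole records, so every cache
// line that is fetched also carries id and price bytes the loop never reads.
int64_t SumQuantityRows(const std::vector<RowRecord>& rows) {
  int64_t total = 0;
  for (const RowRecord& row : rows) total += row.quantity;
  return total;
}

// The same sum over a column scans one dense array of int32 values, which
// uses each cache line fully and is a natural candidate for SIMD.
int64_t SumQuantityColumn(const ColumnarTable& table) {
  assert(table.id.size() == table.quantity.size());
  int64_t total = 0;
  for (int32_t q : table.quantity) total += q;
  return total;
}
```

Both functions return the same total; only the memory access pattern differs.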
Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, List, Union
Columnar Data
Real World Arrow: Sabot
• Dremio is an OSS Data-as-a-Service Platform
• The core engine is “Sabot”
– Built entirely on top of Arrow libraries, runs in JVM
Why Runtime Code Generation in Databases?
• In general, what would be the optimal query execution plan?
– A hand-written query plan that does the required processing for exactly the data types and operators required by the query.
– Such an execution plan will only work for that particular query, but it will be the fastest way to execute it.
– We can implement extremely fast dedicated code to process _only_ the kind of data touched by the query.
• However, query engines need to support broad functionality
– Several different data types, SQL operators, etc.
– Interpreter-based execution.
– Generic control blocks to understand arbitrary query-specific runtime information (field types etc., which are not known during query compilation).
– Dynamic dispatch (i.e. virtual calls via function pointers in C++).
Why Runtime Code Generation in Databases? Cont’d
• Interpreted (non-code-generated) execution is not very CPU efficient and hurts query performance
– Generic code that is not tailored to a specific query has excessive branching.
– Cost of a branch misprediction: the entire pipeline has to be flushed.
– Not the best way to implement performance-critical code paths on modern pipelined architectures.
• Most databases generate code at runtime (query execution time)
– When query execution is about to begin, all the information needed to generate query-specific code is available.
– The code-generated function(s) are specific to the query since they are based on information resolved at runtime.
– The result is optimized custom code for executing a particular query.
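To make the contrast concrete, here is a minimal sketch, not taken from Dremio or any real engine, of evaluating x + y in two ways: through a generic expression tree that makes one virtual call per node per row, and through the kind of tight loop that query-specific generated code can reduce to. All type and function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using Column = std::vector<int64_t>;

// Generic, interpreter-style expression node: evaluated via dynamic dispatch.
struct Expr {
  virtual ~Expr() = default;
  virtual int64_t Eval(const std::vector<Column>& cols, std::size_t row) const = 0;
};

struct ColumnRef : Expr {
  explicit ColumnRef(std::size_t index) : index_(index) {}
  int64_t Eval(const std::vector<Column>& cols, std::size_t row) const override {
    return cols[index_][row];
  }
  std::size_t index_;
};

struct Add : Expr {
  Add(std::unique_ptr<Expr> left, std::unique_ptr<Expr> right)
      : left_(std::move(left)), right_(std::move(right)) {}
  int64_t Eval(const std::vector<Column>& cols, std::size_t row) const override {
    return left_->Eval(cols, row) + right_->Eval(cols, row);
  }
  std::unique_ptr<Expr> left_, right_;
};

// Interpreted path: three virtual calls per row for "col0 + col1".
Column EvalInterpreted(const Expr& expr, const std::vector<Column>& cols) {
  Column out(cols.empty() ? 0 : cols[0].size());
  for (std::size_t row = 0; row < out.size(); ++row) {
    out[row] = expr.Eval(cols, row);
  }
  return out;
}

// What code generated for this one query could look like: the types and the
// operator are fixed, so the loop body is a single add with no dispatch.
Column EvalSpecializedAdd(const Column& x, const Column& y) {
  assert(x.size() == y.size());
  Column out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = x[i] + y[i];
  }
  return out;
}
```

Both paths compute the same result; the specialized loop simply has nothing left to decide at runtime.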
Commonly Used Runtime Code Generation Techniques
• Generate query-specific Java classes at query runtime using predefined templates
– Use Janino to compile the runtime-generated classes in-memory to bytecode, then load and execute the bytecode in the same JVM.
– Dremio uses this mechanism.
• Generate query-specific C/C++ code at runtime, exec a compiler (e.g. via execv) and load the executable.
• Problems with existing code-generation mechanisms:
– Heavy object instantiation and dereferencing in generated Java code.
– Compiling and optimizing C/C++ code is known to be slow.
– Inefficient handling of complex and arbitrary SQL expressions.
– Limited opportunities for leveraging modern hardware capabilities (SIMD vectorization, use of wider registers for handling decimals, etc.).
Runtime Code Generation Requirements
• Efficient code generation
– The method used to generate query-specific code at runtime should itself be very efficient.
– The method should be able to leverage target hardware capabilities.
• Query-specific optimized code
– The method should generate highly optimized code to improve query execution performance.
• Efficient handling of arbitrarily complex SQL expressions
– The method should be able to handle complex SQL expressions efficiently.
Introduction to LLVM
• A collection of modular compiler libraries and tools that can be used to build JIT compilation infrastructure.
• LLVM can be used to efficiently generate query-specific optimized native machine code at query runtime for performance-critical operations.
• Potential for significant speedup in overall query execution time.
• Two high-level steps:
– Generate IR (Intermediate Representation) code
– Compile and optimize the IR to machine code targeting a specific architecture
• IR is a low-level specification that is independent of both the source language and the target architecture
• Custom optimization: separate passes to optimize the generated IR.
– Vectorizing loops, combining instructions, etc.
• Full API support for all steps of the compilation process
Introduction to LLVM Cont’d
IR (Intermediate Representation) is the core of LLVM for code generation:
• A low-level, assembly-like language used by LLVM to represent code during compilation.
• Generating IR using IRBuilder
– Part of the C++ API provided by LLVM.
– Programmatically assemble IR modules/functions instruction by instruction.
• Generating IR using cross-compilation
– Clang C++ compiler as a frontend to LLVM.
– Compile C++ functions to the corresponding IR code.
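For the cross-compilation route, the input is ordinary C++. The sketch below shows the sort of small functions one might write for this purpose; the function names are hypothetical and not taken from any library. Assuming the file is saved as funcs.cpp (also a made-up name), a command such as `clang++ -O2 -S -emit-llvm funcs.cpp -o funcs.ll` asks Clang to emit textual LLVM IR instead of machine code.

```cpp
#include <cstdint>

// extern "C" keeps the symbol names unmangled, so a code generator could look
// the functions up by name in the compiled IR module.
extern "C" {

// Hypothetical helper for the "+" operator on two int32 values.
int32_t add_int32_int32(int32_t left, int32_t right) { return left + right; }

// Hypothetical helper for the ">" operator on two int32 values.
bool greater_than_int32_int32(int32_t left, int32_t right) { return left > right; }

}  // extern "C"
```

Writing helpers this way keeps their logic in readable C++, while code assembled with IRBuilder only needs to emit calls to them.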
LLVM in Dremio
Goal: Use LLVM for efficient execution of SQL expressions in native code.
• Has the potential to significantly improve the performance of our execution engine.
Welcome to Gandiva!
Gandiva - Introduction
• A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code generation in LLVM.
• Has no runtime or compile-time dependencies on Dremio or any other execution engine.
• Provides Java APIs that use the JNI bridge underneath to talk to C++ code for code generation and expression evaluation
– Dremio’s execution engine leverages the Gandiva Java APIs
• Expression support
– If/Else, CASE, ==, !=, <, >, etc.
– Function expressions: +, -, /, *, %
– All fixed-width scalar types
– More to come: Boolean expressions, variable-width data, complex types, etc.
Gandiva - Design
IR Generation
Gandiva - Design
Tree Based Expression Builder
• Define the operator, operands, output at each level in the tree
Gandiva - Design
High-level usage of main C++ modules
Gandiva - Design (Sample Usage)
Expression:
if (a > b)
    a + b
else
    a - b

// Schema for the input fields
auto fielda = field("a", int32());
auto fieldb = field("b", int32());
auto schema = arrow::schema({fielda, fieldb});

// Output field
auto field_result = field("res", int32());

// Build the expression tree
auto node_a = TreeExprBuilder::MakeField(fielda);
auto node_b = TreeExprBuilder::MakeField(fieldb);
auto condition = TreeExprBuilder::MakeFunction("greater_than", {node_a, node_b}, boolean());
auto sum = TreeExprBuilder::MakeFunction("add", {node_a, node_b}, int32());
auto sub = TreeExprBuilder::MakeFunction("subtract", {node_a, node_b}, int32());
auto if_node = TreeExprBuilder::MakeIf(condition, sum, sub, int32());
auto expr = TreeExprBuilder::MakeExpression(if_node, field_result);

// Build a projector for the expression
std::shared_ptr<Projector> projector;
Status status = Projector::Make(schema, {expr}, pool_, &projector);

// Create an input Arrow record-batch with some sample data

// Evaluate the expression on the record batch
arrow::ArrayVector outputs;
status = projector->Evaluate(*in_batch, &outputs);
Gandiva - Design
Expression Decomposition
• Suitable for expressions of the type
– input is null -> output is null
• Evaluate the vector’s data buffer and validity buffer independently
– Reduced branches.
– Better CPU efficiency.
– Amenable to SIMD.
– Junk data is also evaluated, but it doesn’t affect the end result.
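The idea can be sketched in plain C++ for the expression a + b. This is an illustration of the technique only, not Gandiva's generated code: the Int32Column struct and the function names are assumptions made for this example, with validity stored as a bitmap of one bit per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A nullable int32 column: a data buffer plus a validity bitmap in which
// bit i is set when row i is non-null. Simplified, not the Arrow C++ API.
struct Int32Column {
  std::vector<int32_t> data;
  std::vector<uint8_t> validity;
};

inline bool IsValid(const Int32Column& col, std::size_t row) {
  return (col.validity[row / 8] >> (row % 8)) & 1;
}

// Decomposed evaluation of "a + b", where a null input gives a null output.
Int32Column AddDecomposed(const Int32Column& a, const Int32Column& b) {
  assert(a.data.size() == b.data.size());
  assert(a.validity.size() == b.validity.size());
  Int32Column out;
  out.data.resize(a.data.size());
  out.validity.resize(a.validity.size());

  // Data pass: no null checks, so the loop body has no branches. Null slots
  // hold junk; unsigned arithmetic keeps any junk overflow well defined.
  for (std::size_t i = 0; i < a.data.size(); ++i) {
    const uint32_t sum =
        static_cast<uint32_t>(a.data[i]) + static_cast<uint32_t>(b.data[i]);
    out.data[i] = static_cast<int32_t>(sum);
  }

  // Validity pass: a row is valid only if both inputs are valid, which is a
  // bitwise AND over whole bytes of the bitmaps (eight rows at a time).
  for (std::size_t i = 0; i < a.validity.size(); ++i) {
    out.validity[i] = static_cast<uint8_t>(a.validity[i] & b.validity[i]);
  }
  return out;
}
```

A reader of the result checks IsValid before using a value, so the junk sums computed for null rows are never observed.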
Gandiva - Design
Categories of Function Expressions
• NULL_IF_NULL
– Always decomposable
– If input null -> output null
– Input validity is pushed to top of tree to determine validity of output
– Highly optimized execution
– E.g. +, -, *, / etc.
– Majority of functions
• NULL_NEVER
– Output is never null
– No need to push validity for final result
– E.g. isNumeric(expr), isNull(expr), isDate(expr)
– Actual evaluation done using conditions
• NULL_INTERNAL
– Output can be null
– E.g. castStringToInt(x) + y + z
– Evaluate sub-tree and generate a local bitmap
– Rest of the tree uses local bitmap to continue with decomposed evaluation
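The NULL_INTERNAL case can be sketched for castStringToInt(x) + y + z. The code below is an illustration under stated assumptions, not Gandiva's implementation: ParseInt32 is a hypothetical stand-in for the cast, x is assumed to be non-null on input, and a string that fails to parse is assumed to produce a null.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bitmap = std::vector<uint8_t>;  // one bit per row; a set bit means non-null

inline bool GetBit(const Bitmap& bits, std::size_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}
inline void SetBit(Bitmap& bits, std::size_t i) {
  bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Hypothetical cast: parses an optionally signed decimal string and returns
// false (leaving *out untouched) on bad input or int32 overflow.
bool ParseInt32(const std::string& s, int32_t* out) {
  std::size_t pos = (!s.empty() && s[0] == '-') ? 1 : 0;
  if (pos == s.size()) return false;
  int64_t value = 0;
  for (; pos < s.size(); ++pos) {
    if (s[pos] < '0' || s[pos] > '9') return false;
    value = value * 10 + (s[pos] - '0');
    if (value > 2147483648LL) return false;
  }
  if (s[0] == '-') value = -value;
  if (value > 2147483647LL) return false;
  *out = static_cast<int32_t>(value);
  return true;
}

struct Int32Result {
  std::vector<int32_t> data;
  Bitmap validity;
};

Int32Result EvalCastPlusYPlusZ(const std::vector<std::string>& x,
                               const std::vector<int32_t>& y, const Bitmap& y_valid,
                               const std::vector<int32_t>& z, const Bitmap& z_valid) {
  const std::size_t n = x.size();
  const std::size_t bytes = (n + 7) / 8;
  assert(y.size() == n && z.size() == n);
  assert(y_valid.size() >= bytes && z_valid.size() >= bytes);

  // Step 1: evaluate the NULL_INTERNAL sub-tree row by row with conditions
  // and record which rows produced a value in a local bitmap.
  std::vector<int32_t> cast(n, 0);
  Bitmap local(bytes, 0);
  for (std::size_t i = 0; i < n; ++i) {
    if (ParseInt32(x[i], &cast[i])) SetBit(local, i);
  }

  // Step 2: the rest of the tree is null-if-null, so it is decomposed into a
  // branch-free data pass and a bitmap AND that uses the local bitmap.
  Int32Result out;
  out.data.resize(n);
  out.validity.resize(bytes);
  for (std::size_t i = 0; i < n; ++i) {
    const uint32_t sum = static_cast<uint32_t>(cast[i]) +
                         static_cast<uint32_t>(y[i]) + static_cast<uint32_t>(z[i]);
    out.data[i] = static_cast<int32_t>(sum);
  }
  for (std::size_t b = 0; b < bytes; ++b) {
    out.validity[b] = static_cast<uint8_t>(local[b] & y_valid[b] & z_valid[b]);
  }
  return out;
}
```

Only the cast needs per-row conditions; once its outcome is captured in the local bitmap, the two additions proceed exactly as in the fully decomposable case.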
Gandiva - Design
Handling CASE Expressions
• Interpreting CASE as a chain of if-else-if statements loses optimization opportunities
– Evaluation of the same condition across multiple cases
– Evaluation of the same validity across multiple cases
• Treat as a switch case
• LLVM helps with removing redundant evaluation of validity and conditions across multiple cases
• A temporary bitmap is created and shared amongst all expressions for computing validity of output
– Detect nested if-else and use a single bitmap
– Only the matching “if or else” updates the bitmap
case
    when cond1 then exp1
    when cond2 then exp2
    when cond3 then exp3
    ..
    else exp
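A minimal sketch of the single shared bitmap, written as plain C++ rather than generated IR. The concrete CASE expression, the column names and the function names are assumptions chosen for this example; x is assumed non-null and y is nullable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline bool GetBit(const std::vector<uint8_t>& bits, std::size_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}
inline void SetBit(std::vector<uint8_t>& bits, std::size_t i) {
  bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

struct CaseResult {
  std::vector<int32_t> data;
  std::vector<uint8_t> validity;  // the one bitmap shared by every branch
};

// case when x < 10 then x + y
//      when x < 20 then x
//      else 0 end
CaseResult EvalCase(const std::vector<int32_t>& x,
                    const std::vector<int32_t>& y,
                    const std::vector<uint8_t>& y_valid) {
  const std::size_t n = x.size();
  assert(y.size() == n);
  assert(y_valid.size() >= (n + 7) / 8);

  CaseResult out;
  out.data.assign(n, 0);
  out.validity.assign((n + 7) / 8, 0);  // every row starts out as null

  for (std::size_t i = 0; i < n; ++i) {
    // Only the branch that matches a row writes that row's value and its bit
    // in the shared bitmap; no per-branch bitmaps are created or merged.
    if (x[i] < 10) {
      if (GetBit(y_valid, i)) {  // x + y is null when y is null
        out.data[i] = x[i] + y[i];
        SetBit(out.validity, i);
      }
    } else if (x[i] < 20) {
      out.data[i] = x[i];
      SetBit(out.validity, i);
    } else {
      out.data[i] = 0;
      SetBit(out.validity, i);
    }
  }
  return out;
}
```

Each row's condition chain is evaluated once, and the output validity is complete as soon as the loop finishes.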
Using Gandiva in Dremio
Performance
Java JIT runtime bytecode generation vs. Gandiva runtime code generation in LLVM
• Compare expression evaluation time of five simple expressions on a JSON dataset of 500 million rows
• Tests were run on a Mac machine (2.7GHz quad-core Intel Core i7, 16GB RAM)

Project 5 columns
SELECT
    sum(x + N2x + N3x),
    sum(x * N2x - N3x),
    sum(3 * x + 2 * N2x + N3x),
    count(x >= N2x - N3x),
    count(x + N2x = N3x)
FROM json.d500

CASE-10
SELECT count
(case
    when x < 1000000 then x/1000000 + 0
    when x < 2000000 then x/2000000 + 1
    when x < 3000000 then x/3000000 + 2
    when x < 4000000 then x/4000000 + 3
    when x < 5000000 then x/5000000 + 4
    ……………...
    else 10 end)
FROM json.d500
Performance
Test                  Project time (secs)    Project time (secs)    Improvement
                      with Java JIT          with Gandiva LLVM
Sum                   3.805                  0.558                  6.8x
Project 5 columns     8.681                  1.689                  5.13x
Project 10 columns    24.923                 3.476                  7.74x
CASE-10               4.308                  0.925                  4.66x
CASE-100              1361                   15.187                 89.6x
Get Involved
• Gandiva
– https://github.com/dremio/gandiva
• Arrow
– dev@arrow.apache.org
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ
• Dremio
– https://community.dremio.com/
– https://github.com/dremio/dremio-oss

More Related Content

What's hot

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...InfluxData
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScyllaDB
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOAltinity Ltd
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 

What's hot (20)

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 

Similar to Using LLVM to accelerate processing of data in Apache Arrow

Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examplesLuciano Resende
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeInteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeLuciano Resende
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
IBM Watson & PHP, A Practical Demonstration
IBM Watson & PHP, A Practical DemonstrationIBM Watson & PHP, A Practical Demonstration
IBM Watson & PHP, A Practical DemonstrationClark Everetts
 
Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsRogue Wave Software
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdfPramodhN3
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
 
4 Paradigm Shifts for the Connected Car of the Future
4 Paradigm Shifts for the Connected Car of the Future4 Paradigm Shifts for the Connected Car of the Future
4 Paradigm Shifts for the Connected Car of the FutureHiveMQ
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Codemotion
 
Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...
Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...
Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...KTN
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...Amazon Web Services
 
A Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer Tools
A Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer ToolsA Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer Tools
A Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer ToolsAmazon Web Services
 
Http Services in Rust on Containers
Http Services in Rust on ContainersHttp Services in Rust on Containers
Http Services in Rust on ContainersAnton Whalley
 
Meetup. Technologies Intro for Non-Tech People
Meetup. Technologies Intro for Non-Tech PeopleMeetup. Technologies Intro for Non-Tech People
Meetup. Technologies Intro for Non-Tech PeopleIT Arena
 
NetWork - 15.10.2011 - Applied code generation in .NET
NetWork - 15.10.2011 - Applied code generation in .NET NetWork - 15.10.2011 - Applied code generation in .NET
NetWork - 15.10.2011 - Applied code generation in .NET Dmytro Mindra
 
Click, Click, Test - Automated Tests for APEX Applications
Click, Click, Test - Automated Tests for APEX ApplicationsClick, Click, Test - Automated Tests for APEX Applications
Click, Click, Test - Automated Tests for APEX ApplicationsKai Donato
 
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAAdobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAJaemi Bremner
 

Similar to Using LLVM to accelerate processing of data in Apache Arrow (20)

Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeInteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for Code
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
IBM Watson & PHP, A Practical Demonstration
IBM Watson & PHP, A Practical DemonstrationIBM Watson & PHP, A Practical Demonstration
IBM Watson & PHP, A Practical Demonstration
 
Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applications
 
P4_tutorial.pdf
P4_tutorial.pdfP4_tutorial.pdf
P4_tutorial.pdf
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
4 Paradigm Shifts for the Connected Car of the Future
4 Paradigm Shifts for the Connected Car of the Future4 Paradigm Shifts for the Connected Car of the Future
4 Paradigm Shifts for the Connected Car of the Future
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
 
Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...
Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...
Digital Security by Design: Imperas’ Interests - Simon Davidmann, Imperas Sof...
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
 
A Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer Tools
A Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer ToolsA Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer Tools
A Tale of Two Pizzas: Accelerating Software Delivery with AWS Developer Tools
 
Http Services in Rust on Containers
Http Services in Rust on ContainersHttp Services in Rust on Containers
Http Services in Rust on Containers
 
Meetup. Technologies Intro for Non-Tech People
Meetup. Technologies Intro for Non-Tech PeopleMeetup. Technologies Intro for Non-Tech People
Meetup. Technologies Intro for Non-Tech People
 
NetWork - 15.10.2011 - Applied code generation in .NET
NetWork - 15.10.2011 - Applied code generation in .NET NetWork - 15.10.2011 - Applied code generation in .NET
NetWork - 15.10.2011 - Applied code generation in .NET
 
Click, Click, Test - Automated Tests for APEX Applications
Click, Click, Test - Automated Tests for APEX ApplicationsClick, Click, Test - Automated Tests for APEX Applications
Click, Click, Test - Automated Tests for APEX Applications
 
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAAdobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
 





Using LLVM to accelerate processing of data in Apache Arrow

  • 1. © 2018 Dremio Corporation @DremioHQ Using LLVM to accelerate processing of data in Apache Arrow DataWorks Summit, San Jose June 21, 2018 Siddharth Teotia 1
  • 2. Who?
    Siddharth Teotia @siddcoder loonytek Quora
    • Software Engineer @ Dremio
    • Committer - Apache Arrow
    • Formerly at Oracle (Database Engine team)
  • 3. Agenda
    • Introduction to Apache Arrow
    • Arrow in Practice: Introduction to Dremio
    • Why Runtime Code Generation in Databases?
    • Commonly used Runtime Code Generation Techniques
    • Runtime Code Generation Requirements
    • Introduction to LLVM
    • LLVM in Dremio
  • 4. Apache Arrow Project
    • Top-level Apache Software Foundation project
      – Announced Feb 17, 2016
    • Focused on Columnar In-Memory Analytics
      1. 10-100x speedup on many workloads
      2. Designed to work with any programming language
      3. Flexible data model that handles both flat and nested types
    • Developers from 13+ major open source projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  • 5. Arrow goals
    • Columnar in-memory representation optimized for efficient use of processor cache through data locality.
    • Designed to take advantage of modern CPU characteristics by implementing algorithms that leverage hardware acceleration.
    • Interoperability for high speed exchange between data systems.
    • Embeddable in execution engines, storage layers, etc.
    • Well-documented and cross-language compatible.
  • 6. High Performance Interface for Data Exchange
    • Each system has its own internal memory format
    • 70-80% CPU wasted on serialization and deserialization
    • Functionality duplication and unnecessary conversions
    • All systems utilize the same memory format
    • No overhead for cross-system communication
    • Projects can share functionality (e.g. Parquet-to-Arrow reader)
  • 8. Focus on CPU Efficiency
    [Figure: Traditional Memory Buffer (row format) and Arrow Memory Buffer (columnar format)]
    • Maximize CPU throughput
      – SIMD
      – Cache Locality
    • Vectorized operations.
    • Constant value access
      – With minimal structure overhead
    • Use efficient lightweight compression schemes on a per column basis.
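The row versus columnar contrast on this slide can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Row` and `Columns` types, the field names, and both functions are hypothetical and are not Arrow APIs. The point is that summing one field in the columnar layout walks a single contiguous array, which is what makes cache-friendly and SIMD-friendly loops possible.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical row layout: all fields of one record are stored together.
struct Row {
  int32_t id;
  int32_t price;
  int64_t timestamp;
};

// Hypothetical columnar layout: one contiguous buffer per field.
struct Columns {
  std::vector<int32_t> id;
  std::vector<int32_t> price;
  std::vector<int64_t> timestamp;
};

// Row format: every iteration strides over fields the query never reads.
int64_t SumPriceRows(const std::vector<Row>& rows) {
  int64_t total = 0;
  for (const Row& r : rows) total += r.price;
  return total;
}

// Columnar format: a tight loop over one contiguous array of prices.
int64_t SumPriceColumns(const Columns& cols) {
  int64_t total = 0;
  for (int32_t p : cols.price) total += p;
  return total;
}
```

Both functions return the same total; only the memory access pattern differs.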
  • 9. Arrow Data Types
    • Scalars
      – Boolean
      – [u]int[8,16,32,64], Decimal, Float, Double
      – Date, Time, Timestamp
      – UTF8 String, Binary
    • Complex
      – Struct, List, Union
  • 10. Columnar Data
  • 11. Real World Arrow: Sabot
    • Dremio is an OSS Data-as-a-Service Platform
    • The core engine is “Sabot”
      – Built entirely on top of Arrow libraries, runs in JVM
  • 12. Why Runtime Code Generation in Databases?
    • In general, what would be the optimal query execution plan?
      – A hand-written query plan that does the required processing for exactly the data types and operators required by the query.
      – Such an execution plan will only work for a particular query but will be the fastest way to execute that query.
      – We can implement extremely fast dedicated code to process _only_ the kind of data touched by the query.
    • However, query engines need to support broad functionality
      – Several different data types, SQL operators, etc.
      – Interpreter-based execution.
      – Generic control blocks to understand arbitrary query-specific runtime information (field types etc., which are not known during query compilation).
      – Dynamic dispatch (aka virtual calls via function pointers in C++).
  • 13. Why Runtime Code Generation in Databases? Cont’d
    • Interpreted (non code-generated) execution is not very CPU efficient and hurts query performance
      – Generic code not tailored for a specific query has excessive branching
      – Cost of branch misprediction: the entire pipeline has to be flushed.
      – Not the best way to implement code paths critical for performance on modern pipelined architectures
    • Most databases generate code at runtime (query execution time)
      – When query execution is about to begin, we have all the information available that can be used to generate query-specific code.
      – The code-generated function(s) are specific to the query since they are based on information resolved at runtime.
      – Optimized custom code for executing a particular query.
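The difference between interpreted and query-specific execution can be sketched as follows. This is an illustrative assumption, not code from Dremio or Gandiva: `Expr`, `ColumnRef`, `Add`, and `AddSpecialized` are hypothetical names. The first half evaluates `a + b` through an expression tree with a virtual call per node per row; the second half is the kind of straight-line loop a code generator could emit once the types and operator are known.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Interpreter-style evaluation: each node is reached through a virtual call.
struct Expr {
  virtual ~Expr() = default;
  virtual int64_t Eval(const std::vector<int64_t>& row) const = 0;
};

// Leaf node: reads one column value out of the current row.
struct ColumnRef : Expr {
  explicit ColumnRef(std::size_t index) : index_(index) {}
  int64_t Eval(const std::vector<int64_t>& row) const override {
    return row[index_];
  }
  std::size_t index_;
};

// Inner node: adds the results of its two children.
struct Add : Expr {
  Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : left_(std::move(l)), right_(std::move(r)) {}
  int64_t Eval(const std::vector<int64_t>& row) const override {
    return left_->Eval(row) + right_->Eval(row);
  }
  std::unique_ptr<Expr> left_;
  std::unique_ptr<Expr> right_;
};

// Query-specific version of "a + b" over int64 columns: no tree walk,
// no dynamic dispatch, just one loop over the two input buffers.
void AddSpecialized(const int64_t* a, const int64_t* b, int64_t* out,
                    std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
```

The generic tree handles any expression shape but pays three virtual calls per row here; the specialized loop handles exactly one expression and leaves the compiler free to unroll or vectorize it.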
  • 14. Commonly Used Runtime Code Generation Techniques
    • Generate query-specific Java classes at query runtime using predefined templates
      – Use Janino to compile runtime-generated classes in-memory to bytecode, then load and execute the bytecode in the same JVM.
      – Dremio uses this mechanism.
    • Generate query-specific C/C++ code at runtime, execv a compiler and load the executable.
    • Problems with existing code-generation mechanisms:
      – Heavy object instantiation and dereferencing in generated Java code.
      – Compiling and optimizing C/C++ code is known to be slow.
      – Inefficient handling of complex and arbitrary SQL expressions.
      – Limited opportunities for leveraging modern hardware capabilities
        • SIMD vectorization, use of wider registers for handling decimals, etc.
  • 15. Runtime Code Generation Requirements
    • Efficient code-generation
      – The method to generate query-specific code at runtime should itself be very efficient.
      – The method should be able to leverage target hardware capabilities.
    • Query-specific optimized code
      – The method should generate highly optimized code to improve query execution performance.
    • Handle arbitrary complex SQL expressions efficiently
      – The method should be able to handle complex SQL expressions efficiently.
  • 16. Introduction to LLVM
    • A library providing compiler-related modular tools for implementing JIT compilation infrastructure.
    • LLVM can be used to efficiently generate query-specific optimized native machine code at query runtime for performance-critical operations.
    • Potential for significant speedup in overall query execution time.
    • Two high-level steps:
      – Generate IR (Intermediate Representation) code
      – Compile and optimize IR to machine code targeting a specific architecture
    • IR is a low-level specification that is independent of both source (language) and target (architecture)
    • Custom optimization: separate passes to optimize the generated IR.
      – Vectorizing loops, combining instructions, etc.
    • Full API support for all steps of the compilation process
  • 17. Introduction to LLVM Cont’d
    IR (Intermediate Representation) is the core of LLVM for code generation:
    • A low-level, assembly-language-like specification used by LLVM for representing code during compilation.
    • Generating IR using IRBuilder
      – Part of the C++ API provided by LLVM.
      – Programmatically assemble IR modules/functions instruction by instruction.
    • Generating IR using cross-compilation
      – Clang C++ compiler as a frontend to LLVM.
      – Compile C++ functions to corresponding IR code.
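The cross-compilation route can be illustrated with a small helper written in ordinary C++. The function name `add_int32` and the file name below are assumptions made for this sketch, not taken from the talk. Compiling a file containing it with Clang's `-emit-llvm` flag (for example `clang++ -O2 -S -emit-llvm add_int32.cpp -o add_int32.ll`) produces textual LLVM IR for the function instead of native assembly.

```cpp
#include <cstdint>

// Hypothetical precompiled helper. extern "C" keeps the symbol name
// unmangled, so the function is easy to find by name in the emitted IR.
extern "C" int32_t add_int32(int32_t left, int32_t right) {
  return left + right;
}
```

Helpers written this way stay readable and testable as normal C++, while the IRBuilder route is better suited to the parts of the code that depend on the shape of each query.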
  • 18. LLVM in Dremio
    Goal: Use LLVM for efficient execution of SQL expressions in native code.
    • Has the potential to significantly improve the performance of our execution engine.
    Welcome to Gandiva!
  • 19. Gandiva - Introduction
    • A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code-generation in LLVM.
    • Has no runtime or compile time dependencies on Dremio or any other execution engine.
    • Provides Java APIs that use the JNI bridge underneath to talk to C++ code for code generation and expression evaluation
      – Dremio’s execution engine leverages Gandiva Java APIs
    • Expression support
      – If/Else, CASE, ==, !=, <, >, etc.
      – Function expressions: +, -, /, *, %
      – All fixed width scalar types
      – More to come
        • Boolean expressions, variable width data, complex types, etc.
  • 20. Gandiva - Design: IR Generation
  • 21. Gandiva - Design: Tree Based Expression Builder
    • Define the operator, operands, output at each level in the tree
  • 22. Gandiva - Design: High level usage of main C++ modules
  • 23. Gandiva - Design (Sample Usage)
    Expression: if (a > b) a + b else a - b

      // schema for input fields
      auto fielda = field("a", int32());
      auto fieldb = field("b", int32());
      auto schema = arrow::schema({fielda, fieldb});

      // output fields
      auto field_result = field("res", int32());

      // build expression
      auto node_a = TreeExprBuilder::MakeField(fielda);
      auto node_b = TreeExprBuilder::MakeField(fieldb);
      auto condition = TreeExprBuilder::MakeFunction("greater_than", {node_a, node_b}, boolean());
      auto sum = TreeExprBuilder::MakeFunction("add", {node_a, node_b}, int32());
      auto sub = TreeExprBuilder::MakeFunction("subtract", {node_a, node_b}, int32());
      auto if_node = TreeExprBuilder::MakeIf(condition, sum, sub, int32());
      auto expr = TreeExprBuilder::MakeExpression(if_node, field_result);

      // Build a projector for the expressions
      std::shared_ptr<Projector> projector;
      Status status = Projector::Make(schema, {expr}, pool_, &projector);

      // Create an input Arrow record-batch with some sample data

      // Evaluate expression on record batch
      arrow::ArrayVector outputs;
      status = projector->Evaluate(*in_batch, &outputs);
  • 24. Gandiva - Design: Expression Decomposition
    • Suitable for expressions of type
      – input is null -> output null
    • Evaluate vector’s data buffer and validity buffer independently
      – Reduced branches.
      – Better CPU efficiency
      – Amenable to SIMD.
      – Junk data is also evaluated but it doesn’t affect the end result
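A minimal sketch of decomposed evaluation for `a + b`, assuming a simplified stand-in for an Arrow vector: the `Int32Column` struct, `IsValid`, and `AddDecomposed` are hypothetical names for this illustration and are not Gandiva or Arrow APIs. The data loop carries no null check, and the validity of the result is computed separately as a bitwise AND of the input bitmaps.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an Arrow int32 vector: a data buffer plus a
// validity bitmap with one bit per slot (1 = valid, 0 = null).
struct Int32Column {
  std::vector<int32_t> data;
  std::vector<uint8_t> validity;
};

bool IsValid(const Int32Column& col, std::size_t i) {
  return ((col.validity[i / 8] >> (i % 8)) & 1) != 0;
}

// Decomposed evaluation of "a + b" for inputs of equal length.
Int32Column AddDecomposed(const Int32Column& a, const Int32Column& b) {
  Int32Column out;
  out.data.resize(a.data.size());
  out.validity.resize(a.validity.size());

  // Data buffer: compute every slot, null slots included. The addition is
  // done on unsigned values so junk in null slots cannot overflow a signed int.
  for (std::size_t i = 0; i < a.data.size(); ++i) {
    out.data[i] = static_cast<int32_t>(static_cast<uint32_t>(a.data[i]) +
                                       static_cast<uint32_t>(b.data[i]));
  }

  // Validity buffer: a result slot is valid only if both inputs are valid.
  for (std::size_t i = 0; i < a.validity.size(); ++i) {
    out.validity[i] = static_cast<uint8_t>(a.validity[i] & b.validity[i]);
  }
  return out;
}
```

Slots that are null in either input still receive a computed value in `out.data`, but readers that consult the validity bitmap never look at it.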
  • 25. Gandiva - Design: Expression Decomposition
  • 26. Gandiva - Design: Categories of Function Expressions
    NULL_IF_NULL
      ● Always decomposable
      ● If input null -> output null
      ● Input validity is pushed to top of tree to determine validity of output
      ● Highly optimized execution
      ● E.g. +, -, *, /, etc.
      ● Majority of functions
    NULL_NEVER
      ● Output is never null
      ● No need to push validity for final result
      ● E.g. isNumeric(expr), isNull(expr), isDate(expr)
      ● Actual evaluation done using conditions
    NULL_INTERNAL
      ● Output can be null
      ● E.g. castStringToInt(x) + y + z
      ● Evaluate sub-tree and generate a local bitmap
      ● Rest of the tree uses local bitmap to continue with decomposed evaluation
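The NULL_INTERNAL case can be sketched for `castStringToInt(x) + y`. Everything here is an assumption made for illustration: the `CastStringToInt` rule (digits only, at most nine of them), the treatment of a null `x` as a null result, and the `Result` and `CastAndAdd` names are hypothetical and do not describe Gandiva's actual implementation. The sub-tree is evaluated with explicit conditions and records its own validity in a local bitmap; the remaining addition then proceeds in decomposed form using that bitmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical NULL_INTERNAL function: a valid input can still give a null
// output. Returns false (null) unless s is one to nine decimal digits.
bool CastStringToInt(const std::string& s, int32_t* out) {
  if (s.empty() || s.size() > 9) return false;
  int32_t value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') return false;
    value = value * 10 + (c - '0');
  }
  *out = value;
  return true;
}

struct Result {
  std::vector<int64_t> data;
  std::vector<bool> valid;
};

// Evaluates castStringToInt(x) + y over columns of equal length.
Result CastAndAdd(const std::vector<std::string>& x,
                  const std::vector<bool>& x_valid,
                  const std::vector<int32_t>& y,
                  const std::vector<bool>& y_valid) {
  const std::size_t n = x.size();

  // Step 1: evaluate the NULL_INTERNAL sub-tree with conditions and keep
  // its validity in a local bitmap.
  std::vector<int32_t> cast(n, 0);
  std::vector<bool> local_valid(n, false);
  for (std::size_t i = 0; i < n; ++i) {
    if (x_valid[i]) local_valid[i] = CastStringToInt(x[i], &cast[i]);
  }

  // Step 2: the rest of the tree (a NULL_IF_NULL add) is decomposed:
  // branch-free data loop, validity derived from the local bitmap.
  Result out{std::vector<int64_t>(n), std::vector<bool>(n)};
  for (std::size_t i = 0; i < n; ++i) {
    out.data[i] = static_cast<int64_t>(cast[i]) + y[i];
  }
  for (std::size_t i = 0; i < n; ++i) {
    out.valid[i] = local_valid[i] && y_valid[i];
  }
  return out;
}
```

Only the cast needs per-row branching; once the local bitmap exists, the addition is handled exactly like any other NULL_IF_NULL function.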
  • 27. Gandiva - Design: Handling CASE Expressions
    • Interpreting CASE as if-else-if statements loses optimization opportunities
      – Evaluation of same condition across multiple cases
      – Evaluation of same validity across multiple cases
    • Treat as switch case
    • LLVM helps with removing redundant evaluation of validity and conditions across multiple cases
    • A temporary bitmap is created and shared amongst all expressions for computing validity of output
      – Detect nested if-else and use a single bitmap
      – Only the matching “if or else” updates bitmap

      case
        when cond1 then exp1
        when cond2 then exp2
        when cond3 then exp3
        ..
        else exp
  • 28. Using Gandiva in Dremio
  • 29. Performance
    Java JIT runtime bytecode generation vs. Gandiva runtime code generation in LLVM
    • Compare expression evaluation time of five simple expressions on a JSON dataset of 500 million rows
    • Tests were run on a Mac (2.7GHz quad-core Intel Core i7, 16GB RAM)

    Project 5 columns:
      SELECT sum(x + N2x + N3x),
             sum(x * N2x - N3x),
             sum(3 * x + 2 * N2x + N3x),
             count(x >= N2x - N3x),
             count(x + N2x = N3x)
      FROM json.d500

    Case - 10:
      SELECT count (case
               when x < 1000000 then x/1000000 + 0
               when x < 2000000 then x/2000000 + 1
               when x < 3000000 then x/3000000 + 2
               when x < 4000000 then x/4000000 + 3
               when x < 5000000 then x/5000000 + 4
               ……………...
               else 10 end)
      FROM json.d500
  • 30. Performance

    Test                 Project time (secs),   Project time (secs),   Improvement
                         Java JIT               Gandiva LLVM
    Sum                  3.805                  0.558                  6.8x
    Project 5 columns    8.681                  1.689                  5.13x
    Project 10 columns   24.923                 3.476                  7.74x
    CASE-10              4.308                  0.925                  4.66x
    CASE-100             1361                   15.187                 89.6x
  • 31. Get Involved
    • Gandiva
      – https://github.com/dremio/gandiva
    • Arrow
      – dev@arrow.apache.org
      – http://arrow.apache.org
      – Follow @ApacheArrow, @DremioHQ
    • Dremio
      – https://community.dremio.com/
      – https://github.com/dremio/dremio-oss