
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it Applies to a Time Series Database

InfluxDB IOx Tech Talks - December 2020
A Rusty Introduction to Apache Arrow and How it Applies to a Time Series Database

This session will start with a tech talk from an InfluxDB IOx team member. This is your chance to interact directly with Influxers who are available to answer your questions about all things InfluxDB IOx and time series — including Paul Dix, Founder and CTO of InfluxData. This event will last about an hour and there will be time for live Q&A.



  1. Paul Dix (@pauldix, paul@influxdata.com)
     InfluxDB IOx Project Update
  2. Foundational Work
     © 2020 InfluxData. All rights reserved.
     • Backward compatibility & integration (Flux & InfluxQL)
     • In-memory columnar store (Segment Store)
     • Data lifecycle (WAL, Parquet)
     • Apache Arrow Flight (RPC)
  3. Integration
     [Diagram: InfluxDB 2.x (Flux, InfluxQL) connected to InfluxDB IOx over gRPC]
  4. How data is organized
     [Diagram: the database "mydata" is divided into partitions (2020-12-07, 2020-12-08, 2020-12-09); each partition holds numbered chunks]
  5. Data lifecycle
     [Diagram: stages labeled Ingest, Writable, and Immutable; components: WAL Buffer, Write Buffer, WAL Segment (object store), Chunks, Segment Store (in memory), Parquet Files (object store)]
  6. When can you use it?
  7. Thank You
  8. A Rusty Introduction to Apache Arrow and how it Applies to a Time Series Database
     December 9, 2020
     Andrew Lamb, InfluxData
  9. IOx Team at InfluxData
     • Query Optimizer / Architect @ Vertica (Columnar Database)
     • Chief Architect @ DataRobot (Machine Learning Platform)
     • Chief Architect @ Nutonian (Machine Learning Apps)
     • XSLT JIT Compiler Team at DataPower
  10. Goals + Outline
     Goal: ⇒ Arrow is a good basis for new (time series) databases ❤️
     ● Opinions and Perspectives of Databases
     ● Background on Arrow
     ● Arrow Examples, in Rust
  11. Databases -- Trend Towards Specialization
     ● Data Model: Relational, Key-Value, Timeseries, Graph, Array / Scientific, Document, Stream
     ● Deployment: Embedded / Edge, Cloud, Single-Node, Hybrid
     ● Ecosystem: Hadoop, Java, JSON / JavaScript, AWS, GCP, Azure, Apple Cloud
     ● Use Case: Transactions, Analytics, Streaming, ...
     Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI: https://doi.org/10.1109/ICDE.2005.1
  12. … and our new database is … 🎉
     InfluxDB IOx - The Future Core of InfluxDB Built with Rust and Arrow
  13. Analytic Systems (vs Transactional)
     ● Transactional (OLTP, Key-value stores, etc.)
        ○ Workload: "look up a record by id", "update a record", "keep data durable and consistent"
        ○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, etc.
     ● Analytic (OLAP, "Big Data", etc.)
        ○ Workload: aggregate many rows to get a historical view, bulk loads, rarely updated
        ○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc.
  14. So, you want to build a new database… ?
     Databases need many features just to look like a database:
     ● Get Data In and Out
     ● Store Data and Catalog / Metadata
     ● Query Store: + Query Language
     ● Connect: Client API
     … before you can invest in what makes your database special
  15. Implementation timeline for a new Database system
     Features placed along the timeline:
     ● Client API
     ● In memory storage
     ● In-Memory filter + aggregation
     ● Durability / persistence
     ● Metadata Catalog + Management
     ● Query Language Parser
     ● Optimized / Compressed storage
     ● Execution on Compressed Data
     ● Joins!
     ● Additional Client Languages
     ● Outer Joins
     ● Subquery support
     ● More advanced analytics
     ● Cost based optimizer
     ● Out of core algorithms
     ● Storage Rearrangement
     ● Heuristic Query Planner
     ● Arithmetic expressions
     ● Date / time Expressions
     ● Concurrency Control
     ● Data Model / Type System
     ● Distributed query execution
     ● Resource Management
     ● Online recovery
     Milestones marked on the timeline:
     ● "Let's Build a Database" 🤔
     ● "Ok now this is pretty good" 😐
     ● "Look mom! I have a database!" 😃
  16. Arrow Project Goals
     "Build a better open source foundation for data science"
     How is this related to databases?
     https://arrow.apache.org/
  17. Arrow == toolkit for modern analytic databases
         match tool_needed {
             File Format (persistence)       => Parquet
             Columnar memory representation  => Arrow Arrays
             Operations (e.g. add, multiply) => Compute Kernels
             Network transfer                => Arrow Flight IPC
             _                               => ... to be continued ...
         }
  18. InfluxDB line protocol
         weather,location=us-east temperature=82,humidity=67 1465839830100400200
         weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
         weather,location=us-west temperature=70,humidity=54 1465839830100400200
         weather,location=us-east temperature=83,humidity=69 1465839830200400200
         weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
         weather,location=us-west temperature=72,humidity=56 1465839830200400200
         weather,location=us-east temperature=84,humidity=67 1465839830300400200
         weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
         weather,location=us-west temperature=71,humidity=57 1465839830400400200
     Parts of each line, in order: Measurement, Tags, Fields, Timestamp
     Line Protocol Tutorial (link)
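     To make the four parts of a line concrete, here is a minimal parsing sketch in plain Rust. It is not the parser IOx uses: the `Point` struct and the `parse_line` and `split_pairs` functions are hypothetical names, and the sketch assumes no escaped characters, no quoted string fields, float-only field values, and a timestamp on every line.

```rust
/// One parsed line of (simplified) line protocol. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct Point {
    pub measurement: String,
    pub tags: Vec<(String, String)>,
    pub fields: Vec<(String, f64)>,
    pub timestamp_ns: i64,
}

/// Splits "k1=v1,k2=v2" into key/value pairs; returns None if any pair lacks '='.
fn split_pairs(s: &str) -> Option<Vec<(&str, &str)>> {
    s.split(',').map(|kv| kv.split_once('=')).collect()
}

/// Parses one line under the simplifying assumptions stated above.
/// Returns None on any malformed input instead of reporting a detailed error.
pub fn parse_line(line: &str) -> Option<Point> {
    // The three whitespace-separated sections: series key, field set, timestamp.
    let mut parts = line.split_whitespace();
    let series = parts.next()?;
    let field_set = parts.next()?;
    let timestamp_ns = parts.next()?.parse::<i64>().ok()?;

    // "weather,location=us-east" -> measurement plus an optional tag set.
    let (measurement, tag_set) = match series.split_once(',') {
        Some((m, t)) => (m, Some(t)),
        None => (series, None),
    };
    let tags: Vec<(String, String)> = match tag_set {
        Some(t) => split_pairs(t)?
            .into_iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
        None => Vec::new(),
    };

    // Every field value is assumed to be a float in this sketch.
    let fields = split_pairs(field_set)?
        .into_iter()
        .map(|(k, v)| v.parse::<f64>().ok().map(|v| (k.to_string(), v)))
        .collect::<Option<Vec<_>>>()?;

    Some(Point {
        measurement: measurement.to_string(),
        tags,
        fields,
        timestamp_ns,
    })
}
```

     Calling `parse_line` on the first line of the slide yields the measurement `weather`, one tag (`location=us-east`), two fields, and the nanosecond timestamp.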
  19. IOx Data Model
         weather,location=us-east temperature=82,humidity=67 1465839830100400200
         weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
         weather,location=us-west temperature=70,humidity=54 1465839830100400200
         weather,location=us-east temperature=83,humidity=69 1465839830200400200
         weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
         weather,location=us-west temperature=72,humidity=56 1465839830200400200
         weather,location=us-east temperature=84,humidity=67 1465839830300400200
         weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
         weather,location=us-west temperature=71,humidity=57 1465839830400400200

     Table: weather
         location       temperature  humidity  timestamp
         "us-east"      82           67        2016-06-13T17:43:50.1004002Z
         "us-midwest"   82           65        2016-06-13T17:43:50.1004002Z
         "us-west"      70           54        2016-06-13T17:43:50.1004002Z
         "us-east"      83           69        2016-06-13T17:43:50.2004002Z
         "us-midwest"   87           78        2016-06-13T17:43:50.2004002Z
         "us-west"      72           56        2016-06-13T17:43:50.2004002Z
         "us-east"      84           67        2016-06-13T17:43:50.3004002Z
         "us-midwest"   90           82        2016-06-13T17:43:50.3004002Z
         "us-west"      71           57        2016-06-13T17:43:50.3004002Z
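     The table view can be sketched as a column-oriented struct: one vector per column instead of one record per row. This is an illustration in plain Rust, not IOx's actual storage; `WeatherTable` and its methods are hypothetical names, and the field values are assumed to be floats.

```rust
/// Column-oriented storage for the `weather` table: one vector per column.
/// Hypothetical type for illustration only.
#[derive(Debug, Default)]
pub struct WeatherTable {
    pub location: Vec<String>,  // tag column
    pub temperature: Vec<f64>,  // field column
    pub humidity: Vec<f64>,     // field column
    pub timestamp_ns: Vec<i64>, // nanoseconds since the Unix epoch
}

impl WeatherTable {
    /// Appends one row by pushing a value onto every column.
    pub fn push_row(&mut self, location: &str, temperature: f64, humidity: f64, timestamp_ns: i64) {
        self.location.push(location.to_string());
        self.temperature.push(temperature);
        self.humidity.push(humidity);
        self.timestamp_ns.push(timestamp_ns);
    }

    /// Number of rows; every column has the same length.
    pub fn num_rows(&self) -> usize {
        self.timestamp_ns.len()
    }

    /// Mean temperature. Only the temperature column is scanned, which is the
    /// access pattern that makes columnar layouts attractive for analytics.
    pub fn mean_temperature(&self) -> Option<f64> {
        if self.temperature.is_empty() {
            return None;
        }
        Some(self.temperature.iter().sum::<f64>() / self.temperature.len() as f64)
    }
}
```

     An aggregate such as `mean_temperature` reads one contiguous vector and never touches locations, humidity, or timestamps.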
  20. Code Examples
     Thesis: "When writing an analytic database, you will end up implementing the Arrow feature set"
     (Ecosystem integration is another major benefit of Arrow, subject of a future talk)
     Compare Plain Rust and Rust using the Arrow library
     * Take performance comparisons with a large grain of salt
  21. Motivating Example
     "Find the rows that are not in `us-west`"
  22. Create the Array
     Plain Rust:
         let string_vec: Vec<String> =
             (0..NUM_TAGS)
                 .map(|i| {
                     match i % 3 {
                         0 => "us-east",
                         1 => "us-midwest",
                         2 => "us-west",
                         _ => unreachable!(), // i % 3 is always 0, 1, or 2
                     }.into()
                 })
                 .collect();

         > created array with 10000000 elements (~600ms)

     Rust with Arrow:
         let mut builder = StringBuilder::new(NUM_TAGS);
         (0..NUM_TAGS).enumerate()
             .for_each(|(i, _)| {
                 let location = match i % 3 {
                     0 => "us-east",
                     1 => "us-midwest",
                     2 => "us-west",
                     _ => unreachable!(), // i % 3 is always 0, 1, or 2
                 };
                 builder.append_value(location)
                     .unwrap()
             });
         let array = builder.finish();

         > created array with 10000000 elements (~400ms)
  23. Memory Footprint
     Plain Rust:
         let size =
             size_of::<Vec<String>>() +
             string_vec
                 .iter()
                 .fold(0, |sz, s| {
                     sz + size_of::<String>() + s.len()
                 });
         println!("total size: {} bytes", size);

         > total size: 320000023 bytes (~320 MB) *

     Rust with Arrow:
         println!("total size: {} bytes", array.get_array_memory_size());

         > total size: 149206128 bytes (~150 MB)
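     Much of the gap comes from layout. On a 64-bit target each `String` carries a 24-byte header and its own heap allocation, so the slide's 320,000,023 bytes break down as 24 bytes for the `Vec`, 240,000,000 bytes of `String` headers, and 79,999,999 bytes of text. The sketch below mimics the offsets-plus-bytes idea behind Arrow's variable-length string layout in plain Rust. It is not Arrow's implementation: `PackedStrings` and `vec_string_bytes` are hypothetical names, and `payload_bytes` ignores spare capacity and validity bitmaps, which is one reason it will not reproduce the figure Arrow itself reports.

```rust
use std::mem::size_of;

/// Strings stored back to back in one byte buffer and located by offsets.
/// Hypothetical type that imitates the idea, not the Arrow crate's array.
pub struct PackedStrings {
    offsets: Vec<u32>, // offsets[i]..offsets[i + 1] is the byte range of value i
    data: Vec<u8>,     // all string bytes, concatenated
}

impl PackedStrings {
    /// Builds the packed representation; assumes total length fits in u32.
    pub fn from_strs(values: &[&str]) -> Self {
        let mut offsets = Vec::with_capacity(values.len() + 1);
        let mut data = Vec::new();
        offsets.push(0);
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len() as u32);
        }
        Self { offsets, data }
    }

    /// Number of strings stored.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns value `i` as a borrowed slice of the shared buffer (no allocation).
    pub fn get(&self, i: usize) -> &str {
        let (start, end) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        std::str::from_utf8(&self.data[start..end]).expect("built from valid UTF-8")
    }

    /// Payload bytes only: the offsets plus the text, ignoring spare capacity.
    pub fn payload_bytes(&self) -> usize {
        self.offsets.len() * size_of::<u32>() + self.data.len()
    }
}

/// Same accounting as the slide uses for the plain `Vec<String>`.
pub fn vec_string_bytes(values: &[String]) -> usize {
    size_of::<Vec<String>>()
        + values.iter().map(|s| size_of::<String>() + s.len()).sum::<usize>()
}
```

     For the 10,000,000 values on the slide, this accounting gives 4 × 10,000,001 + 79,999,999 = 120,000,003 payload bytes, against 320,000,023 for the `Vec<String>`.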
  24. Find Rows != "us-west"
     Plain Rust:
         let not_west_bitset: Vec<bool> =
             string_vec
                 .iter()
                 .map(|s| s != "us-west")
                 .collect();

         let num_not_west = not_west_bitset
             .iter()
             .filter(|&&v| v)
             .count();

         > Found 6666667 not in west (~50ms)

     Rust with Arrow:
         let not_west_bitset = neq_utf8_scalar(
             &array,
             "us-west"
         ).unwrap();

         let num_not_west = not_west_bitset
             .iter()
             .filter(|v| matches!(v, Some(true)))
             .count();

         > Found 6666667 not in west (~120ms)
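     The plain Rust version stores one byte per row in its `Vec<bool>`, while Arrow boolean arrays are bit-packed. A minimal plain-Rust sketch of a bit-packed comparison result follows. It illustrates the representation only and is not how the Arrow kernel is written; `neq_mask`, `count_set`, and `is_set` are hypothetical helper names.

```rust
/// Compares every value against `needle` and packs the results one bit per row,
/// least-significant bit first within each byte.
pub fn neq_mask(values: &[&str], needle: &str) -> Vec<u8> {
    let mut mask = vec![0u8; (values.len() + 7) / 8];
    for (i, v) in values.iter().enumerate() {
        if *v != needle {
            mask[i / 8] |= 1u8 << (i % 8);
        }
    }
    mask
}

/// Counts the rows that matched. Unused bits in the last byte are always zero,
/// so counting whole bytes is safe.
pub fn count_set(mask: &[u8]) -> usize {
    mask.iter().map(|b| b.count_ones() as usize).sum()
}

/// Reads the bit for row `i`.
pub fn is_set(mask: &[u8], i: usize) -> bool {
    (mask[i / 8] & (1u8 << (i % 8))) != 0
}
```

     With this layout the mask for 10,000,000 rows takes 1,250,000 bytes instead of 10,000,000, and counting matches works a byte at a time through `count_ones`.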
  25. Find Rows != "us-west" (with null handling)
     Plain Rust:
         let string_vec: Vec<Option<String>> = ...;

         let not_west_bitset: Vec<bool> =
             string_vec
                 .iter()
                 .map(|s| {
                     s.as_ref()
                         .map(|s| s != "us-west")
                         .unwrap_or(false)
                 })
                 .collect();

         let num_not_west = not_west_bitset
             .iter()
             .filter(|&&v| v)
             .count();

     Rust with Arrow: same as previous

         > Found 6666667 not in west (~50ms)
  26. Materialize rows for future processing
     Plain Rust:
         let not_west: Vec<String> = not_west_bitset
             .iter()
             .enumerate()
             .filter_map(|(i, &v)| {
                 if v {
                     Some(string_vec[i].clone())
                 } else {
                     None
                 }
             })
             .collect();

         > Made array of 6666667 Strings not in west (~450 ms)

     Rust with Arrow:
         let not_west = filter(
             &array,
             &not_west_bitset
         ).unwrap();

         > Made array of 6666667 Strings not in west (~50 ms)
  27. More efficient encoding (dictionary)
         let vb = StringBuilder::new();
         let kb = Int8Builder::new();
         let mut builder = StringDictionaryBuilder::new(vb, kb);

         (0..NUM_TAGS)
             .enumerate()
             .for_each(|(i, _)| {
                 let location = match i % 3 {
                     0 => "us-east",
                     1 => "us-midwest",
                     2 => "us-west",
                     _ => unreachable!(), // i % 3 is always 0, 1, or 2
                 };
                 builder.append(location).unwrap();
             });
         let array = builder.finish();

         > total size: 10000688 bytes (10 MB), 250 ms

     [Diagram: the dictionary holds [0] "us-east", [1] "us-midwest", [2] "us-west"; the Location column stores only the keys ([u8]): 0 1 2 0 1 2 0 1 2]
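     The idea behind the dictionary diagram can be sketched in plain Rust with a `HashMap`. This is a simplified illustration, not the Arrow `StringDictionaryBuilder`; `DictionaryColumn` and `dictionary_encode` are hypothetical names, and the sketch assumes at most 256 distinct values so that every key fits in a `u8`.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// A dictionary-encoded string column. Hypothetical type for illustration.
pub struct DictionaryColumn {
    pub dictionary: Vec<String>, // distinct values, indexed by key
    pub keys: Vec<u8>,           // one small integer per row
}

impl DictionaryColumn {
    /// Looks up the string for a given row through its key.
    pub fn get(&self, row: usize) -> &str {
        &self.dictionary[self.keys[row] as usize]
    }
}

/// Encodes `values`, assigning keys in order of first appearance.
/// Returns None if there are more distinct values than a u8 key can index.
pub fn dictionary_encode(values: &[&str]) -> Option<DictionaryColumn> {
    let mut index: HashMap<&str, u8> = HashMap::new();
    let mut dictionary = Vec::new();
    let mut keys = Vec::with_capacity(values.len());
    for &v in values {
        let key = match index.get(v) {
            Some(&k) => k,
            None => {
                // First time this value is seen: give it the next free key.
                let k = u8::try_from(dictionary.len()).ok()?;
                index.insert(v, k);
                dictionary.push(v.to_string());
                k
            }
        };
        keys.push(key);
    }
    Some(DictionaryColumn { dictionary, keys })
}
```

     With three distinct locations, each row costs one byte for its key, which is consistent with the roughly 10 MB the slide reports for 10,000,000 rows.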
  28. SIMD Anyone?
         let output = gt(
             &left,
             &right
         ).unwrap();

     [Diagram: gt compares left > right element by element, several values at a time]
         left    right   output
         10      2       1
         20      33      0
         17      2       1
         5       1       1
         23      6       1
         5       7       0
         9       8       1
         12      2       1
         4       7       0
         5       2       1
         76      5       1
         2       6       0
         3       7       0
         5       8       0
  29. SIMD Implementation
     Source: arrow/src/compute/kernels/comparison.rs
         #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), feature = "simd"))]
         fn simd_compare_op<T, F>(
             left: &PrimitiveArray<T>,
             right: &PrimitiveArray<T>,
             op: F,
         ) -> Result<BooleanArray>
         where
             T: ArrowNumericType,
             F: Fn(T::Simd, T::Simd) -> T::SimdMask,
         {
             // use / error checking elided
             let null_bit_buffer = combine_option_bitmap(
                 left.data_ref(), right.data_ref(), len
             )?;

             let lanes = T::lanes();
             let mut result = MutableBuffer::new(
                 left.len() * mem::size_of::<bool>()
             );

             let rem = len % lanes;

             for i in (0..len - rem).step_by(lanes) {
                 let simd_left = T::load(left.value_slice(i, lanes));
                 let simd_right = T::load(right.value_slice(i, lanes));
                 let simd_result = op(simd_left, simd_right);
                 T::bitmask(&simd_result, |b| {
                     result.write(b).unwrap();
                 });
             }

             if rem > 0 {
                 let simd_left = T::load(left.value_slice(len - rem, lanes));
                 let simd_right = T::load(right.value_slice(len - rem, lanes));
                 let simd_result = op(simd_left, simd_right);
                 let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize;
                 T::bitmask(&simd_result, |b| {
                     result.write(&b[0..rem_buffer_size]).unwrap();
                 });
             }

             let data = ArrayData::new(
                 DataType::Boolean,
                 left.len(),
                 None,
                 null_bit_buffer,
                 0,
                 vec![result.freeze()],
                 vec![],
             );
             Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data)))
         }
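     The shape of that kernel (full lane-sized chunks in a main loop, then a separate pass over the remainder) can be shown in plain Rust without SIMD intrinsics. The sketch below is a scalar stand-in, not the Arrow implementation: `gt_chunked` and the lane count `LANES = 4` are assumptions for illustration, and whether the compiler vectorizes the inner loop depends on the target and optimization level.

```rust
/// Hypothetical lane count; real SIMD widths depend on the type and the CPU.
const LANES: usize = 4;

/// Element-wise `left > right`, processed in chunks of LANES values followed by
/// a remainder pass. Scalar code that mirrors the structure of a SIMD kernel.
pub fn gt_chunked(left: &[i32], right: &[i32]) -> Vec<bool> {
    assert_eq!(left.len(), right.len(), "inputs must have equal length");
    let len = left.len();
    let rem = len % LANES;
    let split = len - rem;
    let mut out = Vec::with_capacity(len);

    // Main loop: every chunk holds exactly LANES values, so the inner loop has
    // a fixed trip count.
    for (l, r) in left[..split]
        .chunks_exact(LANES)
        .zip(right[..split].chunks_exact(LANES))
    {
        let mut lane_result = [false; LANES];
        for i in 0..LANES {
            lane_result[i] = l[i] > r[i];
        }
        out.extend_from_slice(&lane_result);
    }

    // Remainder: the last `rem` values that do not fill a whole chunk.
    for (l, r) in left[split..].iter().zip(&right[split..]) {
        out.push(l > r);
    }
    out
}
```

     Run on the 14 left and right values read off slide 28, this performs three full chunks plus a remainder of two and reproduces the output column shown there.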
  30. Other things needed in a database
     ● Vec<Option<String>> to support nulls
     ● Handle other data types with same code
     ● Vectorized implementations of filter, aggregate, etc.
     ● Persist it to storage
     ● Send data over the network
     ● Ecosystem compatibility
     ● ...
  31. Rust / Arrow Community: Good and Getting better
     Major Roadmap Items (see also Apache Arrow (Rust) 2.0.0)
        1. Support Stable Rust
        2. Improved DictionaryArray support and performance
        3. Improved compute kernel performance
        4. SQL: Joins
        5. Parallel CPU-bound operations; Additional platform support (e.g. ARMv8)
     InfluxData specifically is investing in:
        1. Flight IPC
        2. Improved Dictionary and Date/Time support
        3. DataFusion (some other tech talk)
  32. Thank You
     Find us online
     GitHub: https://github.com/influxdata/influxdb_iox
     Slack: https://influxdata.com/slack
     It is early days; there are many cool things left to implement
     And we are hiring (Senior IOx Engineer Job Posting)
