Submit Search
Upload
Kudu: Fast Analytics on Fast Data
•
2 likes
•
448 views
M
michaelguia
Follow
Slides from a tech talk by Jean-Daniel Cryans for SF Spark Hackers
Read less
Read more
Engineering
Report
Share
Report
Share
1 of 52
Download now
Download to read offline
Recommended
Introduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
Introducing Kudu
Introducing Kudu
Jeremy Beard
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
Apache kudu
Apache kudu
Asim Jalis
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Dataconomy Media
Recommended
Introduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
Introducing Kudu
Introducing Kudu
Jeremy Beard
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
Apache kudu
Apache kudu
Asim Jalis
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Dataconomy Media
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
jdcryans
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Rakuten Group, Inc.
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
High concurrency, Low latency analytics using Spark/Kudu
High concurrency, Low latency analytics using Spark/Kudu
Chris George
Introduction to Apache Kudu
Introduction to Apache Kudu
Shravan (Sean) Pabba
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
Kudu demo
Kudu demo
Hemanth Kumar Ratakonda
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
Hive vs. Impala
Hive vs. Impala
Omid Vahdaty
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
Cloudera, Inc.
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
Cloudera, Inc.
How Impala Works
How Impala Works
Yue Chen
Architecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
Introduction to Impala
Introduction to Impala
markgrover
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
More Related Content
What's hot
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
jdcryans
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Rakuten Group, Inc.
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
High concurrency, Low latency analytics using Spark/Kudu
High concurrency, Low latency analytics using Spark/Kudu
Chris George
Introduction to Apache Kudu
Introduction to Apache Kudu
Shravan (Sean) Pabba
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
Kudu demo
Kudu demo
Hemanth Kumar Ratakonda
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
Hive vs. Impala
Hive vs. Impala
Omid Vahdaty
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
Cloudera, Inc.
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
Cloudera, Inc.
How Impala Works
How Impala Works
Yue Chen
Architecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
Introduction to Impala
Introduction to Impala
markgrover
What's hot
(20)
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
High concurrency, Low latency analytics using Spark/Kudu
High concurrency, Low latency analytics using Spark/Kudu
Introduction to Apache Kudu
Introduction to Apache Kudu
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Kudu demo
Kudu demo
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
Hive vs. Impala
Hive vs. Impala
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
How Impala Works
How Impala Works
Architecting Applications with Hadoop
Architecting Applications with Hadoop
Introduction to Impala
Introduction to Impala
Similar to Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
SFHUG Kudu Talk
SFHUG Kudu Talk
Felicia Haggarty
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Spark Summit
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
Apache Kudu: Technical Deep Dive
Apache Kudu: Technical Deep Dive
Cloudera, Inc.
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
Cloudera Japan
What's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
Manish Maheshwari
Strata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
Manish Maheshwari
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Timothy Spann
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
Cloudera, Inc.
End to End Streaming Architectures
End to End Streaming Architectures
Cloudera, Inc.
A5 oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...
Dr. Wilfred Lin (Ph.D.)
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
Ashish Mrig
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
HostedbyConfluent
Similar to Kudu: Fast Analytics on Fast Data
(20)
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
SFHUG Kudu Talk
SFHUG Kudu Talk
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Apache Kudu: Technical Deep Dive
Apache Kudu: Technical Deep Dive
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
What's New in Apache Hive
What's New in Apache Hive
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
End to End Streaming Architectures
End to End Streaming Architectures
A5 oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Recently uploaded
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
bhaskargani46
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
Call Girls in Nagpur High Profile Call Girls
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Call Girls in Nagpur High Profile
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Christo Ananth
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
KreezheaRecto
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
tanu pandey
Online banking management system project.pdf
Online banking management system project.pdf
Kamal Acharya
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
Thermal Engineering Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
DineshKumar4165
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
M Maged Hegazy, LLM, MBA, CCP, P3O
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
mulugeta48
NFPA 5000 2024 standard .
NFPA 5000 2024 standard .
DerechoLaboralIndivi
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
SUHANI PANDEY
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
ranjana rawat
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
DineshKumar4165
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
Recently uploaded
(20)
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Online banking management system project.pdf
Online banking management system project.pdf
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
Thermal Engineering Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
NFPA 5000 2024 standard .
NFPA 5000 2024 standard .
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Kudu: Fast Analytics on Fast Data
1.
1© Cloudera, Inc.
All rights reserved. Jean-‐Daniel Cryans – jdcryans@cloudera.com @jdcryans Kudu: Fast Analytics on Fast Data
2.
2© Cloudera, Inc.
All rights reserved. Why Kudu? 2
3.
3© Cloudera, Inc.
All rights reserved. Current Storage Landscape in Hadoop Ecosystem HDFS (GFS) excels at: • Batch ingest only (eg hourly) • Efficiently scanning large amounts of data (analytics) HBase (BigTable) excels at: • Efficiently finding and writing individual rows • Making data mutable Gaps exist when these properties are needed simultaneously
4.
4© Cloudera, Inc.
All rights reserved. • High throughput for big scans Goal: Within 2x of Parquet • Low-‐latency for short accesses Goal: 1ms read/write on SSD • Database-‐like semantics (initially single-‐row ACID) • Relational data model • SQL queries are easy • “NoSQL” style scan/insert/update (Java/C++ client) Kudu Design Goals
5.
5© Cloudera, Inc.
All rights reserved. Changing Hardware landscape • Spinning disk -‐> solid state storage • NAND flash: Up to 450k read 250k write iops, about 2GB/sec read and 1.5GB/sec write throughput,at a price of less than $3/GB and dropping • 3D XPoint memory (1000x faster than NAND, cheaper than RAM) • RAM is cheaper and more abundant: • 64-‐>128-‐>256GB over last few years • Takeaway: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
6.
6© Cloudera, Inc.
All rights reserved. What’s Kudu? 6
7.
7© Cloudera, Inc.
All rights reserved. Scalable and fast tabular storage • Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes, tens of PBs • Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node • Tabular • Store tables like a normal database (can support SQL, Spark, etc) • Individual record-‐level access to 100+ billion row tables (Java/C++/Python APIs)
8.
8© Cloudera, Inc.
All rights reserved. Kudu Usage • Table has a SQL-‐like schema • Finite number of columns (unlike HBase/Cassandra) • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP • Some subset of columns makes up a possibly-‐composite primary key • Fast ALTER TABLE • Java and C++ “NoSQL” style APIs • Insert(), Update(), Delete(), Scan() • Integrations with MapReduce, Spark, and Impala • Apache Drill work-‐in-‐progress 8
9.
9© Cloudera, Inc.
All rights reserved. Use cases and architectures
10.
10© Cloudera, Inc.
All rights reserved. Kudu Use Cases Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes ● Time Series ○ Examples: Stream market data; fraud detection & prevention; risk monitoring ○ Workload: Insert, updates, scans, lookups ● Machine Data Analytics ○ Examples: Network threat detection ○ Workload: Inserts, scans, lookups ● Online Reporting ○ Examples: ODS ○ Workload: Inserts, updates, scans, lookups
11.
11© Cloudera, Inc.
All rights reserved. Real-‐Time Analytics in Hadoop Today Fraud Detection in the Real World = Storage Complexity Considerations: ● How do I handle failure during this process? ● How often do I reorganize data streaming in into a format appropriate for reporting? ● When reporting, how do I see data that has not yet been reorganized? ● How do I ensure that important jobs aren’t interrupted by maintenance? New Partition Most Recent Partition Historic Data HBase Parquet File Have we accumulated enough data? Reorganize HBase file into Parquet • Wait for running operations to complete • Define new Impala partition referencing the newly written Parquet file Incoming Data (Messaging System) Reporting Request Impala on HDFS
12.
12© Cloudera, Inc.
All rights reserved. Real-‐Time Analytics in Hadoop with Kudu Improvements: ● One system to operate ● No cron jobs or background processes ● Handle late arrivals or data corrections with ease ● New data available immediately for analytics or operations Historical and Real-‐time Data Incoming Data (Messaging System) Reporting Request Storage in Kudu
13.
13© Cloudera, Inc.
All rights reserved. Xiaomi use case • World’s 4th largest smart-‐phone maker (most popular in China) • Gather important RPC tracing events from mobile app and backend service. • Service monitoring & troubleshooting tool. u High write throughput • >5 Billion records/day and growing u Query latest data and quick response • Identify and resolve issues quickly u Can search for individual records • Easy for troubleshooting
14.
14© Cloudera, Inc.
All rights reserved. Xiaomi Big Data Analytics Pipeline Before Kudu • Long pipeline high latency(1 hour ~ 1 day), data conversion pains • No ordering Log arrival(storage) order not exactly logical order e.g. read 2-‐3 days of log for data in 1 day
15.
15© Cloudera, Inc.
All rights reserved. Xiaomi Big Data Analysis Pipeline Simplified With Kudu • ETL Pipeline(0~10s latency) Apps that need to prevent backpressure or require ETL • Direct Pipeline(no latency) Apps that don’t require ETL and no backpressure issues OLAP scan Side table lookup Result store
16.
16© Cloudera, Inc.
All rights reserved. How it works Replication and fault tolerance 16
17.
17© Cloudera, Inc.
All rights reserved. Tables and Tablets • Table is horizontally partitioned into tablets • Range or hash partitioning • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS • bucketNumber = hashCode(row[‘timestamp’]) % 100 • Each tablet has N replicas (3 or 5), with Raft consensus • Tablet servers host tablets on local disk drives 17
18.
18© Cloudera, Inc.
All rights reserved. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-‐replicates under-‐replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them 18
19.
19© Cloudera, Inc.
All rights reserved. Client Hey Master! Where is the row for ‘tlipcon’ in table “T”? It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, … UPDATE tlipcon SET col=foo Meta Cache T1: … T2: … T3: …
20.
20© Cloudera, Inc.
All rights reserved. Raft consensus 20 TS A Tablet 1 (LEADER) Client TS B Tablet 1 (FOLLOWER) TS C Tablet 1 (FOLLOWER) WAL WALWAL 2b. Leader writes local WAL 1a. Client-‐>Leader: Write() RPC 2a. Leader-‐>Followers: UpdateConsensus() RPC 3. Follower: write WAL 4. Follower-‐>Leader: success 3. Follower: write WAL 5. Leader has achieved majority 6. Leader-‐>Client: Success!
21.
21© Cloudera, Inc.
All rights reserved. Fault tolerance • Transient failures: • Follower: Leader can still achieve majority • Leader: automatic leader election (~5 seconds) • Restart dead TS within 5 min and it will rejoin transparently • Permanent failures • After 5 minutes, automatically creates a new follower replica and copies data • N replicas handle (N-‐1)/2 failures 21
22.
22© Cloudera, Inc.
All rights reserved. How it works Columnar storage 22
23.
23© Cloudera, Inc.
All rights reserved. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text
24.
24© Cloudera, Inc.
All rights reserved. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’; Only read 1 column 1GB 2GB 1GB 200GB
25.
25© Cloudera, Inc.
All rights reserved. Columnar compression {1442865158, 1442828307, 1442865156, 1442865155} Created_at Created_at Diff(created_at) 1442865158 n/a 1442828307 -‐36851 1442865156 36849 1442865155 -‐1 64 bits each 17 bits each • Many columns can compress to a few bits per row! • Especially: • Timestamps • Time series values • Low-‐cardinality strings • Massive space savings and throughput increase!
26.
26© Cloudera, Inc.
All rights reserved. Handling inserts and updates • Inserts go to an in-‐memory row store (MemRowSet) • Durable due to write-‐ahead logging • Later flush to columnar format on disk • Updates go to in-‐memory “delta store” • Later flush to “delta files” on disk • Eventually “compact” into the previously-‐written columnar data files • Details elided here due to time constraints • available in other slide decks online, or come find me later to learn more!
27.
27© Cloudera, Inc.
All rights reserved. Integrations
28.
28© Cloudera, Inc.
All rights reserved. Spark DataSource integration (WIP) sqlContext.load("org.kududb.spark", Map("kudu.table" -> “foo”, "kudu.master" -> “master.example.com”)) .registerTempTable(“mytable”) df = sqlContext.sql( “select col_a, col_b from mytable “ + “where col_c = 123”) https://github.com/tmalaska/SparkOnKudu
29.
29© Cloudera, Inc.
All rights reserved. Impala integration • CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM … • INSERT/UPDATE/DELETE • Optimizations like predicate pushdown, scan parallelism, more on the way • Not an Impala user? Apache Drill integration in the works by Dremio
30.
30© Cloudera, Inc.
All rights reserved. MapReduce integration • Most Kudu integration/correctness testing via MapReduce • Multi-‐framework cluster (MR + HDFS + Kudu on the same disks) • KuduTableInputFormat / KuduTableOutputFormat •Support for pushing predicates, column projections, etc
31.
31© Cloudera, Inc.
All rights reserved. Performance 31
32.
32© Cloudera, Inc.
All rights reserved. TPC-‐H (Analytics benchmark) • 75 server cluster • 12 (spinning) disk each, enough RAM to fit dataset • TPC-‐H Scale Factor 100 (100GB) • Example query: • SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc; 32
33.
33© Cloudera, Inc.
All rights reserved. -‐ Kudu outperforms Parquet by 31% (geometric mean) for RAM-‐resident data
34.
34© Cloudera, Inc.
All rights reserved. Versus other NoSQL storage • Phoenix: SQL layer on HBase • 10 node cluster (9 worker, 1 master) • TPC-‐H LINEITEM table only (6B rows) 34 2152 219 76 131 0.04 1918 13.2 1.7 0.7 0.15 155 9.3 1.4 1.5 1.37 0.01 0.1 1 10 100 1000 10000 Load TPCH Q1 COUNT(*) COUNT(*) WHERE… single-‐row lookup Time (sec) Phoenix Kudu Parquet
35.
35© Cloudera, Inc.
All rights reserved. Xiaomi benchmark • 6 real queries from application trace analysis application • Q1: SELECT COUNT(*) • Q2: SELECT hour, COUNT(*) WHERE module = ‘foo’ GROUP BY HOUR • Q3: SELECT hour, COUNT(DISTINCT uid) WHERE module = ‘foo’ AND app=‘bar’ GROUP BY HOUR • Q4: analytics on RPC success rate over all data for one app • Q5: same as Q4, but filter by time range • Q6: SELECT * WHERE app = … AND uid = … ORDER BY ts LIMIT 30 OFFSET 30
36.
36© Cloudera, Inc.
All rights reserved. Xiaomi: Benchmark Results 1.4 2.0 2.3 3.1 1.3 0.9 1.3 2.8 4.0 5.7 7.5 16.7 Q1 Q2 Q3 Q4 Q5 Q6 kudu parquet Query latency: * HDFS parquet file replication = 3 , kudu table replication = 3 * Each query run 5 times then take average
37.
37© Cloudera, Inc.
All rights reserved. What about NoSQL-‐style random access? (YCSB) • YCSB 0.5.0-‐snapshot • 10 node cluster (9 worker, 1 master) • 100M row data set • 10M operations each workload 37
38.
38© Cloudera, Inc.
All rights reserved. Getting started 38
39.
39© Cloudera, Inc.
All rights reserved. Project status • Open source beta released in September 2015 • First update version 0.6.0 released last month • Usable for many applications • Have not experienced data loss, reasonably stable (almost no crashes reported) • Still requires some expert assistance, and you’ll probably find some bugs • Officially now known as Apache Kudu(incubating)!
40.
40© Cloudera, Inc.
All rights reserved. Kudu Community Your company here!
41.
41© Cloudera, Inc.
All rights reserved. Getting started as a user • http://getkudu.io • kudu-‐user@googlegroups.com • http://getkudu-‐slack.herokuapp.com/ • Quickstart VM • Easiest way to get started • Impala and Kudu in an easy-‐to-‐install VM • CSD and Parcels • For installation on a Cloudera Manager-‐managed cluster 41
42.
42© Cloudera, Inc.
All rights reserved. Getting started as a developer • http://github.com/cloudera/kudu • All commits go here first • Public gerrit: http://gerrit.cloudera.org • All code reviews happening here • Public JIRA: http://issues.cloudera.org • Includes bugs going back to 2013. Come see our dirty laundry! • kudu-‐dev@googlegroups.com • Apache 2.0 license open source • Contributions are welcome and encouraged! 42
43.
43© Cloudera, Inc.
All rights reserved. http://getkudu.io/ @apachekudu
44.
44© Cloudera, Inc.
All rights reserved. How it works Write and read paths 44
45.
45© Cloudera, Inc.
All rights reserved. Kudu storage – Inserts and Flushes 45 MemRowSet INSERT(“todd”, “$1000”,”engineer”) name pay role DiskRowSet 1 flush
46.
46© Cloudera, Inc.
All rights reserved. Kudu storage – Inserts and Flushes 46 MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 INSERT(“doug”, “$1B”, “Hadoop man”) flush
47.
47© Cloudera, Inc.
All rights reserved. Kudu storage -‐ Updates 47 MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS Each DiskRowSet has its own DeltaMemStore to accumulate updates base data base data
48.
48© Cloudera, Inc.
All rights reserved. Kudu storage -‐ Updates 48 MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS UPDATE set pay=“$1M” WHERE name=“todd” Is the row in DiskRowSet 2? (check bloom filters) Is the row in DiskRowSet 1? (check bloom filters) Bloom says: no! Bloom says: maybe! Search key column to find offset: rowid = 150 150: pay=$1M @ time T base data
49.
49© Cloudera, Inc.
All rights reserved. Kudu storage – Read path 49 MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 150: pay=$1M @ time T Read rows in DiskRowSet 2 Then, read rows in DiskRowSet 1 Updates are applied based on ordinal offset within DRS: array indexing = fast base data base data
50.
50© Cloudera, Inc.
All rights reserved. Kudu storage – Delta flushes 50 MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 150: pay=$1M @ time T REDO DeltaFile Flush A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version base data base data
51.
51© Cloudera, Inc.
All rights reserved. Kudu storage – Major delta compaction 51 name pay role DiskRowSet(pre-‐compaction) Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile Many deltas accumulate: lots of delta application work on reads Merge updates for columns with high update percentage name pay role DiskRowSet(post-‐compaction) Delta MS Unmerged REDO deltasUNDO deltas base data If a column has few updates, doesn’t need to be re-‐ written: those deltas maintained in new DeltaFile
52.
52© Cloudera, Inc.
All rights reserved. Kudu storage – RowSet Compactions DRS 1 (32MB) [PK=alice], [PK=joe], [PK=linda], [PK=zach] DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary] [PK=zeke] DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar] [PK=zoe] DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB) [alice, bob, carl, joe] [jon, julie, linda, mary] [omar, zach, zeke, zoe] Reorganize rows to avoid rowsets with overlapping key ranges
Download now