SlideShare une entreprise Scribd logo
1  sur  66
© 2014 MapR Technologies 1© 2014 MapR Technologies
© 2014 MapR Technologies 2
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning
© 2014 MapR Technologies 5
What is Drill?
© 2014 MapR Technologies 6
A Query engine that has…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible
© 2014 MapR Technologies 7
Table Can Be an Entire Directory Tree
// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Janpart0001.parquet`
group by errorLevel;
// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;
© 2014 MapR Technologies 8
Basic Process
Zookeepe
r
DFS/HBase DFS/HBase DFS/HBase
Drillbit
Distributed
Cache
Drillbit
Distributed
Cache
Drillbit
Distributed
Cache
Query 1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
c c c
© 2014 MapR Technologies 9
Stages of Query Planning
Parser
Logical
Planner
Physical
Planner
Query
Foreman
Plan
fragments
sent to drill
bits
SQL
Query
Heuristic and
cost based
Cost based
© 2014 MapR Technologies 10
Query Execution
SQL Parser
Optimizer
Scheduler
Pig Parser
PhysicalPlan
Mongo
Cassandra
HiveQL
Parser
RPC Endpoint
Distributed Cache
StorageInterface
OperatorsOperators
Foreman
LogicalPlan
HDFS
HBase
JDBC
Endpoint
ODBC
Endpoint
© 2014 MapR Technologies 11
Batches of Values
• Value vectors
– List of values, with same schema
– With the 4-value semantics for each value
• Shipped around in batches
– max 256k bytes in a batch
– max 64K rows in a batch
• RPC designed for multiple replies to a request
© 2014 MapR Technologies 12
Fixed Value Vectors
© 2014 MapR Technologies 13
Vectorization
• Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions
• GCC, LLVM and JVM all do various optimizations automatically
– Manually code algorithms
• Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
© 2014 MapR Technologies 14
Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler
From http://bit.ly/16Xk32x
© 2014 MapR Technologies 15
Drill compiler
Loaded class
Merge byte-
code of the
two classes
Janino
compiles
runtime
byte-code
CodeModel
generates
code
Precompiled
byte-code
templates
© 2014 MapR Technologies 16
Optimistic
0
20
40
60
80
100
120
140
160
cmd pipeline small db med db large db dw compilation hadoop
Speed vs. check-pointing
No need to checkpoint
Checkpoint frequentlyApache Drill
© 2014 MapR Technologies 17
Optimistic Execution
• Recovery code trivial
– Running instances discard the failed query’s intermediate state
• Pipelining possible
– Send results as soon as batch is large enough
– Requires barrier-less decomposition of query
© 2014 MapR Technologies 18
Pipelining
• Record batches are pipelined between
nodes
– ~256kB usually
• Unit of work for Drill
– Operators works on a batch
• Operator reconfiguration happens at
batch boundaries
DrillBit
DrillBit DrillBit
© 2014 MapR Technologies 19
Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
– when data larger than memory
Drill Bit
Memory
overflow
uses disk
Disk
© 2014 MapR Technologies 20
Cost-based Optimization
• Using Optiq, an extensible framework
– Pluggable rules, and cost model
• Rules for distributed plan generation
– Insert Exchange operator into physical plan
– Optiq enhanced to explore parallel query plans
• Pluggable cost model
– CPU, IO, memory, network cost (data locality)
– Storage engine features (HDFS vs HIVE vs HBase)
Query
Optimizer
Pluggable
rules
Pluggable
cost model
© 2014 MapR Technologies 21
What is SparkSQL?
© 2014 MapR Technologies 22
What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
– Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
– Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable
objects
• Embedded in a real language!
© 2014 MapR Technologies 23
In More Detail
• A Spark program consists of a computation graph that consumes
and produces so-called resilient data datasets
• SparkSQL allows these computations to be defined using SQL
(but needs schema definitions on the RDD’s)
• Conventional Spark programs and SparkSQL programs
interoperate nearly seamlessly
© 2014 MapR Technologies 24
Many Similarities
SQL Parser
Optimizer
Java
PhysicalPlan
Scala
LogicalPlan
Python
group
filter
filter
© 2014 MapR Technologies 25
Important Differences
• Spark execution assumes RDD’s are complete representation,
not a stream of row batches
• Input sources don’t inject optimization rules, nor expose detailed
cost models
• Most RDD’s don’t have a zero-copy capability
• Spark inherits JVM memory model, very limited use of off-heap
© 2014 MapR Technologies 26
scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
| a| b| c|
+---+------+----+
| 3|[3, 2]| xyz|
| 7| null| wxy|
| 7| []|null|
+---+------+----+
© 2014 MapR Technologies 27
scala> sqlContext.sql(
"select a, explode(b) b_v from json.`bug.json`"
).show
+---+---------+
| a| b_v|
+---+---------+
| 3| 3|
| 3| 2|
+---+---------+
© 2014 MapR Technologies 28
First Synthesis
• Drill has a more nuanced optimizer, better code generation
– This often leads to ~2x speed advantage
• Drill has ValueVector and row batches
– This leads to much less memory pressure
• Drill has much stricter memory life-cycle
– Query and done and gone, no need for big GC’s even on big memory
• Drill is all about SQL execution
© 2014 MapR Technologies 29
But …
• Spark can optimize across entire program
– This often leads to ~2x speed advantage
• Spark has much more flexible memory structures
– This can lead to much less memory pressure
• Spark has much more flexible RDD life-cycle
– RDD’s can be cached, persisted or simply recomputed as necessary
• Spark is not all about SQL execution
© 2014 MapR Technologies 30
The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
– Strong impersonation semantics
– Cascading rights via views
– Queries co-exist in a cluster and reserve only their momentary resource
requirements
• Spark focuses heavily on fully integrated execution models
– Any spark function works with (almost) any RDD’s
– Memory residency of RDD’s is the highest goal
© 2014 MapR Technologies 31
Drill security
➢ End to end security from
BI tools to Hadoop
➢ Standard based PAM
Authentication
➢ 2 level user
Impersonation
➢ Fine-grained row and
column level access
control with Drill Views –
no centralized security
repository required
© 2014 MapR Technologies 32
Granular security permissions through Drill views
Name City State Credit Card #
Dave San Jose CA 1374-7914-3865-4817
John Boulder CO 1374-9735-1794-9711
Raw File (/raw/cards.csv)
Owner
Admins
Permission
Admins
Business Analyst Data Scientist
Name City State Credit Card #
Dave San Jose CA 1374-1111-1111-1111
John Boulder CO 1374-1111-1111-1111
Data Scientist View (/views/maskedcards.view.drill)
Not a physical data copy
Name City State
Dave San Jose CA
John Boulder CO
Business Analyst View
Owner
Admins
Permission
Business
Analysts
Owner
Admins
Permission
Data
Scientists
© 2014 MapR Technologies 33
Ownership Chaining
• Combine Self Service Exploration with Data Governance
Name City State Credit Card #
Dave San Jose CA 1374-7914-3865-4817
John Boulder CO 1374-9735-1794-9711
Raw File (/raw/cards.csv)
Name City State Credit Card #
Dave San Jose CA 1374-1111-1111-1111
John Boulder CO 1374-1111-1111-1111
Data Scientist (/views/V_Scientist)
Jane (Read)
John (Owner)
Name City State
Dave San Jose CA
John Boulder CO
Analyst(/views/V_Analyst)
Jack (Read)
Jane(Owner)
RAWFILEV_ScientistV_Analyst
Does Jack have access to V_Analyst? ->YES
Who is the owner of V_Analyst? ->Jane
Drill accesses V_Analyst as Jane (Impersonation hop 1)
Does Jane have access to V_Scientist ? -> YES
Who is the owner of V_Scientist? ->John
Drill accesses V_Scientist as John (Impersonation hop 2)
John(Owner)
Does John have permissions on raw file? -> YES
Who is the owner of raw file? ->John
Drill accesses source file as John (no impersonation here)
Jack queries the view V_Analyst
*Ownership chain length (# hops) is configurable
Ownership
chaining
Access
path
© 2014 MapR Technologies 34
But was that the right
question?
© 2014 MapR Technologies 35
Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
– compare to SqlContext
• Define Datasets as Drill data sources and sinks
– Drill runs at the same time as Spark
• Orchestrate transport of Spark data to/from Drill
• Cost of transport is remarkably small
© 2014 MapR Technologies 36
What does the Spark and Drill integration look like
Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines
Disk (DFS)
Memory
RDD
Files Files
© 2014 MapR Technologies 37
Is unification
valuable?
© 2014 MapR Technologies 38
Example of Unification
Callers
Universe
Towers
cdr data
© 2014 MapR Technologies 39
Simple Session Protocol
• Calls started at random
intervals
• During calls, reconnection
is done periodically
idle
connect
HELLO
FAIL
TIME
OUT
active
END
CONNECT
END
HELLO
start
SETUP
• Many log events are buffered
and sent to current tower during
active state
© 2014 MapR Technologies 40
The Resulting Data
• Signal strength reports
– Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end
• Note that data for one tower is often received by another due to
caller buffering to diagnostic data
*Location isn’t quite location … poetic license applied for
© 2014 MapR Technologies 41
What can we do with it?
© 2014 MapR Technologies 42
Baby Steps
• What does signal propagation look like?
select x, y, signal from cdr_stream where tower = 3
• Plot results to get a map of signal strength around a tower
© 2014 MapR Technologies 43
Baby Steps
• What does tower coverage look like?
select x, y from cdr_stream
where tower = 3 and event_type = ‘CONNECT’.
• Plot results to get a map of coverage area for a tower
© 2014 MapR Technologies 44
What about anomaly detection?
© 2014 MapR Technologies 45
Detecting Tower Loss
It’s important to know if traffic is stopped or delayed
because of a problem…
But events from towers come at irregular intervals
How long after the last event should you begin to worry?
© 2014 MapR Technologies 46
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
© 2014 MapR Technologies 47
Converting Event Times to Anomaly
99.9%-ile
99.99%-ile
© 2014 MapR Technologies 48
But in the real world, event
rates often change
© 2014 MapR Technologies 49
Time Intervals Are Key to Modeling Sporadic Events
0 1 2 3 4
02468
t (days)
dt(min)
© 2014 MapR Technologies 50
Time Intervals Are Key to Modeling Sporadic Events
0 1 2 3 4
02468
t (days)
dt(min)
© 2014 MapR Technologies 51
After Rate Correction
0 1 2 3 4
0246810
t (days)
dt/rate
99.9%−ile
99.99%−ile
© 2014 MapR Technologies 52
Detecting Anomalies in Sporadic Events
Incoming
events
99.97%-ile
Alarm
Δn
Rate
predictor
Rate
history
t-digest
δ> t
ti δ λ(ti- ti- n)
λ
t
© 2014 MapR Technologies 53
Propagation Anomalies
• What happens when something shadows part of the coverage
field?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)
© 2014 MapR Technologies 54
© 2014 MapR Technologies 55
© 2014 MapR Technologies 56
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.
© 2014 MapR Technologies 57
Other Issues
• Finding anomalies in coverage area is similar tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs
© 2014 MapR Technologies 58
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
© 2014 MapR Technologies 59
Coverage Areas
© 2014 MapR Technologies 60
Just One Tower
© 2014 MapR Technologies 61
Cluster Reports for That Tower
© 2014 MapR Technologies 62
Cluster Reports for That Tower
1
2 3
4
5
6
7
8
9
© 2014 MapR Technologies 63
General Dataflow
Group by tower,
filter data (SQL)
k-means cluster
(ML LIB)
Split data
(SQL)
Location model
(Java)
Mark cluster
(ML LIB)
Rate detection
per cluster
© 2014 MapR Technologies 64
Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
– But important distinctions remain
• Projects can work together to share key technology
– Apache Arrow … started as off-shoot of Drill, now has >12 major
projects as participants, including Spark
• Systems can work together even more deeply
– DrillContext makes integration first class
© 2014 MapR Technologies 65
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
© 2014 MapR Technologies 66
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
© 2014 MapR Technologies 67
Thank you for coming today!
© 2014 MapR Technologies 68
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course
discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache
Hadoop and Spark professionals
community.mapr.com

Contenu connexe

Tendances

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsDataWorks Summit
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponCloudera, Inc.
 

Tendances (20)

Spark sql
Spark sqlSpark sql
Spark sql
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Apache hive
Apache hiveApache hive
Apache hive
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 

Similaire à Spark SQL versus Apache Drill: Different Tools with Different Rules

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drilltshiran
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Connor McDonald
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Grayharryvanhaaren
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksMapR Technologies
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformDataStax Academy
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersJulien Anguenot
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down InternetMapR Technologies
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 

Similaire à Spark SQL versus Apache Drill: Different Tools with Different Rules (20)

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 

Plus de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Plus de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Dernier

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Dernier (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Spark SQL versus Apache Drill: Different Tools with Different Rules

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies
  • 2. © 2014 MapR Technologies 2 Contact Information Ted Dunning Chief Applications Architect at MapR Technologies Committer & PMC for Apache’s Drill, Zookeeper & others VP of Incubator at Apache Foundation Email tdunning@apache.org tdunning@maprtech.com Twitter @ted_dunning
  • 3. © 2014 MapR Technologies 5 What is Drill?
  • 4. © 2014 MapR Technologies 6 A Query engine that has… • Columnar/Vectorized • Optimistic/pipelined • Runtime compilation • Late binding • Extensible
  • 5. © 2014 MapR Technologies 7 Table Can Be an Entire Directory Tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Janpart0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel;
  • 6. © 2014 MapR Technologies 8 Basic Process Zookeepe r DFS/HBase DFS/HBase DFS/HBase Drillbit Distributed Cache Drillbit Distributed Cache Drillbit Distributed Cache Query 1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf) 2. Drillbit generates execution plan based on query optimization & locality 3. Fragments are farmed to individual nodes 4. Result is returned to driving node c c c
  • 7. © 2014 MapR Technologies 9 Stages of Query Planning Parser Logical Planner Physical Planner Query Foreman Plan fragments sent to drill bits SQL Query Heuristic and cost based Cost based
  • 8. © 2014 MapR Technologies 10 Query Execution SQL Parser Optimizer Scheduler Pig Parser PhysicalPlan Mongo Cassandra HiveQL Parser RPC Endpoint Distributed Cache StorageInterface OperatorsOperators Foreman LogicalPlan HDFS HBase JDBC Endpoint ODBC Endpoint
  • 9. © 2014 MapR Technologies 11 Batches of Values • Value vectors – List of values, with same schema – With the 4-value semantics for each value • Shipped around in batches – max 256k bytes in a batch – max 64K rows in a batch • RPC designed for multiple replies to a request
  • 10. © 2014 MapR Technologies 12 Fixed Value Vectors
  • 11. © 2014 MapR Technologies 13 Vectorization • Drill operates on more than one record at a time – Word-sized manipulations – SIMD instructions • GCC, LLVM and JVM all do various optimizations automatically – Manually code algorithms • Logical Vectorization – Bitmaps allow lightning fast null-checks – Avoid branching to speed CPU pipeline
  • 12. © 2014 MapR Technologies 14 Runtime Compilation is Faster • JIT is smart, but more gains with runtime compilation • Janino: Java-based Java compiler From http://bit.ly/16Xk32x
  • 13. © 2014 MapR Technologies 15 Drill compiler Loaded class Merge byte- code of the two classes Janino compiles runtime byte-code CodeModel generates code Precompiled byte-code templates
  • 14. © 2014 MapR Technologies 16 Optimistic 0 20 40 60 80 100 120 140 160 cmd pipeline small db med db large db dw compilation hadoop Speed vs. check-pointing No need to checkpoint Checkpoint frequentlyApache Drill
  • 15. © 2014 MapR Technologies 17 Optimistic Execution • Recovery code trivial – Running instances discard the failed query’s intermediate state • Pipelining possible – Send results as soon as batch is large enough – Requires barrier-less decomposition of query
  • 16. © 2014 MapR Technologies 18 Pipelining • Record batches are pipelined between nodes – ~256kB usually • Unit of work for Drill – Operators works on a batch • Operator reconfiguration happens at batch boundaries DrillBit DrillBit DrillBit
  • 17. © 2014 MapR Technologies 19 Pipelining • Random access: sort without copy or restructuring • Avoids serialization/deserialization • Off-heap (no GC woes when lots of memory) • Read/write to disk – when data larger than memory Drill Bit Memory overflow uses disk Disk
  • 18. © 2014 MapR Technologies 20 Cost-based Optimization • Using Optiq, an extensible framework – Pluggable rules, and cost model • Rules for distributed plan generation – Insert Exchange operator into physical plan – Optiq enhanced to explore parallel query plans • Pluggable cost model – CPU, IO, memory, network cost (data locality) – Storage engine features (HDFS vs HIVE vs HBase) Query Optimizer Pluggable rules Pluggable cost model
  • 19. © 2014 MapR Technologies 21 What is SparkSQL?
  • 20. © 2014 MapR Technologies 22 What is Spark SQL • Essentially syntactic sugar over a limited subset of Spark • Inherits all the virtues (and vices) of Spark – Lambdas can serve as UDFs (has subtle issues for performance) • Inputs have to be loaded – Perhaps lazily, not obvious when load actually happens • Not designed as a streaming engine, requires more memory • Some JSON support, but not so much for large or variable objects • Embedded in a real language!
  • 21. © 2014 MapR Technologies 23 In More Detail • A Spark program consists of a computation graph that consumes and produces so-called resilient data datasets • SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDD’s) • Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly
  • 22. © 2014 MapR Technologies 24 Many Similarities SQL Parser Optimizer Java PhysicalPlan Scala LogicalPlan Python group filter filter
  • 23. © 2014 MapR Technologies 25 Important Differences • Spark execution assumes RDD’s are complete representation, not a stream of row batches • Input sources don’t inject optimization rules, nor expose detailed cost models • Most RDD’s don’t have a zero-copy capability • Spark inherits JVM memory model, very limited use of off-heap
  • 24. © 2014 MapR Technologies 26 scala> sqlContext.sql("select * from json.`foo.json`").show +---+------+----+ | a| b| c| +---+------+----+ | 3|[3, 2]| xyz| | 7| null| wxy| | 7| []|null| +---+------+----+
  • 25. © 2014 MapR Technologies 27 scala> sqlContext.sql( "select a, explode(b) b_v from json.`bug.json`" ).show +---+---------+ | a| b_v| +---+---------+ | 3| 3| | 3| 2| +---+---------+
  • 26. © 2014 MapR Technologies 28 First Synthesis • Drill has a more nuanced optimizer, better code generation – This often leads to ~2x speed advantage • Drill has ValueVector and row batches – This leads to much less memory pressure • Drill has much stricter memory life-cycle – Query and done and gone, no need for big GC’s even on big memory • Drill is all about SQL execution
  • 27. © 2014 MapR Technologies 29 But … • Spark can optimize across entire program – This often leads to ~2x speed advantage • Spark has much more flexible memory structures – This can lead to much less memory pressure • Spark has much more flexible RDD life-cycle – RDD’s can be cached, persisted or simply recomputed as necessary • Spark is not all about SQL execution
  • 28. © 2014 MapR Technologies 30 The Really Big Differences • Drill focuses heavily on secure, multi-tenant access to data – Strong impersonation semantics – Cascading rights via views – Queries co-exist in a cluster and reserve only their momentary resource requirements • Spark focuses heavily on fully integrated execution models – Any spark function works with (almost) any RDD’s – Memory residency of RDD’s is the highest goal
  • 29. © 2014 MapR Technologies 31 Drill security ➢ End to end security from BI tools to Hadoop ➢ Standard based PAM Authentication ➢ 2 level user Impersonation ➢ Fine-grained row and column level access control with Drill Views – no centralized security repository required
  • 30. © 2014 MapR Technologies 32 Granular security permissions through Drill views Name City State Credit Card # Dave San Jose CA 1374-7914-3865-4817 John Boulder CO 1374-9735-1794-9711 Raw File (/raw/cards.csv) Owner Admins Permission Admins Business Analyst Data Scientist Name City State Credit Card # Dave San Jose CA 1374-1111-1111-1111 John Boulder CO 1374-1111-1111-1111 Data Scientist View (/views/maskedcards.view.drill) Not a physical data copy Name City State Dave San Jose CA John Boulder CO Business Analyst View Owner Admins Permission Business Analysts Owner Admins Permission Data Scientists
  • 31. © 2014 MapR Technologies 33 Ownership Chaining • Combine Self Service Exploration with Data Governance Name City State Credit Card # Dave San Jose CA 1374-7914-3865-4817 John Boulder CO 1374-9735-1794-9711 Raw File (/raw/cards.csv) Name City State Credit Card # Dave San Jose CA 1374-1111-1111-1111 John Boulder CO 1374-1111-1111-1111 Data Scientist (/views/V_Scientist) Jane (Read) John (Owner) Name City State Dave San Jose CA John Boulder CO Analyst(/views/V_Analyst) Jack (Read) Jane(Owner) RAWFILEV_ScientistV_Analyst Does Jack have access to V_Analyst? ->YES Who is the owner of V_Analyst? ->Jane Drill accesses V_Analyst as Jane (Impersonation hop 1) Does Jane have access to V_Scientist ? -> YES Who is the owner of V_Scientist? ->John Drill accesses V_Scientist as John (Impersonation hop 2) John(Owner) Does John have permissions on raw file? -> YES Who is the owner of raw file? ->John Drill accesses source file as John (no impersonation here) Jack queries the view V_Analyst *Ownership chain length (# hops) is configurable Ownership chaining Access path
  • 32. © 2014 MapR Technologies 34 But was that the right question?
  • 33. © 2014 MapR Technologies 35 Unification is Feasible • It is relatively easy to build a DrillContext in Spark – compare to SqlContext • Define Datasets as Drill data sources and sinks – Drill runs at the same time as Spark • Orchestrate transport of Spark data to/from Drill • Cost of transport is remarkably small
  • 34. © 2014 MapR Technologies 36 What does the Spark and Drill integration look like Features at a glance: • Use Drill as an input to Spark • Query Spark RDDs via Drill and create data pipelines Disk (DFS) Memory RDD Files Files
  • 35. © 2014 MapR Technologies 37 Is unification valuable?
  • 36. © 2014 MapR Technologies 38 Example of Unification Callers Universe Towers cdr data
  • 37. © 2014 MapR Technologies 39 Simple Session Protocol • Calls started at random intervals • During calls, reconnection is done periodically idle connect HELLO FAIL TIME OUT active END CONNECT END HELLO start SETUP • Many log events are buffered and sent to current tower during active state
  • 38. © 2014 MapR Technologies 40 The Resulting Data • Signal strength reports – Tower, timestamp, rank, caller, caller location*, signal strength • Tower log events: HELLO, FAIL, CONNECT, END • Call end • Note that data for one tower is often received by another due to caller buffering to diagnostic data *Location isn’t quite location … poetic license applied for
  • 39. © 2014 MapR Technologies 41 What can we do with it?
  • 40. © 2014 MapR Technologies 42 Baby Steps • What does signal propagation look like? select x, y, signal from cdr_stream where tower = 3 • Plot results to get a map of signal strength around a tower
  • 41. © 2014 MapR Technologies 43 Baby Steps • What does tower coverage look like? select x, y from cdr_stream where tower = 3 and event_type = ‘CONNECT’. • Plot results to get a map of coverage area for a tower
  • 42. © 2014 MapR Technologies 44 What about anomaly detection?
  • 43. © 2014 MapR Technologies 45 Detecting Tower Loss It’s important to know if traffic is stopped or delayed because of a problem… But events from towers come at irregular intervals How long after the last event should you begin to worry?
  • 44. © 2014 MapR Technologies 46 Event Stream (timing) • Events of various types arrive at irregular intervals – we can assume Poisson distribution • The key question is whether frequency has changed relative to expected values – This shows up as a change in interval • Want alert as soon as possible
  • 45. © 2014 MapR Technologies 47 Converting Event Times to Anomaly 99.9%-ile 99.99%-ile
  • 46. © 2014 MapR Technologies 48 But in the real world, event rates often change
  • 47. © 2014 MapR Technologies 49 Time Intervals Are Key to Modeling Sporadic Events 0 1 2 3 4 02468 t (days) dt(min)
  • 48. © 2014 MapR Technologies 50 Time Intervals Are Key to Modeling Sporadic Events 0 1 2 3 4 02468 t (days) dt(min)
  • 49. © 2014 MapR Technologies 51 After Rate Correction 0 1 2 3 4 0246810 t (days) dt/rate 99.9%−ile 99.99%−ile
  • 50. © 2014 MapR Technologies 52 Detecting Anomalies in Sporadic Events Incoming events 99.97%-ile Alarm Δn Rate predictor Rate history t-digest δ> t ti δ λ(ti- ti- n) λ t
  • 51. © 2014 MapR Technologies 53 Propagation Anomalies • What happens when something shadows part of the coverage field? – Can happen in urban areas with a construction crane • Can solve heuristically – Subtract from reference image composed by long term averages – Doesn’t deal well with weak signal regions and low S/N • Can solve probabilistically – Compute anomaly for each measurement, use mean of log(p)
  • 52. © 2014 MapR Technologies 54
  • 53. © 2014 MapR Technologies 55
  • 54. © 2014 MapR Technologies 56 Variable Signal/Noise Makes Heuristic Tricky Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
  • 55. © 2014 MapR Technologies 57 Other Issues • Finding anomalies in coverage area is similar tricky • Coverage area is roughly where tower signal strength is higher than neighbors • Except for fuzziness due to hand-off delays • Except for bias due to large-scale caller motions – Rush hour – Event mobs
  • 56. © 2014 MapR Technologies 58 Simple Answer for Propagation Anomalies • Cluster signal strength reports • Cluster locations using k-means, large k • Model report rate anomaly using discrete event models • Model signal strength anomaly using percentile model • Trade larger k against higher report rates, faster detection • Overall anomaly is sum of individual log(p) anomalies
  • 57. © 2014 MapR Technologies 59 Coverage Areas
  • 58. © 2014 MapR Technologies 60 Just One Tower
  • 59. © 2014 MapR Technologies 61 Cluster Reports for That Tower
  • 60. © 2014 MapR Technologies 62 Cluster Reports for That Tower 1 2 3 4 5 6 7 8 9
  • 61. © 2014 MapR Technologies 63 General Dataflow Group by tower, filter data (SQL) k-means cluster (ML LIB) Split data (SQL) Location model (Java) Mark cluster (ML LIB) Rate detection per cluster
  • 62. © 2014 MapR Technologies 64 Summary • Drill and Spark provide healthy competition in Apache • Over time, they have converged in many respects – But important distinctions remain • Projects can work together to share key technology – Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark • Systems can work together even more deeply – DrillContext makes integration first class
  • 63. © 2014 MapR Technologies 65 e-book available courtesy of MapR http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
  • 64. © 2014 MapR Technologies 66 Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams
  • 65. © 2014 MapR Technologies 67 Thank you for coming today!
  • 66. © 2014 MapR Technologies 68 …helping you put data technology to work ● Find answers ● Ask technical questions ● Join on-demand training course discussions ● Follow release announcements ● Share and vote on product ideas ● Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com

Notes de l'éditeur

  1. ELLEN: set up
  2. Talk track: This is what it looks like to have events such as those on website that come in at randomized times (people come when they want to) but the underlying average rate in this case is constant, in other words, a fairly steady stream of traffic. This looks at lot like the first signal we talked about: a randomized but even signal… We can use t-digest on it to set thresholds, everything works just grand. (Like radio activity Geiger counter clicks)
  3. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events.The vertical axis is the time interval between events. Notice that as the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if set low in day, we have an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model, multiple the modelled rate x the interval, we get a number we can threshold accurately.
  4. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events.The vertical axis is the time interval between events. Notice that as the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if set low in day, we have an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model, multiple the modelled rate x the interval, we get a number we can threshold accurately.
  5. Talk track: You need a rate predictor Ellen: sometimes simple is good enough