Hug france-2012-12-04, Ted Dunning
2. My Background
Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– Big data since before big
Open source
– since the dark ages before the internet
– Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR
Founding member of Apache Drill
©MapR Technologies - Confidential 2
3. MapR Technologies
The open enterprise-grade distribution for Hadoop
– Easy, dependable and fast
– Open source with standards-based extensions
MapR is deployed at thousands of companies
– From small Internet startups to the world’s largest enterprises
MapR customers analyze massive amounts of data:
– Hundreds of billions of events daily
– 90% of the world’s Internet population monthly
– $1 trillion in retail purchases annually
MapR has partnered with Google to provide Hadoop on Google Compute Engine
4. Agenda
What?
– What exactly does Drill do?
Why?
– Why do we need Apache Drill?
Who?
– Who is doing this?
How?
– How does Drill work inside?
Conclusion
– How can you help?
– Where can you find out more?
5. Apache Drill Overview
Drill overview
– Low latency interactive queries
– Standard ANSI SQL support
Open-Source
– Hundreds involved across US and Europe
– Community consensus on API, functionality
PMC expects first version late this quarter
– Several components already developed
6. Big Data Processing – Hadoop
                     Batch processing
Query runtime        Minutes to hours
Data volume          TBs to PBs
Programming model    MapReduce
Users                Developers
Google project       MapReduce
Open source project  Hadoop MapReduce
7. Big Data Processing – Hadoop and Storm
                     Batch processing    Stream processing
Query runtime        Minutes to hours    Never-ending
Data volume          TBs to PBs          Continuous stream
Programming model    MapReduce           DAG (pre-programmed)
Users                Developers          Developers
Google project       MapReduce
Open source project  Hadoop MapReduce    Storm or Apache S4
8. Big Data Processing – The missing part
                     Batch processing    Interactive analysis    Stream processing
Query runtime        Minutes to hours                            Never-ending
Data volume          TBs to PBs                                  Continuous stream
Programming model    MapReduce                                   DAG (pre-programmed)
Users                Developers                                  Developers
Google project       MapReduce
Open source project  Hadoop MapReduce                            Storm and S4
9. Big Data Processing – The missing part
                     Batch processing    Interactive analysis       Stream processing
Query runtime        Minutes to hours    Milliseconds to minutes    Never-ending
Data volume          TBs to PBs          GBs to PBs                 Continuous stream
Programming model    MapReduce           Queries (ad hoc)           DAG (pre-programmed)
Users                Developers          Analysts and developers    Developers
Google project       MapReduce
Open source project  Hadoop MapReduce                               Storm and S4
10. Big Data Processing
                     Batch processing    Interactive analysis       Stream processing
Query runtime        Minutes to hours    Milliseconds to minutes    Never-ending
Data volume          TBs to PBs          GBs to PBs                 Continuous stream
Programming model    MapReduce           Queries                    DAG
Users                Developers          Analysts and developers    Developers
Google project       MapReduce           Dremel
Open source project  Hadoop MapReduce                               Storm and S4
11. Big Data Processing
                     Batch processing    Interactive analysis       Stream processing
Query runtime        Minutes to hours    Milliseconds to minutes    Never-ending
Data volume          TBs to PBs          GBs to PBs                 Continuous stream
Programming model    MapReduce           Queries                    DAG
Users                Developers          Analysts and developers    Developers
Google project       MapReduce           Dremel
Open source project  Hadoop MapReduce                               Storm and S4
Introducing Apache Drill
12. Latency Matters
Ad-hoc analysis with interactive tools
Real-time dashboards
Event/trend detection and analysis
– Network intrusions
– Fraud
– Failures
13. Nested Query Languages
DrQL
– SQL-like query language for nested data
– Compatible with Google BigQuery/Dremel
• BigQuery applications should work with Drill
– Designed to support efficient column-based processing
• No record assembly during query processing
Mongo Query Language
– {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
Other languages/programming models can plug in
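As a rough illustration of what the Mongo-style query on this slide asks for, the sketch below applies it to an in-memory list of Python dicts. This is neither Drill nor MongoDB code: `run_query` and the sample rows are hypothetical, and only equality matching plus `$orderby` sorting are handled.

```python
def run_query(rows, query_doc):
    """Apply a tiny subset of the Mongo query form to a list of dicts."""
    criteria = query_doc.get("$query", {})
    order = query_doc.get("$orderby", {})
    # Keep rows whose fields equal every value named in $query.
    hits = [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]
    # Sort by the last key first; list.sort is stable, so the first key dominates.
    for field, direction in reversed(list(order.items())):
        hits.sort(key=lambda r: r[field], reverse=(direction < 0))
    return hits


# Hypothetical sample data, not from the talk.
rows = [
    {"x": 3, "y": "abc", "z": 1},
    {"x": 3, "y": "def", "z": 2},
    {"x": 4, "y": "abc", "z": 3},
]
result = run_query(rows, {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}})
```

Only the first sample row satisfies both equality conditions, so it is the only row returned.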
14. Nested Data Model
The data model in Dremel is Protocol Buffers
– Nested
– Schema
Apache Drill is designed to support multiple data models
– Schema: Protocol Buffers, Apache Avro, …
– Schema-less: JSON, BSON, …
Flat records are supported as a special case of nested data
– CSV, TSV, …
Avro IDL
  enum Gender {
    MALE, FEMALE
  }

  record User {
    string name;
    Gender gender;
    long followers;
  }
JSON
  {
    "name": "Srivas",
    "gender": "Male",
    "followers": 100
  }
  {
    "name": "Raina",
    "gender": "Female",
    "followers": 200,
    "zip": "94305"
  }
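One way to see why flat records are a special case of nested data is to reduce every record to a list of (path, value) pairs. The sketch below does that for plain Python dicts; `leaf_paths` is a hypothetical helper, the `links.forward` record is made up for illustration, and none of this is Drill code.

```python
import json


def leaf_paths(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # Repeated field: every element shares the same path.
        for child in value:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, value


# Flat record taken from the slide's JSON example.
flat = json.loads(
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'
)
# Hypothetical nested record with a repeated field.
nested = {"name": "Srivas", "links": {"forward": [20, 40]}}

flat_paths = list(leaf_paths(flat))
nested_paths = list(leaf_paths(nested))
```

For the flat record every path is just a top-level field name, while the nested record produces the dotted path `links.forward` twice, once per repeated value.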
15. Extensibility
Nested query languages
– Pluggable model
– DrQL
– Mongo Query Language
– Cascading
Distributed execution engine
– Extensible model (e.g., Dryad)
– Low-latency
– Fault tolerant
Nested data formats
– Pluggable model
– Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
– Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
Scalable data sources
– Pluggable model
– Hadoop
– HBase
16. Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
18. Architecture
Only the execution engine knows the physical attributes of the cluster
– # nodes, hardware, file locations, …
Public interfaces enable extensibility
– Developers can build parsers for new query languages
– Developers can provide an execution plan directly
Each level of the plan has a human readable representation
– Facilitates debugging and unit testing
19. Execution Engine Layers
Drill execution engine has two layers
– Operator layer is serialization-aware
• Processes individual records
– Execution layer is not serialization-aware
• Processes batches of records (blobs)
• Responsible for communication, dependencies and fault tolerance
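The split between the two layers can be sketched in a few lines. In this toy version, which assumes JSON as the serialization format purely for illustration, the operator decodes a batch and looks at individual records, while the execution layer only passes opaque byte blobs around. `filter_operator` and `run_stage` are hypothetical names, and the communication, dependency tracking, and fault tolerance the slide assigns to the execution layer are not modeled.

```python
import json


def filter_operator(blob, predicate):
    """Operator layer: decodes a batch, works record by record, re-encodes."""
    records = json.loads(blob.decode("utf-8"))
    kept = [r for r in records if predicate(r)]
    return json.dumps(kept).encode("utf-8")


def run_stage(blobs, operator):
    """Execution layer: hands opaque byte batches to an operator, never decodes them."""
    return [operator(blob) for blob in blobs]


# Hypothetical input batches.
batches = [
    json.dumps([{"id": 1}, {"id": 25}]).encode("utf-8"),
    json.dumps([{"id": 7}]).encode("utf-8"),
]
out = run_stage(batches, lambda b: filter_operator(b, lambda r: r["id"] < 20))
decoded = [json.loads(b.decode("utf-8")) for b in out]
```

Because `run_stage` never inspects the bytes, the same function would work unchanged with any other batch encoding the operator understands.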
20. DrQL Example
SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http')
  AND DocId < 20;
* Example from the Dremel paper
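To make the `WITHIN` clause more concrete, here is a minimal sketch of how the query above could be read against a single nested document held as a Python dict. It is an interpretation for illustration only, not DrQL's implementation: `run_example` and the sample document are hypothetical, and the output layout is an assumption.

```python
import re


def run_example(doc):
    """Evaluate the slide's query against one nested document (a plain dict)."""
    if not doc["DocId"] < 20:
        return None
    names = []
    for name in doc.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): Names without a matching Url drop out.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        names.append({
            "Cnt": len(codes),  # COUNT(Name.Language.Code) WITHIN Name
            "Str": [f"{url},{code}" for code in codes],
        })
    return {"Id": doc["DocId"], "Name": names}


# Hypothetical sample record.
doc = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A", "Language": [{"Code": "en-us"}, {"Code": "en"}]},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb"}]},
    ],
}
result = run_example(doc)
```

The count is computed once per `Name` entry rather than once per document, which is the point of `WITHIN Name`; the third `Name` has no `Url` and is filtered out.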
21. Query Components
Query components:
– SELECT
– FROM
– WHERE
– GROUP BY
– HAVING
– (JOIN)
Key logical operators:
– Scan
– Filter
– Aggregate
– (Join)
22. Logical Plan
scan-json "table-1"
filter exp1
flatten
aggregate exp2
23. Logical Plan Syntax
{op: "sequence",
 do: [
   {op: "scan",
    source: "table-1.json",
    selection: "*"
   },
   {op: "filter",
    expr: <expr>
   },
   {op: "flatten",
    expr: <expr>,
    drop: "false"
   },
   {op: "aggregate",
    type: repeat,
    keys: [<name>,...],
    aggregations: [
      {ref: <name>, expr: <aggexpr> },...
    ]
   }
 ]
}
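A sequence plan like the one above can be read as a pipeline in which each operator consumes the previous operator's output. The sketch below interprets such a plan over in-memory Python data. It is a toy under several assumptions: expressions are Python callables or field names instead of the `<expr>` syntax, `scan` reads from a dict of tables, `aggregate` only counts rows per key, and `run_sequence` is a hypothetical name.

```python
from collections import Counter


def run_sequence(plan, tables):
    """Run each operator in plan["do"] on the output of the previous one."""
    records = []
    for step in plan["do"]:
        op = step["op"]
        if op == "scan":
            records = list(tables[step["source"]])
        elif op == "filter":
            records = [r for r in records if step["expr"](r)]
        elif op == "flatten":
            # One output record per element of the repeated field.
            field = step["expr"]
            records = [{**r, field: item} for r in records for item in r[field]]
        elif op == "aggregate":
            # Only a row count per key is sketched here.
            counts = Counter(tuple(r[k] for k in step["keys"]) for r in records)
            records = [dict(zip(step["keys"], key), count=n) for key, n in counts.items()]
        else:
            raise ValueError(f"unknown operator: {op}")
    return records


# Hypothetical table contents.
tables = {"table-1.json": [
    {"user": "a", "tags": ["x", "y"], "ok": True},
    {"user": "b", "tags": ["x"], "ok": True},
    {"user": "c", "tags": ["y"], "ok": False},
]}
plan = {"op": "sequence", "do": [
    {"op": "scan", "source": "table-1.json", "selection": "*"},
    {"op": "filter", "expr": lambda r: r["ok"]},
    {"op": "flatten", "expr": "tags"},
    {"op": "aggregate", "keys": ["tags"]},
]}
result = run_sequence(plan, tables)
```

The plan stays a plain data structure, which matches the slide deck's point that each level of the plan has a human-readable representation that is easy to debug and unit test.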
24. Representing a DAG
[Diagram: node 18 feeds node 19, "aggregate exp2"]
{ @id: 19, op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>,...],
  aggregations: [
    {ref: <name>, expr: <aggexpr> },...
  ]
}
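Once operators carry an `@id` and name their `input`, the plan is a graph rather than a list, and an engine has to work out an order in which inputs are produced before they are consumed. A minimal sketch of that ordering follows. `execution_order` is a hypothetical helper, nodes 17 and 18 are invented upstream operators, and cycle detection is omitted.

```python
def execution_order(operators):
    """Return operator ids so that every input runs before its consumer."""
    by_id = {op["@id"]: op for op in operators}
    ordered, seen = [], set()

    def visit(op_id):
        if op_id in seen:
            return
        seen.add(op_id)
        inputs = by_id[op_id].get("input", [])
        if isinstance(inputs, int):
            inputs = [inputs]  # a single input is written as a bare id
        for dep in inputs:
            visit(dep)
        ordered.append(op_id)

    for op_id in by_id:
        visit(op_id)
    return ordered


operators = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "filter", "input": 17},  # hypothetical upstream node
    {"@id": 17, "op": "scan"},                 # hypothetical upstream node
]
order = execution_order(operators)
```

The depth-first walk appends a node only after all of its inputs, so the declaration order of the operators does not matter.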
25. Multiple Inputs
[Diagram: inputs 23 and 24, each keyed on id, feed cogroup operator 25]
{ @id: 25, op: "cogroup",
  groupings: [
    {ref: 23, expr: "id"}, {ref: 24, expr: "id"}
  ]
}
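For readers unfamiliar with cogroup, the sketch below shows the usual meaning of the operation on two in-memory inputs: rows from both sides are collected under their shared key without being joined. The function name and the sample rows are hypothetical, and Drill's actual operator may differ.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group two inputs by key; each key maps to (left_rows, right_rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return dict(groups)


# Hypothetical rows standing in for the outputs of operators 23 and 24.
input_23 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
input_24 = [{"id": 1, "b": "p"}, {"id": 1, "b": "q"}, {"id": 3, "b": "r"}]
grouped = cogroup(input_23, input_24, "id", "id")
```

Keys present on only one side still appear, paired with an empty list for the other input, so inner and outer joins can both be derived from the grouped result.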
26. Scan Operators
• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported
• Fields and predicates can be pushed down into the scan operator
• Scan operators may have adaptive side-effects (database cracking)
• Produce ColumnIO from RecordIO
• Google PowerDrill stores materialized expressions with the data
                        Scan with schema                          Scan without schema
Operator output         Protocol Buffers                          JSON-like (MessagePack)
Supported data formats  ColumnIO (column-based protobuf/Dremel)   JSON
                        RecordIO (row-based protobuf)             HBase
                        CSV
SELECT … FROM …         ColumnIO(proto URI, data URI)             Json(data URI)
                        RecordIO(proto URI, data URI)             HBase(table name)
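Pushing fields and predicates into the scan means the scan itself drops unwanted rows and columns, so later operators never see them. The sketch below shows the idea for CSV text using Python's standard `csv` module. `scan_csv` is a hypothetical per-format scan function, not a Drill API, and the sample data reuses the names from the JSON example earlier in the deck.

```python
import csv
import io


def scan_csv(text, fields=None, predicate=None):
    """Scan CSV text, applying the pushed-down predicate and field list per row."""
    for row in csv.DictReader(io.StringIO(text)):
        if predicate is not None and not predicate(row):
            continue  # filtered inside the scan, never reaches later operators
        yield {f: row[f] for f in fields} if fields else row


data = "name,gender,followers\nSrivas,Male,100\nRaina,Female,200\n"
rows = list(
    scan_csv(data, fields=["name"], predicate=lambda r: int(r["followers"]) > 150)
)
```

A scan for another format would expose the same two parameters, which is what lets a planner push work down without knowing the format's details.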
28. Hadoop Integration
Hadoop data sources
– Hadoop FileSystem API (HDFS/MapR-FS)
– HBase
Hadoop data formats
– Apache Avro
– RCFile
MapReduce-based tools to create column-based formats
Table registry in HCatalog
Run long-running services in YARN
29. Get Involved!
Download these slides
– http://www.mapr.com/company/events/hug-france-12-04-2012
Join the project
– drill-dev-subscribe@incubator.apache.org
– #apachedrill
Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– ted.dunning@maprtech.com
– @ted_dunning
Join MapR
– jobs@mapr.com
Editor's Notes
– No graphic changes….
– Note for bullet changes: Open Source -- Community consensus, API
– Available for all distributions -- likely to support these
– Could add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon query
– Pig as well
– Pluggability: data format, query language
– Something 6-9 months alpha quality
– Community driven, I can’t speak for project
– MapRFS gives better chunk size control
– NFS support may make small test drivers easier
– Unified namespace will allow multi-cluster access
– Might even have Drill component that autoformats data
– Read-only model
– Protocol Buffers are conceptual data model
– Will support multiple data models
– Will have to define a way to explain data format (filtering, fields, etc)
– Schema-less will have perf penalty
– HBase will be one format
– Note: we have an already partially built execution engine
– Example query that Drill should support
– Need to talk more here about what Dremel does
– Be prepared for Apache questions
– Committer vs committee vs contributor
– If can’t answer question, ask them to answer and contribute
– Lisa - Need landing page
– References to paper and such at end