1. Beyond Batch
Drill & Storm
Brad Anderson
©MapR Technologies
2. whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• 'boorad' most places (Twitter, GitHub)
• banderson@maprtech.com
3. MapR - Faster and More Scalable
Benchmark                          MapR 2.1.1       CDH 4.1.1        MapR speed increase
Terasort (1x replication, compression disabled)
  Total                            13m 35s          26m 6s           1.9x
  Map                              7m 58s           21m 8s           2.7x
  Reduce                           13m 32s          23m 37s          1.7x
DFSIO throughput/node
  Read                             1003 MB/s        656 MB/s         1.5x
  Write                            924 MB/s         654 MB/s         1.4x
YCSB (50% read, 50% update)
  Throughput                       36,584.4 op/s    12,500.5 op/s    2.9x
  Runtime                          3.80 hr          11.11 hr         2.9x
YCSB (95% read, 5% update)
  Throughput                       24,704.3 op/s    10,776.4 op/s    2.3x
  Runtime                          0.56 hr          1.29 hr          2.3x
Benchmark hardware configuration:
10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE

            MapR/Google    Apache Hadoop
Time        54 sec         62 sec
Nodes       1,003          1,460
Disks       1,003          5,840
Cores       4,012          11,680
4. Beyond Batch
HBase & M7
Apache Drill
Storm
Solr & Elasticsearch
6. Big Data Picture
                      Batch processing     Interactive analysis      Stream processing
Query runtime         Minutes to hours     Milliseconds to minutes   Never-ending
Data volume           TBs to PBs           GBs to PBs                Continuous stream
Programming model     MapReduce            Queries                   DAG
Users                 Developers           Analysts and Developers   Developers
Google project        MapReduce            Dremel
Open source project   Hadoop MapReduce     Apache Drill              Storm, S4
7. Interactive SQL Initiatives for Hadoop
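[Figure: landscape of SQL-on-Hadoop projects; project logos not recoverable. Category labels shown:]
• SQL-based OLTP
• SQL-based analytics
• Real-time interactive queries (shown twice; one entry is Impala*)
• SQL conversion to MapReduce
* Does not work with other distributions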
9. Google Dremel
• Interactive analysis of large-scale datasets
• Trillion records at interactive speeds
• Complementary to MapReduce
• Used by thousands of Google employees
• Paper published at VLDB 2010
• Model
• Nested data model with schema
• Most data at Google is stored/transferred in Protocol Buffers
• SQL-like query language with nested data support
• Implementation
• Column-based storage and processing
• In-situ data access (GFS and Bigtable)
• Tree architecture as in Web search (and databases)
10. Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
• Files must be in CSV format
• Nested data not supported [yet] except built-in datasets
• Schema definition required
11. Drill Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Columns and Rows
  • Schema and Schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
Dependable
• No SPOF
• Instant recovery from crashes
12. DrQL Example
Input record:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

Query:
SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Output:
Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
  Language
    Str: 'http://A,en'
Name
  Cnt: 0

* Example from the Dremel paper
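To make the query's semantics concrete, here is a minimal Java sketch that evaluates it by hand against the sample record. It is not Drill or Dremel code: the classes Document, Name and Language and the method evaluate are hypothetical names invented for illustration. It shows how the WHERE clause prunes the Name without a Url, and how COUNT ... WITHIN Name yields one count per surviving Name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical in-memory model of the sample record; not Drill code.
class DrqlSketch {
    static final class Language {
        final String code;
        final String country; // null when the optional field is absent
        Language(String code, String country) {
            this.code = code;
            this.country = country;
        }
    }

    static final class Name {
        final String url; // null when the optional field is absent
        final List<Language> languages;
        Name(String url, Language... languages) {
            this.url = url;
            this.languages = Arrays.asList(languages);
        }
    }

    static final class Document {
        final long docId;
        final List<Name> names;
        Document(long docId, Name... names) {
            this.docId = docId;
            this.names = Arrays.asList(names);
        }
    }

    private static final Pattern HTTP_PREFIX = Pattern.compile("^http");

    /** The sample record from the slide (Links omitted; the query never reads it). */
    static Document sample() {
        return new Document(10,
            new Name("http://A", new Language("en-us", "us"), new Language("en", null)),
            new Name("http://B"),
            new Name(null, new Language("en-gb", "gb")));
    }

    /** Evaluates the query by hand and renders the result as indented text lines. */
    static List<String> evaluate(Document d) {
        List<String> out = new ArrayList<>();
        if (d.docId >= 20) {
            return out; // WHERE DocId < 20 prunes the whole record
        }
        out.add("Id: " + d.docId);
        for (Name n : d.names) {
            // WHERE REGEXP(Name.Url, '^http'): a missing Url never matches
            if (n.url == null || !HTTP_PREFIX.matcher(n.url).find()) {
                continue;
            }
            // COUNT(Name.Language.Code) WITHIN Name: one count per Name
            int cnt = 0;
            for (Language l : n.languages) {
                if (l.code != null) cnt++;
            }
            out.add("Name");
            out.add("  Cnt: " + cnt);
            for (Language l : n.languages) {
                out.add("  Language");
                out.add("    Str: '" + n.url + "," + l.code + "'");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : evaluate(sample())) {
            System.out.println(line);
        }
    }
}
```

The third Name is dropped because it has no Url, which is why only two Name groups appear in the output.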
14. Extensibility
• Nested query languages
• DrQL
• Mongo Query Language
• Cascading, Hive, Pig
• Distributed execution engine
• Extensible model (e.g., Dryad)
• Low-latency
• Fault tolerant
15. Extensibility
Nested data formats
• Pluggable model
• Column-based (ColumnIO/Dremel, Trevni, RCFile)
• Row-based (RecordIO, Avro, JSON, CSV)
• Schema (Protocol Buffers, Avro, CSV)
• Schema-less (JSON, BSON)
Scalable data sources
• Pluggable model
• Hadoop
• HBase
16. Drill Architecture
[Diagram: query pipeline]
Client: Driver
Cluster: Parser, Compiler, Execution Engine, Data Source
Driver → Query (text) → Parser → AST (text) → Compiler → Plan (text) → Execution Engine → API → Data Source
Public interfaces enable extensibility
– Add a new query language by implementing a parser
– Add a new data source by implementing an API
– Provide a plan directly to the execution engine to control execution
Each level of the plan has a human readable representation
– Facilitates debugging and development
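The extension points listed above can be sketched in a few lines of Java. Everything here is hypothetical and invented for illustration (QueryParser, DataSource, Plan, Engine, and the toy "table:substring" query language); it is not the Drill API. The sketch only shows the shape of the idea: parsers and data sources plug in behind interfaces, a plan can be handed to the engine directly, and the plan prints in a readable form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All names here are hypothetical; this is not the Drill API.
class PluggableEngineSketch {
    /** Extension point: one implementation per query language. */
    interface QueryParser {
        Plan parse(String queryText);
    }

    /** Extension point: one implementation per data source. */
    interface DataSource {
        List<String> scan(String table);
    }

    /** Toy plan: scan a table, keep rows containing a substring. */
    static final class Plan {
        final String table;
        final String mustContain;
        Plan(String table, String mustContain) {
            this.table = table;
            this.mustContain = mustContain;
        }
        /** Human-readable form, useful when debugging. */
        @Override public String toString() {
            return "scan(" + table + ") -> filter(contains '" + mustContain + "')";
        }
    }

    static final class Engine {
        private final Map<String, QueryParser> parsers = new HashMap<>();
        private final DataSource source;
        Engine(DataSource source) {
            this.source = source;
        }

        void registerParser(String language, QueryParser parser) {
            parsers.put(language, parser);
        }

        /** Front door: query text in a registered language. */
        List<String> run(String language, String queryText) {
            QueryParser parser = parsers.get(language);
            if (parser == null) {
                throw new IllegalArgumentException("no parser for " + language);
            }
            return execute(parser.parse(queryText));
        }

        /** Side door: callers may hand a plan straight to the engine. */
        List<String> execute(Plan plan) {
            List<String> out = new ArrayList<>();
            for (String row : source.scan(plan.table)) {
                if (row.contains(plan.mustContain)) out.add(row);
            }
            return out;
        }
    }

    /** Builds an engine with an in-memory source and the toy query language. */
    static Engine demoEngine() {
        final Map<String, List<String>> tables = new HashMap<>();
        tables.put("logs", Arrays.asList("error: disk full", "info: started", "error: timeout"));
        Engine engine = new Engine(new DataSource() {
            @Override public List<String> scan(String table) {
                List<String> rows = tables.get(table);
                return rows == null ? new ArrayList<String>() : rows;
            }
        });
        engine.registerParser("toy", new QueryParser() {
            @Override public Plan parse(String queryText) {
                String[] parts = queryText.split(":", 2);
                return new Plan(parts[0], parts.length > 1 ? parts[1] : "");
            }
        });
        return engine;
    }

    public static void main(String[] args) {
        Engine engine = demoEngine();
        System.out.println(engine.run("toy", "logs:error"));
        System.out.println(engine.execute(new Plan("logs", "info")));
    }
}
```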
17. Drill Architecture (2)
[Diagram; layout not recoverable. Components shown:]
• DrQL Clients, Driver
• Drill Query Servers: DrQL Parser, Compiler
• Cascading/Pig/... Clients, Driver, Other Parser
• Intermediate Parser
• Drill Workers (several)
18. Query Components
• Query components:
• SELECT
• FROM
• WHERE
• GROUP BY
• HAVING
• JOIN
• Key logical operators:
• Scan
• Filter
• Aggregate
• Join
19. Scan Operators
• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported
• Fields and predicates can be pushed down into the scan operator
• Scan operators may have adaptive side-effects (database cracking)
• Produce ColumnIO from RecordIO
• Google PowerDrill stores materialized expressions with the data
                         Scan with schema                            Scan without schema
Operator output          Protocol Buffers                            JSON-like (MessagePack)
Supported data formats   ColumnIO (column-based protobuf/Dremel)     JSON
                         RecordIO (row-based protobuf)               HBase
                         CSV
SELECT … FROM …          ColumnIO(proto URI, data URI)               Json(data URI)
                         RecordIO(proto URI, data URI)               HBase(table name)
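As a rough illustration of pushing fields and predicates down into a scan operator, the Java sketch below filters and projects rows while reading CSV text, so rejected rows and unneeded columns never leave the scan. The class and method names (ScanPushdownSketch, scanCsv) are hypothetical, and the CSV handling assumes a header line and well-formed rows without quoted commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical scan operator; not Drill code.
class ScanPushdownSketch {
    /**
     * Scans CSV text with a header line.
     * fields:    the columns the query needs (projection pushed into the scan)
     * predicate: the row filter (predicate pushed into the scan)
     */
    static List<Map<String, String>> scanCsv(String csv,
                                             List<String> fields,
                                             Predicate<Map<String, String>> predicate) {
        String[] lines = csv.split("\n");
        String[] header = lines[0].split(",");
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length; c++) {
                row.put(header[c], cells[c]);
            }
            if (!predicate.test(row)) {
                continue; // rejected inside the scan, never emitted
            }
            Map<String, String> projected = new LinkedHashMap<>();
            for (String f : fields) {
                projected.put(f, row.get(f)); // only requested columns are emitted
            }
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        String csv = "id,url,lang\n10,http://A,en\n20,ftp://B,de\n30,http://C,fr";
        System.out.println(scanCsv(csv, Arrays.asList("id", "lang"),
            row -> row.get("url").startsWith("http")));
    }
}
```

A column-based format could go further and skip reading the unrequested columns altogether; this row-based sketch still parses every cell.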
20. Execution Engine Layers
• Drill execution engine has two layers
• Operator layer is serialization-aware
• Processes individual records
• Execution layer is not serialization-aware
• Processes batches of records (blobs)
• Responsible for communication, dependencies and fault tolerance
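One way to picture the two layers is the Java sketch below. It is a hypothetical illustration, not Drill's design: the names (Stage, operator, run) and the newline-delimited UTF-8 encoding are assumptions. The operator layer decodes a batch and works record by record; the execution layer only moves opaque byte arrays between stages. Communication, dependency tracking and fault tolerance are not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; names and the batch encoding are assumptions.
class EngineLayersSketch {
    /** A stage as the execution layer sees it: blob in, blob out. */
    interface Stage {
        byte[] apply(byte[] batch);
    }

    /** Encodes records as newline-delimited UTF-8 (records must be non-empty, no newlines). */
    static byte[] encode(List<String> records) {
        return String.join("\n", records).getBytes(StandardCharsets.UTF_8);
    }

    static List<String> decode(byte[] batch) {
        if (batch.length == 0) {
            return new ArrayList<>();
        }
        return Arrays.asList(new String(batch, StandardCharsets.UTF_8).split("\n"));
    }

    /**
     * Operator layer: serialization-aware, handles one record at a time.
     * A null result from perRecord drops the record.
     */
    static Stage operator(final Function<String, String> perRecord) {
        return new Stage() {
            @Override public byte[] apply(byte[] batch) {
                List<String> out = new ArrayList<>();
                for (String record : decode(batch)) {
                    String result = perRecord.apply(record);
                    if (result != null) out.add(result);
                }
                return encode(out);
            }
        };
    }

    /** Execution layer: pushes opaque batches through the stages without decoding them. */
    static List<byte[]> run(List<byte[]> batches, List<Stage> stages) {
        List<byte[]> results = new ArrayList<>();
        for (byte[] batch : batches) {
            byte[] current = batch;
            for (Stage stage : stages) {
                current = stage.apply(current);
            }
            results.add(current);
        }
        return results;
    }

    public static void main(String[] args) {
        List<byte[]> batches = Arrays.asList(
            encode(Arrays.asList("a", "bb")), encode(Arrays.asList("ccc")));
        List<Stage> stages = Arrays.asList(
            operator(s -> s.toUpperCase()),
            operator(s -> s.length() < 2 ? null : s));
        for (byte[] blob : run(batches, stages)) {
            System.out.println(decode(blob));
        }
    }
}
```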
21. Hadoop Integration
• Hadoop data sources
• Hadoop FileSystem API (HDFS/MapR-FS)
• HBase
• Hadoop data formats
• Apache Avro
• RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN
23. Momentum
Over 200 people on the Drill mailing list
Over 200 members of the Bay Area Drill User Group
Over 100 participants at the first meetup in Sunnyvale, CA
• MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …
Drill meetups across the US and Europe
OpenDremel team and source code merged with Apache Drill
Simba Technologies (ODBC inventor) is developing a Drill ODBC driver
• Tableau, MicroStrategy, Excel, SAP Crystal Reports, …
27. Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher-level abstraction than message passing
“Just works”
29. Streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Unbounded sequence of tuples
30. Spouts
Source of streams
31. Spouts
public interface ISpout extends Serializable {
    void open(Map conf,
              TopologyContext context,
              SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}
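The ack and fail callbacks are what make guaranteed processing possible: a spout can remember each message it emitted until the message is acked, and replay it on failure. The plain-Java sketch below imitates that contract without depending on Storm. ReplayingSourceSketch and its methods are hypothetical; in particular, nextTuple here returns the message id and appends to a list, whereas a real spout emits through its SpoutOutputCollector.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a reliable spout; does not use the Storm API.
class ReplayingSourceSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<Object, String> pending = new HashMap<>();
    private long nextId = 0;

    ReplayingSourceSketch(List<String> messages) {
        queue.addAll(messages);
    }

    /** Emits one message into sink and returns its id, or null when nothing is waiting. */
    Long nextTuple(List<String> sink) {
        String msg = queue.poll();
        if (msg == null) {
            return null;
        }
        Long id = nextId++;
        pending.put(id, msg); // remember the message until it is acked
        sink.add(msg);
        return id;
    }

    /** The message was fully processed: forget it. */
    void ack(Object msgId) {
        pending.remove(msgId);
    }

    /** Processing failed: put the message back so it is emitted again. */
    void fail(Object msgId) {
        String msg = pending.remove(msgId);
        if (msg != null) {
            queue.addLast(msg);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ReplayingSourceSketch source = new ReplayingSourceSketch(Arrays.asList("a", "b"));
        List<String> sink = new ArrayList<>();
        Long first = source.nextTuple(sink);
        Long second = source.nextTuple(sink);
        source.fail(first);   // "a" will be replayed
        source.ack(second);
        source.ack(source.nextTuple(sink));
        System.out.println(sink + ", pending=" + source.pendingCount());
    }
}
```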
32. Bolts
Tuple Tuple Tuple Tuple
Processes input streams and produces new streams
33. Bolts
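import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Storm 0.8-era packages; Storm 1.0+ moved them to org.apache.storm.*
public class DoubleAndTripleBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }
    @Override
    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Anchor the new tuple to its input, then ack the input.
        _collector.emit(input, new Values(val * 2, val * 3));
        _collector.ack(input);
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}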
34. Topologies
Network of spouts and bolts
36. Trident
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"),
              new Split(),
              new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Count(),
                             new Fields("count"))
        .parallelismHint(6);
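For readers new to Trident, the plain-Java sketch below shows what the pipeline above computes: split each sentence into words, group by word, and count. It uses only the standard library and a hypothetical class name (WordCountSketch). Unlike Trident, it processes a finite list in one pass, with no batching, parallelism or persistent state.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the word count; does not use the Storm API.
class WordCountSketch {
    static Map<String, Long> countWords(Iterable<String> sentences) {
        Map<String, Long> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // each(... new Split() ...): one "word" per whitespace-separated token
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // groupBy("word") followed by Count: one running total per word
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("the cow jumped", "the man went")));
    }
}
```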
38. Spouts
Kafka (with transactions)
Kestrel
JMS
AMQP
40. Storm
[Diagram; layout not recoverable. Labels shown: Apps, Raw Data, Queue, Storm (realtime processes), Parallel Cluster Ingest, Hadoop (batch processes), Business Value]
41. Storm
[Diagram; layout not recoverable. Labels shown: Apps, Raw Data, Queue, TailSpout, Storm (realtime processes), Georg, Hadoop (batch processes), Business Value]
43. Get Involved!
• Slides
• http://slideshare.net/boorad/phillydb
• Join the Apache Drill mailing list
• drill-dev-subscribe@incubator.apache.org
• Watch TailSpout & Georg development
• https://github.com/{tdunning | boorad | rlankenau}/mapr-spout
• Join MapR
• jobs@mapr.com
• banderson@maprtech.com
• @boorad